orhun / rustypaste

A minimal file upload/pastebin service.
https://blog.orhun.dev/blazingly-fast-file-sharing
MIT License

Upload goes too long with duplicate_files = false. Maybe add hashes cache? #367

Open vvrein opened 1 week ago

vvrein commented 1 week ago

Hi! Thank you for the great tool ❤️ I started using it recently, migrating from another pastebin, and as far as I can see, rustypaste with duplicate_files = false hashes every file in the upload folder to check whether the file being uploaded already exists.
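To illustrate the cost (a rough sketch only, not rustypaste's actual code): without a cache, every stored file has to be read and hashed on each upload, so the check scales with the total size of the upload folder.

```python
# Illustrative sketch only -- not rustypaste's actual implementation.
# Without a cache, a duplicate check has to read and hash every stored file,
# so its cost grows with the total size of the upload folder.
import hashlib
import os


def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate(upload_dir, new_file_hash):
    """Return the name of an already stored file with the same content, if any."""
    for entry in os.scandir(upload_dir):
        if entry.is_file() and sha256_of(entry.path) == new_file_hash:
            return entry.name
    return None
```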

Here is info about my existing uploads:

# ls -lha data/files/ | wc -l
727
# du -hs data/files/
3.9G    data/files/

Uploaded file:

$ ls -lha favicon-152.png
-rw-r--r-- 1 vrein vrein 7.7K Dec 20  2020 favicon-152.png

Uploading time with duplicate_files = true:

$ time curl http://127.0.0.1:8880/ -F "file=@favicon-152.png"
http://127.0.0.1:8880/favicon-152.AqlQ6JKAp9.png

real    0m0.010s
user    0m0.006s
sys     0m0.003s

Uploading time with duplicate_files = false:

$ time curl http://127.0.0.1:8880/ -F "file=@favicon-152.png"
http://127.0.0.1:8880/favicon-152.AqlQ6JKAp9.png

real    0m10.411s
user    0m0.007s
sys     0m0.003s

I've added some random large files with dd if=/dev/urandom of=largefile bs=1M count=... and summarized the results in a table:

| total files count | total files size | uploading time |
| --- | --- | --- |
| 727 | 3.9G | 10.411s |
| 728 (+1 1G) | 4.9G | 13.137s |
| 729 (+1x2G +1x1G) | 6.9G | 19.254s |
| 3730 (+3k 1M) | 6.9G | 18.403s |

Upload time depends mostly on the total size of the files; the file count should not have a drastic impact unless it reaches a few million.

I think this is a really great feature, but with the current implementation the upload time grows as the total file size and count increase, so it may be worth adding a simple cache mechanism, such as storing the file hashes in memory or in a file.
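Something like the following could work (a minimal sketch only; the names are hypothetical, not rustypaste code): hashes are kept in memory, so an upload only has to hash the new file.

```python
# Hypothetical sketch of an in-memory hash cache -- not rustypaste code.
# Maps content hash -> stored file name; built once at startup, then kept
# in sync as files are uploaded, expire, or get deleted.
class HashCache:
    def __init__(self):
        self.by_hash = {}

    def lookup(self, file_hash):
        """Return the stored file name for this content hash, or None."""
        return self.by_hash.get(file_hash)

    def insert(self, file_hash, file_name):
        """Record a newly stored file."""
        self.by_hash[file_hash] = file_name

    def remove_file(self, file_name):
        """Drop the entry when a file expires or is deleted."""
        self.by_hash = {h: n for h, n in self.by_hash.items() if n != file_name}
```

The trade-off is that the cache has to be rebuilt with one full scan at startup and kept in sync when files expire or are deleted.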

rustypaste version: built from source (bf6dd31cb94dc3da311e7aee041c36f7b58e5123)
os: Arch Linux
kernel: 6.10.10-arch1-1
processor: Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz

Unfortunately I have no experience with Rust, so I can only help with testing and debugging :)

tessus commented 1 week ago

Thanks for the report.

What you are saying makes sense. I haven't checked the code yet, but it looks like hashes are built for all files every time a file is uploaded. Otherwise uploading a file with a few bytes wouldn't take 10s.

Adding a cache would be a nice enhancement. On a technical level there are a few options (a rough sketch of the second one follows this list), but since the server does not use a database, only text files, some of these options are less performant:

1. The hash is stored in extended attributes.
   - Pros
   - Contras
2. Hashes are stored in a text file with a mapping to the filename.
   - Pros
   - Contras
3. Use of SQLite or any other engine that allows indexed or hashed access.
   - Pros
   - Contras
I am not the owner, so I can't make any design decision. I currently also don't have much time to implement any of this. But maybe @orhun can add this feature in one of his live streaming coding sessions. ;-)

vvrein commented 1 week ago

I think the implementation should match rustypaste's initial idea / philosophy.
Like, is it meant to serve thousands, tens of thousands, or millions of files? Up to tens of thousands, storing file hashes in a plain file should be fine. If more, SQLite with an index may be the better choice.

I wrote a simple script to test the search speed under different conditions.

hashes.py

```python
import os
import random
import hashlib
import string
import sqlite3
import shutil
import timeit

HASH_FILE = "hashes.txt"
HASH_DB = "hashes.sqlite"
HASH_BYTES_DB = "hashes_bytes.sqlite"
HASH_INDEXED_DB = "hashes_indexed.sqlite"
FILENAME_LENGTH = 15
HASHES_COUNT = 1 * 10**6


def getrandline():
    rb = random.randbytes(10)
    line = hashlib.sha256(rb).hexdigest() \
        + " " * 5 \
        + ''.join(random.choices(string.ascii_letters, k=FILENAME_LENGTH)) \
        + ".txt"
    return line


# create hash text file with 1M records (85Mb)
if not os.path.exists(HASH_FILE):
    with open(HASH_FILE, 'a') as f:
        for _ in range(0, HASHES_COUNT):
            f.write(getrandline() + "\n")

# store hashes as text in sqlite (92Mb)
if not os.path.exists(HASH_DB):
    con = sqlite3.connect(HASH_DB)
    cur = con.cursor()
    cur.execute("CREATE TABLE hashes(hash, filename)")
    with open(HASH_FILE, "r") as f:
        for line in f:
            _hash, _filename = line.split()
            cur.execute("insert into hashes values (?,?)", (_hash, _filename))
    con.commit()
    con.close()

# try to store hashes as bytes. This does not work, since values are stored as text in sqlite (92Mb);
# the size is the same as in the previous variant
if not os.path.exists(HASH_BYTES_DB):
    con = sqlite3.connect(HASH_BYTES_DB)
    cur = con.cursor()
    cur.execute("CREATE TABLE hashes(hash blob, filename blob)")
    with open(HASH_FILE, "r") as f:
        for line in f:
            _hash, _filename = line.split()
            cur.execute("insert into hashes values (?,?)", (_hash.encode(), _filename.encode()))
    con.commit()
    con.close()

# store hashes in sqlite with an index on hash, gives x1.77 overhead to size (163Mb)
if not os.path.exists(HASH_INDEXED_DB):
    shutil.copy(HASH_DB, HASH_INDEXED_DB)
    con = sqlite3.connect(HASH_INDEXED_DB)
    cur = con.cursor()
    cur.execute("create index hash_idx on hashes (hash)")
    con.commit()
    con.close()

search_in_file = """
def search_in_file(file, search):
    with open(file, "r") as f:
        for line in f:
            if search in line:
                return line
search_in_file(HASH_FILE, searching_for)
"""

search_in_db = """
con = sqlite3.connect(HASH_DB)
def search_in_db(con, search):
    cur = con.cursor()
    res = cur.execute("select * from hashes where hash = ?", (search,))
    return res.fetchone()
search_in_db(con, searching_for)
con.close()
"""

search_in_indexed_db = """
con = sqlite3.connect(HASH_INDEXED_DB)
def search_in_db(con, search):
    cur = con.cursor()
    res = cur.execute("select * from hashes where hash = ?", (search,))
    return res.fetchone()
search_in_db(con, searching_for)
con.close()
"""

searching_for = "some_hash_here"
num = 100

print(f"Search for hash {searching_for} {num} times in file " + str(timeit.timeit(search_in_file, number=num, globals=globals())))
print(f"Search for hash {searching_for} {num} times in sqlite db " + str(timeit.timeit(search_in_db, number=num, globals=globals())))
print(f"Search for hash {searching_for} {num} times in sqlite indexed db " + str(timeit.timeit(search_in_indexed_db, number=num, globals=globals())))
```
Here are the results of searching 100 times for the fourth hash and for the fourth-from-the-end hash:

| filename | entries | size | hash position | search time |
| --- | --- | --- | --- | --- |
| hashes.txt | 1M | 85M | -4 | 13.583s |
| hashes.sqlite | 1M | 92M | -4 | 6.052s |
| hashes_indexed.sqlite | 1M | 163M | -4 | 0.0103s |
| hashes.txt | 1M | 85M | 4 | 0.002s |
| hashes.sqlite | 1M | 92M | 4 | 6.007s |
| hashes_indexed.sqlite | 1M | 163M | 4 | 0.011s |

| filename | entries | size | hash position | search time |
| --- | --- | --- | --- | --- |
| hashes.txt | 100K | 8.5M | -4 | 1.378s |
| hashes.sqlite | 100K | 9.1M | -4 | 0.641s |
| hashes_indexed.sqlite | 100K | 17M | -4 | 0.010s |
| hashes.txt | 100K | 8.5M | 4 | 0.001s |
| hashes.sqlite | 100K | 9.1M | 4 | 0.655s |
| hashes_indexed.sqlite | 100K | 17M | 4 | 0.010s |

| filename | entries | size | hash position | search time |
| --- | --- | --- | --- | --- |
| hashes.txt | 10K | 870K | -4 | 0.139s |
| hashes.sqlite | 10K | 920K | -4 | 0.088s |
| hashes_indexed.sqlite | 10K | 1.7M | -4 | 0.010s |
| hashes.txt | 10K | 870K | 4 | 0.001s |
| hashes.sqlite | 10K | 920K | 4 | 0.086s |
| hashes_indexed.sqlite | 10K | 1.7M | 4 | 0.010s |

So the most consistent and position-independent results are obtained with SQLite plus an index on the hash column (SQLite without an index is consistent too, but not as fast), but that approach has the cons @tessus mentioned earlier.

vvrein commented 1 week ago

But considering that cache entries need not only to be created, but also updated (when another file is uploaded with the same name) and deleted (expired files and files deleted by the user), it may be easier to do all of this with SQLite.

tessus commented 1 week ago

> but updated (upload another file with same name)

Nope. Files with the same name will be rejected by the server.

> deleted (expired files and files deleted by user)

Yep, you are correct.

I love SQLite for clients in a single-user environment. I am not so happy with SQLite on a server, though. I have never run perf tests, but an insert requires a lock, and I am not sure what isolation levels SQLite supports. What I am trying to say is that the bottleneck for uploading could then be the maximum transaction rate for inserts, since there could be multiple threads trying to insert data. However, I doubt that there are thousands of users trying to upload thousands of files (at the same time); afaik rustypaste was not designed to handle that kind of traffic.

Additionally, inserting the hash could be done asynchronously, since for the upload itself only the calculated hash is important. Reads can be done without a lock if done via an uncommitted read (but that means there could be duplicate files in certain situations). Then again, one could also just insert the hash: if it fails, the file already exists, so no reads are necessary at all - but then we are bound by insert performance. A lot of possible ways to implement this... ;-) And as always there are trade-offs.
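For illustration, a minimal sketch of the insert-only idea (hypothetical table and names, shown with Python's sqlite3 for brevity): a unique index on the hash makes the insert itself the duplicate check.

```python
# Hypothetical sketch: dedup via the insert itself -- not rustypaste code.
# A UNIQUE index on hash means a duplicate upload simply fails to insert.
import sqlite3


def init(con):
    con.execute("CREATE TABLE IF NOT EXISTS hashes (hash TEXT NOT NULL, filename TEXT NOT NULL)")
    con.execute("CREATE UNIQUE INDEX IF NOT EXISTS hash_uq ON hashes (hash)")
    con.commit()


def try_claim(con, file_hash, file_name):
    """Return True if the hash was new (store the file), False if it already exists."""
    try:
        with con:  # commits on success, rolls back on error
            con.execute("INSERT INTO hashes VALUES (?, ?)", (file_hash, file_name))
        return True
    except sqlite3.IntegrityError:
        return False
```

To return the URL of the existing file on a duplicate, one extra read would still be needed, but only in the conflict case.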

Anyway, let's see what orhun thinks about this.