Thanks for the report.
What you are saying makes sense. I haven't checked the code yet, but it looks like hashes are built for all files every time a file is uploaded. Otherwise uploading a file with a few bytes wouldn't take 10s.
Adding a cache would be a nice enhancement. On a technical level there are a few options, but since the server does not use a database, only text files, these options are less performant.
I am not the owner, so I can't make any design decision. I currently also don't have much time to implement any of this. But maybe @orhun can add this feature in one of his live streaming coding sessions. ;-)
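Just to make one of those options concrete, here is a rough sketch of what an in-memory cache could look like. This is not rustypaste's actual code; the struct, the function names, and the use of the `sha2` crate are assumptions. The idea is to hash every existing file once at startup and keep the digests in a `HashMap`, so an upload only hashes the incoming file instead of rescanning the whole upload directory.

```rust
// Sketch only, not rustypaste's implementation. Assumes the `sha2` crate.
use sha2::{Digest, Sha256};
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::{fs, io};

/// Maps a hex content hash to the path of the already stored file.
struct HashCache {
    entries: HashMap<String, PathBuf>,
}

impl HashCache {
    /// Hash every existing upload once at startup (the expensive part),
    /// so later uploads don't have to rescan the whole directory.
    fn build(upload_dir: &Path) -> io::Result<Self> {
        let mut entries = HashMap::new();
        for entry in fs::read_dir(upload_dir)? {
            let path = entry?.path();
            if path.is_file() {
                entries.insert(hash_file(&path)?, path);
            }
        }
        Ok(Self { entries })
    }

    /// On upload: return the existing file if the content is a duplicate,
    /// otherwise record the new file in the cache.
    fn check_or_insert(&mut self, hash: String, path: PathBuf) -> Option<PathBuf> {
        match self.entries.get(&hash) {
            Some(existing) => Some(existing.clone()),
            None => {
                self.entries.insert(hash, path);
                None
            }
        }
    }

    /// On expiry or deletion: drop the entry so the same content can be uploaded again.
    fn remove(&mut self, hash: &str) {
        self.entries.remove(hash);
    }
}

/// Hash a file's content; with the cache in place this is needed only once per file.
fn hash_file(path: &Path) -> io::Result<String> {
    let mut hasher = Sha256::new();
    io::copy(&mut fs::File::open(path)?, &mut hasher)?;
    Ok(hasher.finalize().iter().map(|b| format!("{:02x}", b)).collect())
}
```

The obvious trade-off is that the cache lives only in memory, so it has to be rebuilt on every restart (one full scan) and kept in sync with expirations and deletions.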
I think the implementation should align with rustypaste's initial idea / philosophy.
For example, is it meant to serve thousands, tens of thousands, or millions of files? Up to tens of thousands, storing file hashes in a plain file should be fine. Beyond that, SQLite with an index may be the better choice.
I wrote a simple script to test search speed under different conditions.
Here are the results of searching 100 times for the 4th hash from the start and the 4th hash from the end:

| filename | entries | size | hash position | search time |
|---|---|---|---|---|
| hashes.txt | 1M | 85M | -4 | 13.583s |
| hashes.sqlite | 1M | 92M | -4 | 6.052s |
| hashes_indexed.sqlite | 1M | 163M | -4 | 0.0103s |
| hashes.txt | 1M | 85M | 4 | 0.002s |
| hashes.sqlite | 1M | 92M | 4 | 6.007s |
| hashes_indexed.sqlite | 1M | 163M | 4 | 0.011s |

| filename | entries | size | hash position | search time |
|---|---|---|---|---|
| hashes.txt | 100K | 8.5M | -4 | 1.378s |
| hashes.sqlite | 100K | 9.1M | -4 | 0.641s |
| hashes_indexed.sqlite | 100K | 17M | -4 | 0.010s |
| hashes.txt | 100K | 8.5M | 4 | 0.001s |
| hashes.sqlite | 100K | 9.1M | 4 | 0.655s |
| hashes_indexed.sqlite | 100K | 17M | 4 | 0.010s |

| filename | entries | size | hash position | search time |
|---|---|---|---|---|
| hashes.txt | 10K | 870K | -4 | 0.139s |
| hashes.sqlite | 10K | 920K | -4 | 0.088s |
| hashes_indexed.sqlite | 10K | 1.7M | -4 | 0.010s |
| hashes.txt | 10K | 870K | 4 | 0.001s |
| hashes.sqlite | 10K | 920K | 4 | 0.086s |
| hashes_indexed.sqlite | 10K | 1.7M | 4 | 0.010s |
So the most consistent and position-independent results come from SQLite with an index on the hash column (SQLite without an index is consistent too, just not as fast), but it has the cons @tessus mentioned earlier.
Also, since cache entries not only have to be created, but also updated (uploading another file with the same name) and deleted (expired files and files deleted by the user), it may be easier to handle all of that with SQLite.
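To make the indexed SQLite variant concrete, here is a minimal sketch of what the cache schema and the basic operations could look like. The table and column names are made up for illustration, and the `rusqlite` crate is assumed as the driver; this is not a proposal for rustypaste's actual schema.

```rust
// Sketch only. Assumes the `rusqlite` crate.
use rusqlite::{params, Connection, OptionalExtension};

fn open_cache(path: &str) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS files (
             hash     TEXT NOT NULL,
             filename TEXT NOT NULL
         );
         -- The index on the hash column is what makes lookups position-independent,
         -- as in the hashes_indexed.sqlite rows of the benchmark above.
         CREATE UNIQUE INDEX IF NOT EXISTS idx_files_hash ON files (hash);",
    )?;
    Ok(conn)
}

/// Lookup on upload: returns the stored filename if the hash is already known.
fn find_duplicate(conn: &Connection, hash: &str) -> rusqlite::Result<Option<String>> {
    conn.query_row(
        "SELECT filename FROM files WHERE hash = ?1",
        params![hash],
        |row| row.get(0),
    )
    .optional()
}

/// Cleanup when a file expires or is deleted by the user.
fn forget(conn: &Connection, hash: &str) -> rusqlite::Result<()> {
    conn.execute("DELETE FROM files WHERE hash = ?1", params![hash])?;
    Ok(())
}
```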
> but updated (upload another file with same name)
Nope. Files with the same name are rejected by the server.
> deleted (expired files and files deleted by user)
Yep, you are correct.
I love sqlite for clients in a single-user env. I am not so happy with sqlite on a server though. I have never run perf tests, but an insert requires a lock, and I am not sure what isolation levels sqlite supports. What I am trying to say is that the bottleneck for uploading could then be the max transaction rate for inserts, since there could be multiple threads trying to insert data. However, I doubt that there are thousands of users trying to upload thousands of files (at the same time). Afaik rustypaste was not designed to handle that kind of traffic.

Additionally, inserting the hash could be done async. For the upload itself only the calculated hash is important, and reads can be done without a lock if done via an uncommitted read. (But that means there could be duplicate files in certain situations.)

Then again, one could also just insert the hash. If it fails, it means the file exists already. No reads necessary at all. But then we are bound by insert perf.

A lot of possible ways to implement this.... ;-) And as always there are trade-offs.
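For the "just insert the hash" approach, a sketch of what it could look like, assuming the `rusqlite` crate and a UNIQUE index on the hash column as in the sketch above (again, names are only for illustration):

```rust
// Sketch only: treat a constraint violation on insert as "this content already exists".
use rusqlite::{params, Connection, ErrorCode};

enum Dedup {
    New,
    AlreadyExists,
}

fn try_register(conn: &Connection, hash: &str, filename: &str) -> rusqlite::Result<Dedup> {
    match conn.execute(
        "INSERT INTO files (hash, filename) VALUES (?1, ?2)",
        params![hash, filename],
    ) {
        Ok(_) => Ok(Dedup::New),
        // The UNIQUE index rejects a second row with the same hash, so a
        // constraint violation means the file is already stored: no read needed.
        Err(rusqlite::Error::SqliteFailure(e, _)) if e.code == ErrorCode::ConstraintViolation => {
            Ok(Dedup::AlreadyExists)
        }
        Err(e) => Err(e),
    }
}
```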
Anyway, let's see what orhun thinks about this.
Hi! Thank you for the great tool ❤️ I started to use it recently, migrating from another pastebin, and as far as I can see, rustypaste with `duplicate_files = false` tries to hash every file in the upload folder to check whether the file being uploaded already exists.

Here is some info about my existing uploads:

Uploaded file:

Uploading time with `duplicate_files = true`:

Uploading time with `duplicate_files = false`:

I've added some random large files with `dd if=/dev/urandom of=largefile bs=1M count=...` and summarized the results in a table:

Upload time mostly depends on the total size of the files; the file count, unless it reaches a few million, should not have a drastic impact.

I think this is a really great feature, but with the current implementation upload time grows as file size and count increase, so maybe a simple cache mechanism, like storing file hashes in memory or in a file, is worth implementing.
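For the "hashes in a file" option, a rough sketch of what it could look like (the cache file name and the line format here are assumptions, not rustypaste's behaviour): one `<hash> <filename>` line per stored file, scanned on upload and appended to afterwards.

```rust
// Sketch only: plain-text hash cache, one "<hash> <filename>" line per upload.
use std::fs::{File, OpenOptions};
use std::io::{BufRead, BufReader, Write};
use std::path::Path;

/// Scan the cache file for the hash. This is O(n) in the number of entries,
/// which should be fine up to tens of thousands of files.
fn find_duplicate(cache: &Path, hash: &str) -> std::io::Result<Option<String>> {
    if !cache.exists() {
        return Ok(None);
    }
    for line in BufReader::new(File::open(cache)?).lines() {
        let line = line?;
        if let Some((h, name)) = line.split_once(' ') {
            if h == hash {
                return Ok(Some(name.to_string()));
            }
        }
    }
    Ok(None)
}

/// Append a new entry after a successful upload.
fn remember(cache: &Path, hash: &str, filename: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(cache)?;
    writeln!(file, "{hash} {filename}")
}
```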
rustypaste version: built from source, bf6dd31cb94dc3da311e7aee041c36f7b58e5123
os: arch linux
kernel: 6.10.10-arch1-1
processor: Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz
Unfortunately I have no experience with Rust, so I can only help with testing and debugging :)