rejetto / hfs

HFS is a web file server to run on your computer. Share folders or even a single file thanks to the virtual file system.
GNU General Public License v3.0

Get md5 checksum #591

Closed made1990 closed 4 months ago

made1990 commented 4 months ago

Is it possible to get the md5 checksum of a file? Either directly after uploading (using PUT) as a return value, or as part of an HTTP GET using the API.

rejetto commented 4 months ago

no such feature yet, but i guess i'll make a plugin soon. to design its features, i'd need you to think if you just need API or also GUI. What do you use these checksums for?

rejetto commented 4 months ago

for the api part, I did some research, and I could add this header for PUT, POST, GET, HEAD: `Digest: md5=...`

one could use HEAD to get the md5 without downloading. Does it sound good?
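As a hypothetical sketch of the client side of this proposal: a helper that parses such a `md5=...` header value. The header name, the hex form, and the function are only illustrations of the proposal above, not a shipped HFS feature (later in this thread the script ends up using an `X-MD5` header instead).

```javascript
// Hypothetical: parse a proposed "Digest: md5=<hex>" header value.
// Returns { algorithm, hash } or null if the value doesn't match.
function parseDigest(value) {
  const m = /^([a-z0-9-]+)=([0-9a-f]+)$/i.exec(value.trim())
  return m ? { algorithm: m[1].toLowerCase(), hash: m[2].toLowerCase() } : null
}
```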

made1990 commented 4 months ago

> for the api part, I did some research, and I could add this header for PUT, POST, GET, HEAD: `Digest: md5=...`
>
> one could use HEAD to get the md5 without downloading. Does it sound good?

That sounds perfect. I only need it via API (PUT, GET), not the GUI. I would need it to verify that uploaded files are 100% identical to the original file.

rejetto commented 4 months ago

would you say you need md5 only for the files you upload? because i'm realizing that to always provide md5 for files for which it was not calculated before seems needlessly heavy. I would also offer it in case you append `?get=md5` to any file's url.

made1990 commented 4 months ago

> would you say you need md5 only for the files you upload? because i'm realizing that to always provide md5 for files for which it was not calculated before seems needlessly heavy. I would also offer it in case you append `?get=md5` to any file's url.

Both cases would be great, but md5 only for newly uploaded files would be enough if that is easier.

rejetto commented 4 months ago

are you willing to use the value just as the upload finishes, or also later?

made1990 commented 4 months ago

> are you willing to use the value just as the upload finishes, or also later?

As the upload finishes is enough. I guess otherwise it will be some overhead to save the information somewhere, am I right?

rejetto commented 4 months ago

some overhead, yes, but not necessarily a problem. i already programmed it but i'm still undecided how to "bundle" it. It would help to have more insight on the need. I'm surprised you are interested in detecting a corruption during an http upload, because as far as i know it's extremely unlikely to happen, and HFS is not subject to interrupted uploads since it sets the final filename only after the end. What do you think about it?

made1990 commented 4 months ago

I agree that hfs and the http protocol itself bring some functionality to prevent faulty uploads or file corruption. Still, the network can be interrupted or similar. I am using HFS in a semi-professional environment and my users ask for a way to ensure the integrity of the files that are uploaded, so the idea came up that the md5 could be returned after the upload finishes, to compare with the original md5.

rejetto commented 4 months ago

i'm willing to offer the functionality, but as i told you, an interrupted upload in HFS will have the word $upload in the name, so you cannot be mistaken

rejetto commented 4 months ago

while i still have to decide how to introduce md5 in HFS, i made it possible for a simple script to do it. The script uses new things that i'm about to publish to read the incoming stream, so that you don't need to re-read the file from disk after the upload is finished, which is especially good if the file is big. I also wrote another script that does the re-reading instead, and published both in the documentation as an example: https://github.com/rejetto/hfs/wiki/Middlewares#calculate-md5-on-uploads

If you are willing to test it, I can give you a preview version, but I need to know if you will run hfs with npx or what operating system.

made1990 commented 4 months ago

Sounds good, a test version would be great. I am running the npm version on Windows (it runs as a service on windows)

rejetto commented 4 months ago

i decided to publish the version in the meantime. the version you need is 0.53.0-alpha5. with npm or npx you need to specify `hfs@beta` instead of just `hfs` to get it.

so, you used these instructions to set up your service?

made1990 commented 4 months ago

> so, you used these instructions to set up your service?

Correct

rejetto commented 4 months ago

i'd like to find an "npx" way of making a service on windows, similarly to linux, so as to make updates easier (just by restarting the service). and don't forget to give me feedback on the md5.

made1990 commented 4 months ago

Yeah, the update process for the windows service version of HFS is a bit inconvenient, but still doable.

Sure, will do testing of the md5 thing next week when I am back at the system :) Just to make sure: the code you documented under https://github.com/rejetto/hfs/wiki/Middlewares#calculate-md5-on-uploads needs to be added to the "server code" part of Options in the Admin GUI, right? And that should do the trick?

rejetto commented 4 months ago

Correct

made1990 commented 4 months ago

Do I need to add something to my PUT command to get the md5 in return?

rejetto commented 4 months ago

nope

made1990 commented 4 months ago

Hm. It simply gives me empty brackets as return: `{}`. My command is: `curl -X PUT https://my-url.com/myfolder/file1.txt -H "Authorization: Basic XXXXX" -d "Content of file"`

rejetto commented 4 months ago

you are looking at the body, while the md5 is in a header

made1990 commented 4 months ago

When I use "Method 1: calculate by reading file after it has been written", the file is written correctly, but an error is returned: `curl: (56) Failure when receiving data from the peer`

When I use "Method 2: processing incoming stream", the file is not even written correctly. It remains in the state with the `hfs$upload` prefix.

rejetto commented 4 months ago

what hfs version are you using?

made1990 commented 4 months ago

0.53.0-alpha5

rejetto commented 4 months ago

ok let me check

rejetto commented 4 months ago

i just tried with your command, and got this using alpha5 and method 1. i'm not sure what's different on your side.

[screenshot]
rejetto commented 4 months ago

do you get the same error WITHOUT the server code?

made1990 commented 4 months ago

Without the server code, everything is working normally.

made1990 commented 4 months ago

[screenshot]

rejetto commented 4 months ago
  1. does it break only the upload and the rest is working?
  2. did you copy the script without any change?
  3. are you accessing hfs directly or through a proxy?
  4. i see port 443. Does that happen with simple http?
  5. is there anything interesting in hfs console?
  6. does the request appear in the log?
made1990 commented 4 months ago
  1. GET is working normally
  2. yes, no changes to the script
  3. no proxy
  4. error message with http is a bit different: Empty reply from server
  5. yes, some errors (see screenshot), but I am getting the same messages on the console without the server code
  6. neither in access nor error log
rejetto commented 4 months ago

I take it we are doing all these tests with "method 1". It may be confusing to mix results.

  1. i'm realizing i don't have a fallback mechanism for metadata on FAT volumes. All my tests on Windows were done on NTFS. Do you confirm that it is a FAT file system? Anyway, this is not a fatal problem, and I will take care of it asap. Back to the main topic: there are no extra errors caused by the server code, and yet the request is abnormally interrupted. Weird.

Please tell me about the system you are running on: what Windows version, what about the drive, are you in a virtual environment, anything peculiar you can think of.

I'm going to make a test on a Windows machine now.

rejetto commented 4 months ago

My test of method 1 on Windows 11 was successful. The file was written, and I got the 200 reply with "{}" in the body and the X-MD5 header. I'm not sure whether to be glad or sad.

You can run hfs with the "--dev" parameter. That will print a lot more info in the console. See if there's anything printed with the request. It's worth a shot.

And... I'd rather do it myself, but I don't have access to your server. What I'd do is gradually remove lines from the middleware block until the problem disappears; then I'd know that the last lines I removed are related to the problem. So first I would remove these lines, and test:

            return new Promise(res => {
                f.once('end', () => {
                    ctx.set({ 'X-MD5': hasher.digest('hex') })
                    res()
                })
            })

And then remove this, and test again.

            const hasher = createHash('md5')
            f = createReadStream(f)
            f.on('data', x => hasher.update(x))

I expect one of these blocks to be the problem.

made1990 commented 4 months ago

Yes - I am trying method 1

It's an NTFS filesystem on Windows Server 2016. It's a physical server, but in fact it's a virtual filesystem: an application running on Windows virtualizes an NTFS filesystem (CBFS) that HFS is writing to.

If I remove this last block of the code, it already solves the issue, but of course then the md5 is not returned.

            return new Promise(res => {
                f.once('end', () => {
                    ctx.set({ 'X-MD5': hasher.digest('hex') })
                    res()
                })
            })

It is still strange, because the file is successfully written; I can see it on the filesystem and can open it.

rejetto commented 4 months ago

The code you removed is not needed for the upload, just for the md5, so it's not strange that once the problematic code is removed the upload still works. Your feedback was helpful anyway.

your cbfs is not supporting ntfs' "alternate streams" feature. that's preventing it from saving the information about who uploaded the file, no big deal.

it is possible that your cbfs is doing something funny with the md5 code too, as that may explain the difference between my Windows and yours: when I try to read the file, it fails for some reason. I guess that we are getting an error, but that's not handled by the code above. See what happens with this variation:

            return new Promise((resolve, reject) => {
                f.once('end', () => {
                    ctx.set({ 'X-MD5': hasher.digest('hex') })
                    resolve()
                }).on('error', reject)
            })

here i'm both propagating the error and ensuring that the request keeps being served. In case of error you won't get the md5, but the request will work AND we can try to better understand the error.
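The difference between the two variants is easy to demonstrate in isolation: without an `'error'` listener, a failing stream leaves the promise pending forever and the request hangs, whereas wiring `reject` to the `'error'` event lets the caller observe the failure. A stand-alone sketch of that pattern (the helper name is mine):

```javascript
// Resolve when the stream ends, reject if it errors. Without the
// 'error' handler, a failed read leaves this promise pending forever
// and the enclosing HTTP request would never complete.
function settleOnEnd(stream) {
  return new Promise((resolve, reject) => {
    stream.once('end', resolve).on('error', reject)
    stream.resume() // consume data so 'end' can actually fire
  })
}
```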

made1990 commented 4 months ago
            return new Promise((resolve, reject) => {
                f.once('end', () => {
                    ctx.set({ 'X-MD5': hasher.digest('hex') })
                    resolve()
                }).on('error', reject)
            })

HTTP code 200 returned and file written successfully, but without the md5 in return. Console output: `error middleware plugin ENOENT: no such file or directory`

rejetto commented 4 months ago

thanks for your feedback! ok, i think i've got what's going on here. timings are different: while on my system the file already has its final name, it still has the temporary name on yours. i will now see how to solve this.

rejetto commented 4 months ago

ok, it's not a problem in the script. it's a bug in HFS, calling the middleware too early, but only on some occasions. I just made the fix, and it would be wonderful if you could confirm that it's effective for you before i publish it. I made my tests both on mac and windows. this is the binary 0.53-alpha5.5: hfs-windows.zip. Or, if you are running with npm/npx, you need `npm -g update hfs@exp`

rejetto commented 4 months ago

i changed my mind and published. It's alpha6 and you get it as `hfs@beta`: https://github.com/rejetto/hfs/releases/tag/v0.53.0-alpha6. It's actually the same as 5.5, just renamed. Still, your feedback is welcome.

made1990 commented 4 months ago

I can confirm: with alpha6 the code for method 1 is working :) File written correctly, md5 returned correctly.

I'll do some further testing tomorrow (different file sizes, etc.)

Thanks for the great work so far, much appreciated.

rejetto commented 4 months ago

cool! i'm glad we have a better tool now

made1990 commented 4 months ago

Is there some file size limitation when uploading via API? I uploaded a file of 500MB but it is cut off after 250MB and then of course returns the wrong md5. If uploading via GUI, the file is uploaded completely. It doesn't matter whether with or without the middleware code.

I didn't try with the stable HFS version, just tried alpha6 now.

rejetto commented 4 months ago

There's no known limit. All people reporting the same problem eventually solved it by removing a limit on their reverse proxy.

made1990 commented 4 months ago

hm, there is no proxy in between, it's the same subnet. Funny.. when uploading a 4GB file it's also cut at almost the half: the upload finished after around 2GB, with HTTP return code 200.

rejetto commented 4 months ago

You didn't say much about what client you are using

made1990 commented 4 months ago

simple curl

rejetto commented 4 months ago

then i'm going to upload a 500+MB file with curl and see what happens

rejetto commented 4 months ago
[screenshot]

just uploaded 1gb, completely written, and md5 returned. version alpha6. then i made the same test on a remote (not localhost) server over https, with credentials. Completed again.

I don't know what's different on your side. Ensure you use curl like `curl -T file url/`. Consider providing a video of what you are doing, because I may see a clue you are not telling.

rejetto commented 4 months ago

also, consider that uploading via API is not really an alternative way; it's the only way. What the frontend in Chrome does is call the same API that you are calling, and you said it is working fine in that case. You can see the API being called by pressing F12 and then using the "network" tab. Just to clarify things.

made1990 commented 4 months ago

oh wow, with the -T option of curl it works. File uploaded completely, md5 returned. It takes slightly longer than without the code, but that makes sense of course. Before, I used `-d @file` to upload the file, and it seems that is a different behaviour.