Checksums on files / reliable up/download

tobiasKaminsky commented 5 years ago

To have reliable up- and downloads, generating a checksum on server is needed.

Upload

client generates checksum
client uploads file with checksum
server checks checksum and stores it after successfull upload

Download

client retrieves file and checksum
client checks checksum

On backward compatibility:

Upload:

if no checksum is provided, e.g. legacy apps / 3rd party apps -> generate checksum on server

Download:

legacy clients can simply ignore checksum

Additional for Android Files we can check if files are already uploaded (despite on relying on file names).

(this was shortly discussed in NC15 planning meeting)

Additionally, the ability to request checksum of file via endpoint from https://github.com/nextcloud/server/issues/25949:

client detects that file with same name already exists
we do not know if it is the same file
client requests only for this file via new(?) endpoint a checksum
client compares it
only if this differs, it presents conflict resolution dialog

nextcloud-bot commented 5 years ago

GitMate.io thinks possibly related issues are https://github.com/nextcloud/server/issues/56 (File Drop: Create confirmation / show file checksum), https://github.com/nextcloud/server/issues/3475 (question about file count), https://github.com/nextcloud/server/issues/6129 (When using the REST API to find out checksums of files, a 404 is returned.), https://github.com/nextcloud/server/issues/7629 (Create Folders when moving files), and https://github.com/nextcloud/server/issues/1371 (No file-uploading possible at all).

jospoortvliet commented 5 years ago

I believe the clients checksums files, not sure if it syncs the checksums and/or if the server verifies them. But that's more meant to check if files are transferred correctly.

tobiasKaminsky commented 5 years ago

This should be done on server side, as even if all clients would compute & upload a checksum, the webUI would be missing and then we cannot use this at all.

tobiasKaminsky commented 5 years ago

@icewind1991 we talked about this on NC15 meeting. Do you think this is feasible to get it still in NC15?

MorrisJobke commented 5 years ago

This should be done on server side, as even if all clients would compute & upload a checksum, the webUI would be missing and then we cannot use this at all.

The big problem with this is that it is nearly impossible to generate the checksums for external storages. We can't do anything there and also updating them in a timely manner is nearly out of scope. For a pure "all data goes through NC" this would be no problem, but the external storages are a real problem.

MorrisJobke commented 5 years ago

cc @rullzer @icewind1991 because we discussed quite a lot about this.

tribut commented 5 years ago

FYI the fact that NC does not have this apparently causes owncloud-client to warn when connecting to a nextcloud instance, which is a problem in Debian because there is no nextcloud-client (yet):

https://alioth-lists.debian.net/pipermail/pkg-owncloud-maintainers/2019-January/003498.html

dassio commented 4 years ago

The big problem with this is that it is nearly impossible to generate the checksums for external storages. We can't do anything there and also updating them in a timely manner is nearly out of scope. For a pure "all data goes through NC" this would be no problem, but the external storages are a real problem.

amazon S3 support to store checksum in metadata, we can check for this one, and this does no exist, we simply leave the checksum empty.

amazon s3 md5 checksum : https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/

PVince81 commented 3 years ago

this wouldn't only be for data integrity but also for migration scenarios where the data already exists on the server: https://github.com/nextcloud/desktop/issues/1383

for external storages, simply discard the checksum if the file changed on disk (changed mtime/size) and have it recomputed on the fly during the next streaming opportunity (this is how OC does it)

juliusvonkohout commented 2 years ago

Is there really no one working on this basic and essential feature, that has already been implemented by the competition?

mrmatteastwood commented 2 years ago

Hello NextCloud team, I would really appreciate this feature since I am planning to sync over 500 GB of data between my desktop and laptop. All the data is already present on both computers. I'd very, very much prefer not to download all 500 GB to my laptop again after setting up NC on my desktop, when the local data is already identical.

Is anybody working on this issue? Is there a fix planned?

fabioleite0 commented 2 years ago

Sadly I cannot rely on Nextcloud anymore due to this missing feature. I'm not confident about the company statement: "A safe home for all your data"

robjr commented 2 years ago

It would be great if this feature is added to the roadmap. I'm also moving a good amount of data to NC and I'm a bit worried after I found a few files on the server that had half size of the original.

tomaszduda23 commented 2 years ago

As https://github.com/nextcloud/desktop/issues/2483 shows things can go wrong. Lack of way for data integrity check during synchronization is deal breaker.

swordfish90s commented 2 years ago

Hi, can I +1 on this feature.

using: NC 22.2.3

My NC uses local storage, and was trying to : rclone sync google-drive:folder nc:folder and found that all the files sync'd again. Running : rclone hashsum md5 nc:folder shows "Hash unsupported: hash type not supported"

The thread here suggests that the non-local storage presents a problem, however many of those storage platforms support presenting hashsums. Where this is true, a dynamic call could be made to retrieve it if not already stored in NC. The other suggestions of opportunistic hashsum calculation on next streaming seem reasonable.

It may be an unsupportable case where the storage is non-local AND the file is never streamed via NC. So document it as such, and have the feature available for everyone else. In the case of files added to storages not via NC, then they will need occ files:scan anyway, so have an option to add without hashsum if the storage platform does not support it AND the user doesn't want to stream it in order to calculate it. Default would be "do not stream to calculate hashsum if storage provider does not provide them", but you can opt in.

Best of both worlds.

Once this feature is implemented, it would be good to have a way of adding hashsums to existing files via occ.

Thanks, and looking forward to this feature in NC.

Rob

jospoortvliet commented 2 years ago

For Nc 24 we're likely adding a hashing on the server (I know it's in the plan). But note that the client already does compute, store and check a hash.

I do think some users might have bigger expectations of these hashes than is realistic. We are likely only very rarely going to not sync a file because of the hash - I just wouldn't trust it completely. So it won't save much data transfer. Then the only thing it really does is, in theory, warn when bits are changed during data transfer or storage. It won't FIX those situations, or tell you what the right file is, and it's super unlikely these things happen. And IF they happen, there IS already a checksum on http(s) transfers, so you're actually not adding any more reliability. Maybe a good feeling and a bit more load on the server and client is all you get.

Bugs like the one mentioned, mistakes that cause wrong dates or 0 byte files - these wouldn't be fixed somehow by having these checksums. Not saying the have no benefits, but just saying it's not magic. Honestly I think it would be a good idea to update the topic on top with our actual GOAL: what are we trying to accomplish? Because as it's written now, it seems to be a double-check if a file transfer happened correctly. But the http protocol already does that, so that seems quite pointless.

swordfish90s commented 2 years ago

Hi @jospoortvliet ,

Thanks for the update. Yes, I think the title is a bit off. As per your comments using a hash to confirm data integrity after transfer is not required. The real need is to do file comparison beyond the file size + date stamp + filename check.

There are many syncronisation tools (rclone being one of them) that are able to compare the hash to check the need for data transfer. In my scenario NC would need to support MD5 and SHA1 for greatest compatibility. For example Google Drive only supports MD5. So a direct hash comparison is not possible if only SHA1 is supported in NC in that case.

Note you mentioned the "client already does compute, store and check a hash"..... I am looking for the hash to be exposed via webdav.

Thanks, Rob.

paroj commented 2 years ago

I think I just ran into this.

It won't FIX those situations, or tell you what the right file is, and it's super unlikely these things happen. And IF they happen, there IS already a checksum on http(s) transfers, so you're actually not adding any more reliability.

my use-case is updating the tags of my mp3s. Here, one often uses a "preserve timestamp" option not to mess with the "order by latest" view in various music players - so this specific metadata falls flat.

Furthermore, I removed some tags, which I assume means zeroing out some meta-data frame, so the file-size stays the same also.

Now I would love to have a prompt for a client-server data conflict, so I know that the data on the server is outdated.

I guess that there are more file-formats that are prone to this silent not-being-synced issue.

luzfcb commented 4 months ago

I had the bug https://github.com/nextcloud/android/issues/11974 (The version of the Android App with possible fixes is not yet available when I sent this comment)

For whatever reason, the Android app or Server conflict resolver did something wrong and this is what I got in several of my photos.

It basically replaced the original photo on my smartphone and on the server with a black photo

GlassedSilver commented 4 months ago

I had the bug nextcloud/android#11974 (The version of the Android App with possible fixes is not yet available when I sent this comment)

For whatever reason, the Android app or Server conflict resolver did something wrong and this is what I got in several of my photos.

It basically replaced the original photo on my smartphone and on the server with a black photo

Let me guess, are they 0 Bytes? :)

Happened to a good chunk of mine for some months until I finally clued together what possibly causes this. I had never expected NC to be the culprit, an app that's massively adopted, the base for many organizations' own clouds using NC as a backend with obvious need for mobile workflows.

I completely distrusted my phone's NAND over NC for a while.

The fact I need to do manual conflict resolving on unchanged files, that NC zeroes local files before the server has a proper upload...

I'm puzzled by all of this.

Siress commented 4 months ago

See my feedback here: https://github.com/nextcloud/android/issues/11974#issuecomment-1972564781

"Performance issues are tolerable; data loss is not." It's a shame, because NextCloud looked like a really beautiful, mature technology - but it lacks essential performance for a cloud app.

MelonPie commented 2 months ago

Just had an incident because of the lacking integrity check. If one uses NC to store many files this feature is essential.

JulienFS commented 1 month ago

I'm jumping on the train. I ran some JPEG check script on my NC files and discovered that few % of my uploaded images on a specific period of time that lasted 7 months were corrupted with bunch of 0s. Not fully blank but some zeroed segments, sometimes as big as ~500 bytes. JPEG will be "heal-able" but for some other files it's going to be tough to put the right placeholder in the blank spot (PNGs are tough, PDFs are hell)

I think that the focus on synchronization left us with a risky upload tool. The explanations that "migration would be complicated for big systems" when suggesting to add hashes is a bit infuriating. Feature flag with opt-in, major release with breaking changes, that's not something new to the industry. Plus it somehow closes the door for volunteer work on this matter.

But I'm open to disruptive approaches. Maybe a nextcloud app to gather client's hashes on a voluntary basis and that can check that the files that land on the server are OK is good enough, without having to rely on core synchronous mechanism.

GlassedSilver commented 1 month ago

The explanations that "migration would be complicated for big systems"

Not nearly as complicated as cleaning up after potentially months of hundreds or thousands of people relying on sync working when it was silently corrupting files...

Unrecoverably so because folks trusted Nextcloud instances to run on well-backed up drives and their phones having little storage. Too bad good data never reached Nextcloud under certain circumstances.

Yikes... PR nightmare for EVERYONE involved. With all due respect.

JulienFS commented 1 month ago

I did a first exploration of the code server-side. NC heavily relies on Sabre. There are many app/plugins here and there to tune the default behavior. This is going to take some time to have a global overview of everything related to upload and set a plan (chunked upload, bulk upload...). However this is already obvious that the upload process is complex. Even with a careful review here and now, guaranteeing that the process won't mess with file content, there will still be a lot of room for an application error that could result in a loss of data. The hook nature of the framework's API makes me think that any "crappy" app installed could mess with file upload. Checksums at the right level should offer a guarantee that whatever happens on the application level, data integrity would be preserved. Ideally the mecanism should rely on a part of the API whose behavior is not customizable, as we don't want any bad app to be able to mess with it.

rugk commented 1 month ago

Ideally the mecanism should rely on a part of the API whose behavior is not customizable, as we don't want any bad app to be able to mess with it.

Well what about apps that e.g. want to offer a feature like converting/compressing images upon upload? Or removing metadata or similar features? AFAIK extensibility should allow this and you likely cannot prevent them to do anything bad, you need to trust the apps installed.

GlassedSilver commented 1 month ago

Ideally the mecanism should rely on a part of the API whose behavior is not customizable, as we don't want any bad app to be able to mess with it.

Well what about apps that e.g. want to offer a feature like converting/compressing images upon upload? Or removing metadata or similar features? AFAIK extensibility should allow this and you likely cannot prevent them to do anything bad, you need to trust the apps installed.

The point of this mechanism is to ensure the transmission process is graceful and error-free. Whatever happens afterward, such as automated or manual editing of media stored within Nextcloud should be a completely different aspect. What we don't want however is during the process of transmission some app interfering with the data as it's still in transmission. I think that was @JulienFS's point. (correct me if I misunderstood you)

JulienFS commented 1 month ago

Ideally the mecanism should rely on a part of the API whose behavior is not customizable, as we don't want any bad app to be able to mess with it.

Well what about apps that e.g. want to offer a feature like converting/compressing images upon upload? Or removing metadata or similar features? AFAIK extensibility should allow this and you likely cannot prevent them to do anything bad, you need to trust the apps installed.

I would advocate for the decision to be left to the final user. My reasoning is : bringing hashes to the core upload mechanism might require a costly major rework, doing it opportunistically with lightweight plugin based tweaks might not offer strong enough guarantees for integrity. Having a side mechanism that checks end-to-end integrity might be good enough. In this regard I would say : I'd better like compressing/whatever apps to have to integrate to the integrity app than let them potentially mess up with the integrity of my data. In the same reasoning, I'd rather like to choose between integrity and post-upload modification than not having a reliable integrity/sync check.

Obviously there's not only integrity involved in using hashes : silent file modification or manual efficient resync are also to consider. At this point I'm trying to figure out if there's any easy path to bring an opt-in solution for people who consider these problems more than eventual post-upload modifications by other apps.

I'm not a NC developer, simply a FOSS enthusiast which happens to also be a pro dev. I'm willing to put some effort to solve these things but it might be a big piece of work and I'm unsure if I'll have the resources to achieve it. Finding a workaround here and now would at least offer some of us enough time to build or wait for a core feature without having to leave NC.

Ideally the mecanism should rely on a part of the API whose behavior is not customizable, as we don't want any bad app to be able to mess with it.

Well what about apps that e.g. want to offer a feature like converting/compressing images upon upload? Or removing metadata or similar features? AFAIK extensibility should allow this and you likely cannot prevent them to do anything bad, you need to trust the apps installed.

The point of this mechanism is to ensure the transmission process is graceful and error-free. Whatever happens afterward, such as automated or manual editing of media stored within Nextcloud should be a completely different aspect. What we don't want however is during the process of transmission some app interfering with the data as it's still in transmission. I think that was @JulienFS's point. (correct me if I misunderstood you)

That would be the "end game". Obviously keeping NC's modularity while fixing hash related problems is the ideal goal.

I'll keep investigatin, maybe adding hashes it not that a big deal, but I prefer to be careful as I'm not familiar yet with the code base.

PS : if anyone landing here has some product management skills, feel free to start aggregating things related to the lack of hashes and get me in the dev loop.

JulienFS commented 1 month ago

Update from the rabbit hole. There might be something to do with hashes as there's plenty of event to plug to on the Sabre side. At least I feel confident in the possibility to intercept all file writes with enough context. There's a priority mechanism for the event handling which would allow an app to intercept writes before any other app, considering that other apps are "educated" and don't register for a low priority if not needed. I still need to check if we can monitor registered apps to see if anyone has a higher priority, that would be a great way to at least notice if another app could mess our things up.

I found something disturbing during my little exploration : the file creation and update part on the Sabre side. Basically it uses PHP's php://input stream to get the file body, which is fine. But it uses a simple file_put_contents without even checking the result. Maybe worse than that, it looks that for update of an existing file it also uses plain file_put_contents on the existing file. I have no clue of what would happen is the input file stream was ever to be interrupted, especially since there's no error handling here. But even with error handling we would still have overwritten the existing file. https://github.com/sabre-io/dav/blob/7a736b73bea7040a7b6535fd92ff6de9981e5190/lib/DAV/FS/File.php#L25C9-L25C26 https://github.com/sabre-io/dav/blob/7a736b73bea7040a7b6535fd92ff6de9981e5190/lib/DAV/FS/Directory.php#L46C9-L46C26 ERRATUM : see following message

At this point I can't be really sure that there's no extra mechanism plugged somehow that would make the situation better, so lets not jump to conclusion. However I'm still trying to figure out how some zeros ended in my files, and a stream interruption on an update combined with a zero-filled write buffer would match the pattern.

EDIT : I forgot to mention that the bulk upload feature is not leveraging the same code as Sabre's PUT handler. It's re-implementing on top of some of it. For instance and to my understanding files created with bulk upload don't seem to trigger Sabre events, but NC events instead.

JulienFS commented 1 month ago

After digging a little bit more : it's not Sabre's default file handling that is being used. I missed the CachingTree passed to the default implementation constructor, which is the starting point for NC's style file handling. https://github.com/nextcloud/server/blob/54afea4b01385106a241a5161b3894e669716107/apps/dav/lib/Connector/Sabre/File.php#L139 https://github.com/nextcloud/server/blob/54afea4b01385106a241a5161b3894e669716107/apps/dav/lib/Connector/Sabre/Directory.php#L112

Glad to see that the implementation is more complex than a simple file_put_contents. Few things to note at this point :

there is indeed server side error handling : a decent client implementation should be well informed if something went wrong and act accordingly. For example there is a check on the effective file size
there is a hash mechanism, it the client sends the X-HASH header, it should receive a X-Hash-* like header in the responses (that it can check against)
it still looks like we are writing straight on the existing file, but I need to check the whole path generation thing plus the versionning app before reaching any conclusion

I still need to dig deeper.

Bulk upload needs to be checked to see if it uses the same files as the regular DAV endpoint. At least for what I've seen the hash header is not available for bulk upload.
Also the chunked upload also needs some exploration : clients split big files and send them as chunks. This is not standard HTTP chunked encoding, this is an application level chunking introduced to ease the resume of interrupted upload of large file (from what I've read).

At this point it starts to look clear to me that a well implemented client should be able to do reliable upload, especially since there's an optional hash check mechanism. I also think that reliability could be increased by not writing to the final file directly, this way we would preserve actual content and not rely on a reupload by the client to fix the file (but this part still needs investigation). There's a chance that my data corruption was induced by a combination of a bad client(outdated), bad network, and a write final data before sanity checks. The bad client is a big problem that is going to be hard to solve with only server-side work, as even an end to end check would require the client end to be sane. I'm starting to think, in addition of the rest, to some sort of client user agent header check on the server side. We should probably offer an option to block outdated clients or at least ring an alarm. For example the client shipped on ubuntu 22.04 (which is not that old) is few years old and have some bad behavior.

When it comes to faster resync, it could be handled by supplying a client hash to an extra endpoint/http method . It has been said previously that "(you) wouldn't trust a file to exist based on a hash" but I would personally trust a modern secure hash. Obviously we would be drifting a bit from standard DAV, but as long as it is optional it shouldn't be a problem (and I think chunked transfers is not that standard, same for bulk upload). That would be a reasonable tradeoff between network usage and CPU usage if we don't store the hash. Only people with a backend storage that bills more for read than write might have a problem, but they could simply not opt-in (or opt out). Maybe we would have a problem with HTTP timeouts as checksums for big files can take a lot of time to compute (but there are solutions to that). One thing is sure : I would trust a "live" hash but that would be harder for a stored hash, as it would bring more complexity and one more risk of silent data corruption in case of hash desync.

We could also send a hash from the client, in addition to the write final data after sanity check change: this way we could ensure data integrity before writing the final content to the final path. I know that there's already plenty of CRC/checksum mechanism on the underlying mechanisms (TCP has been cited) but here we could safeguard against application level errors.

JulienFS commented 1 month ago

It looks like bulk upload is not using the DAV file handling but is using the native OC file handling. Here is where I landed by trying to follow the epic thread of hooks and inherited classes and autoloaded stuff : https://github.com/nextcloud/server/blob/54afea4b01385106a241a5161b3894e669716107/lib/private/Files/View.php#L625

So the bulk upload doesn't have some handy stuff from NC's Sabre implementation. But it has its own handy stuff. There's an 'in stream' checksum that happens for each embedded file. The problem is that it operates on the raw stream instead of the actual file data, that is fetched from the stream later. If something gets messed up in the actual fetching of data the previous checksum won't detect it. This is better than nothing as it somehow safeguard against dramatic stream handling problem but it could be enhanced.

The chunked upload is on another path than the regular one and doesn't have the checksum compute/check option. I might be wrong but it looks like it blindly pushes the eventual 'oc-checksum' header to the file info cache without checking it. Thus we could have some checksum already sitting somewhere that could be simply wrong.

I don't really know why there are filesystem related classes for DAV/Sabre and other FS related classes in the OwnCloud lib, and why one is not using the other.

At some point I was affraid that NC drifted from OC and that we ended with a mess, but things look pretty similar upstream. That being said, maybe OC has shifted its resources to it's "infinite scale" backend...

So the first thing would be to ensure that we have a hash coming from the client for the four upload scenarios :

bulk upload : the hash would need to be per embedded file, nice we already have it, it's md5 but its probably good enough as hostile hash forging is not a major threat here
regular upload : probably that a header would fit, there's already something computed, we need an extra header with the expected checksum
chunked upload : we could probably pass the whole file checksum in header for every chunk, so we would be sure that's it's there when reassembling the file
browser upload : it is in fact the same as chunked upload, at least I think so, but the browser is the client

For all these scenarios, we should perform a last minute checksum before writing anything to the final destination. It involves writing aside from the final destination during the upload. It is already the case for chunked upload, obviously. In this later case we would need to read the reassembled file twice : once to get the checksum, then another time to write to the final destination.

Ideally in all these scenarios the final compute checksum should be return to the client, a header would fit. That would be a nice way to check that things didn't go crazy.

Now to be extra safe we would probably need an extra round-trip from the client to the server. We could check that was is written on disk is actually what should be written on disk. It might sound paranoiac to do this extra check but with the level on inherited filesystem class involved during a simple write operation that might be an option.

The endpoint in use for this extra check could also be used for periodical/manually triggered sync check. It could also serve when bootstrapping a synchronized directory to detect already existing files remotely.

Considering the mess that filesystem handling is at this point I would strongly advise against server-side hash storing. Maybe there's opportunistic reliable hash storing with some object storage backend but otherwise it would be way too hard to guarantee that any modification to a file event through vanilla NC would properly update the hash.

Most likely, before any hash storage to be implemented, we should consider refactoring the whole filesystem handling. Anyway "live" hashes should be good enough.

I'm going to take some times to think about all of this then I'm probably going to start writing some code if it fits my schedule. If anyone has some suggestion feel free to share.

nextcloud / server

Checksums on files / reliable up/download #11138