owncloud / client

🖥️ Desktop Syncing Client for ownCloud
GNU General Public License v2.0

File corruption #2425

Closed: chandon closed this issue 9 years ago

chandon commented 10 years ago

Expected behaviour

When I create a file in a folder shared with ownCloud, it always uploads a correct (uncorrupted) file.

Actual behaviour

When I create a file in a folder shared with ownCloud, it sometimes uploads a corrupted file. The file has the same size as the original one, but is corrupted. If it helps, I've uploaded 2 corruption examples here: https://www.wetransfer.com/downloads/499390103af226851e2ea96237d5487620141029092303/efdb625623a454be881d013a02c10e3e20141029092303/b2bcb7

The original files were created in the ownCloud folder on purpose to test the bug. The corrupted files were downloaded directly from the server folder /var/www/owncloud/data/test/files.

Possibly the same bug as #1969?

Steps to reproduce

It's a random bug... I mass-pasted 50 files into my ownCloud folder, then ran md5sum on the server to find the corrupted files among them.
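
For reference, a minimal sketch of that kind of check, assuming the local sync folder and a copy of the server data directory are both reachable from one machine (paths are hypothetical, not the actual setup):

```python
import hashlib
from pathlib import Path

def md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

# Hypothetical locations: the local sync folder and the server-side
# data directory mentioned above.
local_dir = Path("~/ownCloud/test").expanduser()
server_dir = Path("/var/www/owncloud/data/test/files")

for local_file in sorted(local_dir.iterdir()):
    server_file = server_dir / local_file.name
    if server_file.is_file() and md5(local_file) != md5(server_file):
        print("corrupted on server:", local_file.name)
```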

Server configuration

Operating system: Ubuntu 14.04 LTS server
Web server: Apache 2.4.7-1
Database: MySQL 5.5.37-0ubuntu0.14.04.1
PHP version: 5.5.9-1ubuntu4.3
ownCloud version: 7.0.2 (stable)

List of activated apps: default apps (I don't use them)

Client configuration

Client version: 1.6.4 build 1195

Operating system: Mac OS X Mavericks (two of us have the same problem)

LukasReschke commented 10 years ago

@PVince81 I think that might be of interest to you?

PVince81 commented 10 years ago

@chandon did you enable encryption on the server ?

chandon commented 10 years ago

No. As you can see in my description, the example files were downloaded directly from the server (so they are not encrypted).

A binary compare shows that the first 0x4000 bytes are corrupted... like a bad "part" included at the beginning of the corrupted files. You can find the same part later in the file.
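
For illustration, a small sketch of such a binary comparison, assuming both the original and the corrupted copy are available locally (the file names are placeholders for the uploaded examples):

```python
# Sketch: check that only the first 0x4000 bytes differ and that the bad
# leading block also appears verbatim later in the corrupted file.
BLOCK = 0x4000  # 16 KiB

with open("test_ORIGINAL.pptx", "rb") as f:
    original = f.read()
with open("test_CORRUPTED.pptx", "rb") as f:
    corrupted = f.read()

print("same size:", len(original) == len(corrupted))
print("tail identical:", original[BLOCK:] == corrupted[BLOCK:])

offset = corrupted.find(corrupted[:BLOCK], BLOCK)
print("bad leading block reappears at:", hex(offset) if offset != -1 else "not found")
```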

PVince81 commented 10 years ago

Verrry weird. This guy also had the exact same issue but with encryption enabled: see https://github.com/owncloud/core/issues/10975#issuecomment-55588660

CC @VincentvgNn

I'm suspecting a concurrency / race condition issue... You could try installing and enabling the "files_locking" app.

guruz commented 10 years ago

Do you have the corrupted file open in any application?

chandon commented 10 years ago

No, it's just a copy/paste from my Mac OS X desktop to the ownCloud folder, that's all. Neither the desktop files nor the destination files are open in any application (I don't have an antivirus tool or apps like that; only ownCloud accesses the files in the background).

ckamm commented 10 years ago

I think this could possibly be the client's fault. In 1.7 we avoid uploading files that were changed too recently, but in 1.6 the following could happen: the client could pick up and start reading a file that was still being written to the sync folder.

On the other hand I think it's odd that the file ends up having the right size and that we consistently see the first 16 KB being replaced by a later 16 KB chunk of the same file. I don't see a reason why the first read() would go wrong in this particular way, but it might happen. It could also be a server bug.

I'll try to reproduce this locally with the 1.6 branch. @chandon maybe you could try whether you still see the issue with the files_locking ownCloud server plugin installed? And if you can, attempting to reproduce the issue with a 1.7 client would also be appreciated - if it's reproducible there it's much more likely to be a server issue.
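
Purely as an illustration of the "changed too recently" guard mentioned above, here is a minimal sketch; the threshold and names are assumptions, not the actual mirall implementation:

```python
import os
import time

# Assumed threshold; the real client's value may differ.
MIN_AGE_SECONDS = 2.0

def changed_too_recently(path):
    """True if the file was modified so recently that it may still be being written."""
    return (time.time() - os.path.getmtime(path)) < MIN_AGE_SECONDS

# Usage sketch: defer such files to the next sync run instead of uploading them.
for path in ["a.pptx", "b.jpg"]:  # hypothetical upload queue
    if os.path.exists(path) and changed_too_recently(path):
        print("deferring recently modified file:", path)
```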

chandon commented 10 years ago

I've had the "files_locking" app installed for 2 days now. I've done a test and didn't get any corrupted files (which doesn't mean the problem is solved...). If we still get any corrupted files with the plugin, I'll let you know in a later comment.

ckamm commented 10 years ago

@chandon Thank you!

I've tried it locally with 1.6 (on ext4, with an SSD) using 50 copies of your test_ORIGINAL.pptx, but couldn't get a corrupt file in 10 attempts. I tried again with various levels of I/O and HDD load, but could not produce corrupt data. I also tried with 500 files, and with sleeping 100 ms between each copy.

ckamm commented 10 years ago

@chandon Did you observe any more corruption?

chandon commented 10 years ago

@ckamm Yes, I did observe corruption again, always with the Mac OS X client sync, and with files_locking enabled. Always the first 0x4000 bytes corrupted (corresponding to another block later...), so the "files_locking" app doesn't help...

ckamm commented 10 years ago

@chandon Thanks for your efforts! Have you tried with a 1.7 client? Do you have any up- or download rate limits configured?

chandon commented 10 years ago

@ckamm I've got the 1.6.4 client. We will install the 1.7 client. We have neither up/download limits nor a proxy configured.

ckamm commented 10 years ago

@chandon Thanks. If this issue is due to the client reading files while they are written to the sync folder, it should be solved with 1.7. If you can still reproduce it, it's more likely to be a communication or server issue

chandon commented 10 years ago

@ckamm Thanks, I will give 1.7 a try. If I still get any errors, I'll report them.

moscicki commented 10 years ago

I have recently seen a very similar problem using owncloudcmd in our testing framework (https://github.com/cernbox/smashbox/blob/master/lib/test_nplusone.py). The symptoms were very similar: the first 16 KB block of the file was corrupted (corresponding to the 10th block of that file). However, this is NOT related to the file being updated while syncing, because all the files are updated BEFORE owncloudcmd is run.

And now I am positively sure that I saw this on 1.6.4.

moscicki commented 10 years ago

The question is: was there some subtle change from 1.6.3 to 1.6.4 that could trigger this? Nobody has seen it before. Is 1.7 exposed to this too? Could someone do a comparison? Especially between the 1.6.3 and 1.6.4 sources the changes should not be that many...

chandon commented 10 years ago

I already had the problem with 1.6.3... I updated to 1.6.4 to check whether the problem was solved (because of the changelog entry "avoid data corruption due to wrong error handling, bug #2280"), but the problem was still there. So far I don't have any corrupted files with 1.7 (I've been using it for 6 days now).

moscicki commented 10 years ago

That #2280 was another corruption (that I discovered on server errors).

On 1.6.3 did you also see the 16KB block issue? I have been running many tests on previous versions and never seen that.

@danimo, @dragotin: Could it be that the Qt HTTP stuff is not fully multithreaded or that some buffers are reused? I guess you added the Qt-based propagator in 1.6.


dragotin commented 10 years ago

@chandon thanks for letting us know your 1.7.0 experience so far. Even though that is not yet proof, it's already good news :-)

moscicki commented 10 years ago

@chandon: on 1.6.3 did you also see the same 16KB block issue?

I have been running many tests on previous versions and never saw it. Then last Friday I saw it quite a few times. I thought it might have been related to the 1.6.4 update.

Here is the point: I do continuous testing and put/get literally thousands of files, and this issue is not easily reproducible. So I would really like to understand the root cause.

@danimo, @dragotin: Could it be that the Qt HTTP stuff is not fully multithreaded or that some buffers are reused? I guess you added the Qt-based propagator in 1.6.

guruz commented 10 years ago

@moscicki You've seen the corruption for the uploads? Or download?

moscicki commented 10 years ago

Always uploads. From our setup I would rather rule out an ownCloud server problem. Maybe the proxy, or a timing issue on the client.


guruz commented 10 years ago

@moscicki And you're just testing with 1.6.4, not 1.7.0 right?

moscicki commented 10 years ago

I just rechecked. Last Friday I saw corruption with 1.6.4

The pattern of my test is quite simple: create 10 files (checksummed), run owncloudcmd to upload them, then in another directory run owncloudcmd to download them and cross-check the checksums.
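
A rough outline of that pattern as a sketch; the WebDAV URL is a placeholder, credentials handling is omitted, and the real test lives in smashbox's test_nplusone.py:

```python
import hashlib
import os
import subprocess

WEBDAV_URL = "https://server.example.com/remote.php/webdav/test"  # assumed URL

def md5(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# 1. Create checksummed files: each file is named after the MD5 of its content.
os.makedirs("worker0", exist_ok=True)
for _ in range(10):
    data = os.urandom(5 * 1024 * 1024)  # 5 MB of random content
    name = hashlib.md5(data).hexdigest()
    with open(os.path.join("worker0", name), "wb") as f:
        f.write(data)

# 2. Upload from one directory, then download into another.
subprocess.check_call(["owncloudcmd", "worker0", WEBDAV_URL])
os.makedirs("worker1", exist_ok=True)
subprocess.check_call(["owncloudcmd", "worker1", WEBDAV_URL])

# 3. Cross-check: every downloaded file must still hash to its own name.
for name in os.listdir("worker1"):
    if md5(os.path.join("worker1", name)) != name:
        print("CORRUPTION:", name)
```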


dragotin commented 10 years ago

@moscicki have you double-checked on the server whether the files are fine on the server storage?

moscicki commented 10 years ago

Since I kept the corrupted files (both on the client and on the server), I have now re-verified it: the file was corrupted on upload (so on the server I have a corrupted file).

The download worked fine (i.e. it propagated the corruption to the other client).

The filenames are the MD5 checksums of the content. Both checksums should match; otherwise there is corruption.

Uploading client (original uncorrupted file):

28bfb9cb57f85319451f4356c20aa21b test_nplusone-141107-183330/worker0/28bfb9cb57f85319451f4356c20aa21b

Downloading client (downloaded corrupted file):

ed89029183a28f77545d4d938b1007e2 test_nplusone-141107-183330/worker1/28bfb9cb57f85319451f4356c20aa21b

On the server (uploaded corrupted file):

ed89029183a28f77545d4d938b1007e2 eos-smashbox-141107-183330/28bfb9cb57f85319451f4356c20aa21b


chandon commented 10 years ago

@moscicki yes, exactly the same issue: the first 16 KB block is not the original one, but a block you can find later in the file... (so the 16 KB block is present twice in the same corrupted file: at the beginning and later...)

ogoffart commented 10 years ago

How far? How big was the total file? Was it split into chunks or not?

moscicki commented 10 years ago

Unchunked (just below the chunk limit in 1.6).

Here you have the files, password: abc, link expires tomorrow: https://testbox.cern.ch/public.php?service=files&t=dc0bafd7d1fcda915db9eb8d53205c25

BTW, it appears that in 1.7 the chunk size was changed to 20 MB (instead of 10 MB in 1.6). Why?

Please note: this is our test machine and it goes up and down. So please download the files or else tell me when you want to download them so I can make sure that machine is up at that moment.

moscicki commented 10 years ago

@dragotin, @ckamm, @ogoffart: I got another corruption case, this time with the 1.7 client on RedHat 6. So this is not fixed in 1.7 and it is not specific to Mac OS X.

Strangely enough, the corrupted file size is again close to 10 MB, this time 9830178 bytes.

@chandon: what are the file sizes you have seen corruption with?

chandon commented 10 years ago

@moscicki I think the smallest corrupted file was a JPEG with a size close to 700 KB.

dragotin commented 10 years ago

@DeepDiver1975 note to self: maybe usage of file_put_contents in lib/private/files/storage/local.php

moscicki commented 10 years ago

My corruption case with the 1.7 client this morning: I don't think it is related to the original corruption issue. It was a problem on our side; forgive the noise.

FYI, last weekend I ran over 150K files to reproduce the corruption, with varying numbers of client threads: 3, 10, 50. So far I have been unable to... :-)

brozkeff commented 10 years ago

I had the last corruption on 5 November from Windows 8.1; I noticed it a few days later when I shared the file via ownCloud with the printing company.

Compare the correct and corrupted file (PDF from Illustrator) here: https://onedrive.live.com/redir?resid=47D005B1CC05A517!2436&authkey=!ACB4ifUxVtfP6_s&ithint=file%2c7z

I am not sure whether the 1.6.4 client was still installed at that time; now I see 1.7.0 on Win 8.1. If the 1.7 branch of clients fixes the issue, I would like the option to forbid clients with the bug from connecting to the server, forcing users running the buggy client to update immediately. Is it somehow possible to enforce a minimum client version for connecting to the server?

ckamm commented 10 years ago

@chandon @moscicki To summarize what we know:

  1. Definitely affected: 1.6.3, 1.6.4 on Windows (chandon)
  2. Not related to reading a file while it's being written to (moscicki)
  3. We haven't seen it on 1.7.0 or non-Windows yet
  4. We haven't seen it on chunked files yet

Unfortunately 2 was our best hypothesis and would also have explained why the problem would be solved in 1.7.0.

@brozkeff Your data fits the pattern of this bug perfectly: in your corrupted file the first 16 KB chunk was replaced with the second. Unfortunately I don't know whether minimum client version enforcement is possible. @dragotin ?

VincentvgNn commented 10 years ago

@ckamm Yesterday I did a new test on file corruption. Same data set as in my previous tests: 500 MB, 3000 files, 300 folders. OC server 7.0.3 at a web host, OC client 1.7.0 where it can work. On Win XP the OC client 1.7.0 does not work and there is a problem with huge log files, see #2497. I have 3 PCs, each with its own account, connected to a shared folder; encryption is on.

  1. PC3: user3 WinXP OC client 1.6.4 got the 500 MB copied at once from a source folder into the shared folder and started uploading.
  2. PC2: user2 Win8 OC client 1.7.0 started downloading.
  3. PC1: user1 Win7 OC client 1.7.0 started downloading.

Everything has been logged: 400+ MB on the 1.7.0 clients, 60 MB on the WinXP 1.6.4 client and 10 MB on the 7.0.3 server.

Compared to previous tests, things went quite fast. Time needed: about 2 hours instead of 5 hours.

2 types of file corruption were found:

  1. 14-18 files with the wrong modification date, mtime = ctime, different files for PC2 and PC1. So not wrong on the server. This is now a server 7.0.3 and a client 1.7.0 problem, the same as in previous versions. See https://github.com/owncloud/mirall/issues/2252.
  2. 2 corrupted non-completed files on the server. The files got on the blacklist. Client error message: "The item is not synced because of previous errors: The server did not acknowledge the last chunk. (No e-tag was present)". A retry sync did not solve the problem. With client 1.6.4 I might just have missed the fix for the blacklist retry, https://github.com/owncloud/mirall/issues/2247. Anyhow, I never had this kind of corruption before. Non-completed file 1 is the most difficult to analyse. In the middle of the file path there is a folder named Pré DVT. I can't pass that point via the DirectAdmin file browser, and my BlueZone ftp program even crashes on it! Via the ownCloud web interface I can find all files beyond that folder. I have never seen this problem before. The path of the problem file is quite long, but other files, even with longer names, do not show a problem for OC and were copied to the receiving clients. Non-completed file 2, named 2.6 PQP template_ver1.doc, is easier to analyse. The original file size is 175 kB and on the server I see chunks (.part files) of 1x 241 kB and 11x 8 kB. I'll try to find things back in the log files.

My current conclusion:

Latest idea: maybe I can use SQLite to remove the blacklisted files from the .csync_journal.db file and see if the upload then succeeds.
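
A cautious sketch of that idea; the table and column names are assumptions and would have to be checked against the journal's actual schema first, since it may differ between client versions:

```python
import sqlite3

# Inspect the journal before touching anything.
conn = sqlite3.connect(".csync_journal.db")
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print("tables in journal:", tables)

# Once the right table is known (e.g. a blacklist/error table), something like:
# conn.execute("DELETE FROM blacklist WHERE path LIKE ?", ("%template_ver1.doc%",))
# conn.commit()
conn.close()
```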

ckamm commented 10 years ago

@VincentvgNn Thanks! From your description it sounds like an unrelated issue, though. Could you make a new ticket for it?

brozkeff commented 10 years ago

I propose doing future file corruption tests on a SLOW link and not a gigabit LAN :-)

A realistic scenario: our whole company has a 10 Mbit (down) / 1 Mbit (up) aggregated link, shared with a neighbouring company, with two permanently open RDP terminal connections to a Windows Server and 6 people who have ownCloud installed; about 7 GB of data are now shared this way. Some of it is >500 MB video files, but almost half is AI, INDD and exported PDF files from Illustrator and InDesign. When I save a 10 MB AI file and make a PDF export of it, that is immediately ~20 MB of changed data to be re-uploaded, which takes a few minutes. If I quickly change something and save the AI again, the previous version of the file is almost certainly still uploading. So sometimes I cannot save the file at all (Illustrator does not allow it), or it could cause problems. Add to that that at the same time I may be moving dozens of files in another subfolder of the ownCloud-synced directory structure...

ckamm commented 10 years ago

@brozkeff moscicki says that this bug is unrelated to changing files while they are being uploaded. For chunked files the client detects local modifications when it begins a chunk and aborts the upload.
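
Roughly the kind of check meant here, as a hedged sketch rather than the actual client code (names and the abort policy are made up):

```python
import os

def fingerprint(path):
    """Size + mtime snapshot used to notice concurrent local changes."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime)

def upload_chunked(path, send_chunk, chunk_size=20 * 1024 * 1024):
    """Send the file chunk by chunk; abort if it changes underneath us."""
    before = fingerprint(path)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return True   # all chunks sent
            if fingerprint(path) != before:
                return False  # local modification detected: abort and retry later
            send_chunk(chunk)
```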

That said, there may be a bug where changes to small files during upload are not detected; I'll investigate.

@moscicki @chandon Do you have any hints for how to reproduce? I just synced a couple of gigabytes of small-file data with a 1.7 client on Windows 7 and didn't get any corruption...

VincentvgNn commented 10 years ago

@ckamm I'm digging around in the log files. As soon as I have good results for you and can repeat it, I will make a new ticket for the error message "The item is not synced because of previous errors: The server did not acknowledge the last chunk. (No e-tag was present)". I might post it as a core issue.

moscicki commented 10 years ago

@ckamm: please let me also say that we know the corruption was not related to the ownCloud server backend, because I have seen it against our own WebDAV backend (with exactly the same 16 KB block symptom as @chandon). As far as I can remember I think I also saw it on 1.7.0 (but this is not 100% reliable information).

Steps to (possibly) reproduce (this is how I saw this problem): https://github.com/cernbox/smashbox/tree/master/corruption_test

I also have the impression that I saw places with 16K buffers in your code, e.g.: https://github.com/owncloud/mirall/blob/f25d175b5d4494f6c1cd5c81c46cf6627f55c51a/src/libsync/filesystem.cpp

moscicki commented 10 years ago

@ckamm: please note this: https://github.com/cernbox/smashbox/blob/master/protocol/checksum.md

ckamm commented 10 years ago

@moscicki Thanks for the note about the server. So you have a completely independent compatible backend that got the same bad upload from the client?

Reproducing: I'm somewhat confused: didn't the issue appear only on Windows? Can I get smashbox to work on Windows?

16k buffers: The only place I found was the one you pointed out too - but that's in a file compare function that only gets called for possible conflicts in the first place. That's not it.

Checksums: Sounds good. Since we have no idea about the cause of the corruption, let's at least detect and fix it. Why do you suggest sending the full checksum for chunked uploads instead of a partial one that could be used to validate each chunk?
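
To illustrate the two options being discussed, a sketch assuming MD5 and the 20 MB chunk size mentioned earlier (no particular protocol or header format is implied):

```python
import hashlib

CHUNK_SIZE = 20 * 1024 * 1024  # chunk size mentioned above for the 1.7 client

def full_and_chunk_checksums(path):
    """Compute one checksum over the whole file plus one per chunk."""
    full = hashlib.md5()
    per_chunk = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            full.update(chunk)
            per_chunk.append(hashlib.md5(chunk).hexdigest())
    return full.hexdigest(), per_chunk

# The full-file checksum can only be verified after the last chunk has arrived;
# per-chunk checksums would let the server reject a single bad chunk early.
```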

PVince81 commented 10 years ago

Not sure if related, but on another thread about file corruption there seems to be a relationship with the PHP setting "output_buffering": https://github.com/owncloud/core/issues/10984#issuecomment-65405319

It would be good to know what value you guys used for that one?

ckamm commented 10 years ago

@PVince81 Thank you for the heads up. This corruption issue seems to happen during upload though, so I doubt it's related to PHP's output_buffering.

ckamm commented 10 years ago

@chandon @moscicki Are you by any chance using mod_fastcgi_handler? A corruption issue with similar symptoms was recently traced back to that: https://github.com/owncloud/core/issues/10984#issuecomment-65544867

chandon commented 10 years ago

@ckamm No, I'm not using mod_fastcgi_handler.

moscicki commented 9 years ago

We don't use ownCloud's PHP server at all and still see this issue. Hence I must conclude that this is a client problem (maybe in combination with the nginx proxy).

guruz commented 9 years ago

@moscicki Could you tell us which Qt version you are using on the problematic machines? Also, does the issue still exist if you use the OWNCLOUD_USE_LEGACY_JOBS export?