@PVince81 I think that might be of interest to you?
@chandon did you enable encryption on the server ?
No, as you can see in my description, the example files were downloaded directly from the server (so they are not encrypted).
A binary comparison shows that the first 0x4000 bytes are corrupted... like a bad "part" included at the beginning of the corrupted files. You can find the same part later in the file.
Verrry weird. This guy also had the exact same issue but with encryption enabled: see https://github.com/owncloud/core/issues/10975#issuecomment-55588660
CC @VincentvgNn
I'm suspecting a concurrency / race condition issue... You could try installing and enabling the "files_locking" app.
Do you have the corrupted file open in any application?
No, it's just a copy/paste from my Mac OS X desktop to the ownCloud folder, that's all. Neither the desktop files nor the destination files are open in any application (I don't have an antivirus tool or apps like that; only ownCloud accesses the files in the background).
I think this could possibly be the client's fault. In 1.7 we avoid uploading files that were changed too recently, but in 1.6 the client could start reading a file for upload while it was still being written to the sync folder.
On the other hand, I think it's odd that the file ends up having the right size and that we consistently see the first 16 KB being replaced by a later 16 KB chunk of the same file. I don't see a reason why the first read() would go wrong in this particular way, but it might happen. It could also be a server bug.
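For illustration, here is a minimal sketch of the "skip files that were changed too recently" idea mentioned above. This is not the actual client code, and the age threshold is a placeholder:

```python
import os
import time

MIN_FILE_AGE_SECONDS = 2  # hypothetical threshold; the real client's value may differ


def safe_to_upload(path, min_age=MIN_FILE_AGE_SECONDS):
    """Return True if the file was not modified within the last `min_age` seconds.

    Sketch only: if a file is still being written to, its mtime will be very
    recent and the upload would be deferred to the next sync run.
    """
    mtime = os.path.getmtime(path)
    return (time.time() - mtime) >= min_age
```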
I'll try to reproduce this locally with the 1.6 branch. @chandon maybe you could try whether you still see the issue with the files_locking ownCloud server plugin installed? And if you can, attempting to reproduce the issue with a 1.7 client would also be appreciated - if it's reproducible there it's much more likely to be a server issue.
I've had the "files_locking" app installed for 2 days now. I've done a test and didn't get any corrupted file (which doesn't mean that the problem is solved...). If we still get any corrupted file with the plugin, I'll let you know in a later comment.
@chandon Thank you!
I've tried it locally with 1.6 (on ext4, with an SSD) using 50 copies of your test_ORIGINAL.pptx, but couldn't get a corrupt file in 10 attempts. I tried again with various levels of I/O and HDD load, but could not produce corrupt data. I also tried with 500 files, and with sleeping 100 ms between each copy.
@chandon Did you observe any more corruption?
@ckamm Yes, I did observe a corruption again, always with the Mac OS X client sync, and with files_locking enabled. Always the first 0x4000 bytes corrupted (corresponding to another block later...). So the "files_locking" app doesn't help...
@chandon Thanks for your efforts! Have you tried with a 1.7 client? Do you have any up- or download rate limits configured?
@ckamm I've got the 1.6.4 client. We will install the 1.7 client. We don't have up/download rate limits configured, nor a proxy.
@chandon Thanks. If this issue is due to the client reading files while they are being written to the sync folder, it should be solved with 1.7. If you can still reproduce it, it's more likely to be a communication or server issue.
@ckamm Thanks, I will give 1.7 a try. If I still get any errors, I'll report them.
I have recently seen a very similar problem using owncloudcmd in our testing framework (https://github.com/cernbox/smashbox/blob/master/lib/test_nplusone.py). The symptoms were very similar: the first 16KB block of the file corrupted (corresponding to the 10th block of that file). However, this is NOT related to the file being updated while syncing, because all the files are updated BEFORE owncloudcmd is run.
And now I am positively sure that I saw this on 1.6.4.
The question is: was there some subtle change from 1.6.3 to 1.6.4 that could trigger this? Nobody has seen it before. Is 1.7 exposed to this too? Could someone do some comparison? Especially between the 1.6.3 and 1.6.4 sources the changes should not be that many...
I already had the problem with 1.6.3... I updated to 1.6.4 to check whether the problem was solved (because of the changelog: "avoid data corruption due to wrong error handling, bug #2280"), but the problem was still there. At this time, I don't have any corrupted file with 1.7 (using it for 6 days now).
That #2280 was a different corruption (one that I discovered on server errors).
On 1.6.3 did you also see the 16KB block issue? I have been running many tests on previous versions and never seen that.
@danimo, @dragotin: Could it be that the Qt HTTP stuff is not fully multithreaded, or that some buffers are reused? I guess you added the Qt-based propagator in 1.6.
@chandon Thanks for letting us know your 1.7.0 experience so far. Even though that is not yet proof, it's already good news :-)
@chandon: on 1.6.3 did you also see the same 16KB block issue?
I have been running many tests on previous versions and never seen that. And last Friday I saw it quite a few times. I thought it may have been related to the 1.6.4 update.
Here is the point: I do continuous testing and put/get literally thousands of files, and this issue is not easily reproducible. So I would really like to understand the root cause.
@moscicki You've seen the corruption for the uploads? Or download?
Always uploads. From our setup I would rather rule out an ownCloud server problem. Maybe the proxy, or a timing issue on the client.
@moscicki And you're just testing with 1.6.4, not 1.7.0 right?
I just rechecked. Last Friday I saw corruption with 1.6.4.
The pattern of my test is quite simple: create 10 files (checksummed), run owncloudcmd to upload them, then in another directory run owncloudcmd to download them and cross-check the checksums.
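Roughly, the loop looks like the following simplified Python sketch of that smashbox-style test (the server URL and directory names are placeholders, credentials and owncloudcmd options are omitted, and the actual test lives in the linked repository):

```python
import hashlib
import os
import subprocess

# Placeholder values; adjust to your own setup.
SERVER_URL = "https://example.com/owncloud/remote.php/webdav/test"
UPLOAD_DIR = "worker0"
DOWNLOAD_DIR = "worker1"


def make_file(directory, size=1024 * 1024):
    """Create a file with random content whose name is the MD5 of that content."""
    data = os.urandom(size)
    name = hashlib.md5(data).hexdigest()
    with open(os.path.join(directory, name), "wb") as f:
        f.write(data)
    return name


def verify(directory, name):
    """Check that a file's content still matches the MD5 checksum in its name."""
    with open(os.path.join(directory, name), "rb") as f:
        return hashlib.md5(f.read()).hexdigest() == name


os.makedirs(UPLOAD_DIR, exist_ok=True)
os.makedirs(DOWNLOAD_DIR, exist_ok=True)
names = [make_file(UPLOAD_DIR) for _ in range(10)]
subprocess.run(["owncloudcmd", UPLOAD_DIR, SERVER_URL], check=True)    # upload
subprocess.run(["owncloudcmd", DOWNLOAD_DIR, SERVER_URL], check=True)  # download
for name in names:
    if not verify(DOWNLOAD_DIR, name):
        print("corrupted:", name)
```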
@moscicki Have you double-checked whether the files are fine on the server storage?
Since I kept the corrupted files (both on the client and on the server), I have now reverified it: the file was corrupted on upload (so on the server I have a corrupted file).
The download worked fine (i.e. propagated the corruption to the other client).
The filename is the MD5 checksum of the content. Both checksums should match; otherwise the file is corrupted.
Uploading client (original uncorrupted file):
28bfb9cb57f85319451f4356c20aa21b test_nplusone-141107-183330/worker0/28bfb9cb57f85319451f4356c20aa21b
Downloading client (downloaded corrupted file):
ed89029183a28f77545d4d938b1007e2 test_nplusone-141107-183330/worker1/28bfb9cb57f85319451f4356c20aa21b
On the server (uploaded corrupted file):
ed89029183a28f77545d4d938b1007e2 eos-smashbox-141107-183330/28bfb9cb57f85319451f4356c20aa21b
@moscicki Yes, exactly the same issue: the first 16KB block is not the original one, but a block you can find later in the file... (so that 16KB block is present twice in the same corrupted file: at the beginning and later...)
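To check other suspect files for this exact pattern (the first 0x4000 bytes appearing again later in the same file), a small diagnostic helper like the following could be used. This is only a sketch for analysis, not part of any ownCloud code:

```python
import sys

BLOCK = 0x4000  # 16 KiB, the corrupted span observed in this thread


def find_duplicated_first_block(path):
    """Return the offsets (as hex strings) at which the file's first 16 KiB
    block reappears later in the same file."""
    with open(path, "rb") as f:
        data = f.read()
    first = data[:BLOCK]
    offsets = []
    pos = data.find(first, BLOCK)
    while pos != -1:
        offsets.append(hex(pos))
        pos = data.find(first, pos + 1)
    return offsets


if __name__ == "__main__":
    print(find_duplicated_first_block(sys.argv[1]))
```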
How far? How big was the total file? Was it split into chunks or not?
Unchunked (just below the chunk limit in 1.6).
Here you have the files, password: abc, link expires tomorrow: https://testbox.cern.ch/public.php?service=files&t=dc0bafd7d1fcda915db9eb8d53205c25
BTW. It appears that in 1.7 the chunk size was changed to 20MB (instead of 10MB in 1.6). Why?
Please note: this is our test machine and it goes up and down. So please download the files or else tell me when you want to download them so I can make sure that machine is up at that moment.
@dragotin, @ckamm, @ogoffart: I got another corruption case, this time with the 1.7 client on RedHat 6. So this is not fixed in 1.7 and it is not specific to Mac OS X.
Strangely enough, the corrupted file size is again close to 10MB, this time: 9830178 bytes.
@chandon: what are the file sizes you have seen corruption with?
@moscicki I think the smallest corrupted file was a JPEG with a size close to 700 KB.
@DeepDiver1975 note to self: maybe usage of file_put_contents in lib/private/files/storage/local.php
Regarding my corruption case with the 1.7 client this morning: I don't think it is related to the original corruption issue. It was a problem on our side. Forgive the noise.
FYI, last weekend I ran over 150K files to try to reproduce the corruption, with a varying number of client threads: 3, 10, 50. So far I have been unable to... :-)
I had the last corruption on 5 Nov from Windows 8.1, and noticed it a few days later when I shared the file via ownCloud with the printing company.
Compare the correct and corrupted file (PDF from Illustrator) here: https://onedrive.live.com/redir?resid=47D005B1CC05A517!2436&authkey=!ACB4ifUxVtfP6_s&ithint=file%2c7z
I am not sure whether the 1.6.4 client was still installed at that time. Now I see 1.7.0 on Win 8.1. If the 1.7 branch of clients fixes the issue, I would like the option to forbid clients with the bug from connecting to the server, forcing users of the buggy client to update immediately. Is it somehow possible to enforce a minimum client version for connecting to the server?
@chandon @moscicki To summarize what we know:
1) Definitely affected: 1.6.3, 1.6.4 on Windows (chandon)
2) Not related to reading a file while it's being written to (moscicki)
3) We haven't seen it on 1.7.0 or non-Windows yet
4) We haven't seen it on chunked files yet
Unfortunately 2 was our best hypothesis and would also have explained why the problem would be solved in 1.7.0.
@brozkeff Your data fits the pattern of this bug perfectly: in your corrupted file the first 16k chunk was replaced with the second. Unfortunately I don't know if a minimum client version enforcement is possible. @dragotin ?
@ckamm Yesterday I did a new test on file corruption. Same data set as in my previous tests: 500 MB, 3000 files, 300 folders. OC server 7.0.3 at a webhost, OC client 1.7.0 where it can work. On Win XP the OC client 1.7.0 does not work and there is a problem with huge log files, see #2497. I have 3 PCs, each with its own account, connected to a shared folder; encryption is on.
Everything has been logged: 400 MB+ on the 1.7.0 clients, 60 MB on the WinXP 1.6.4 client, and 10 MB on the 7.0.3 server.
Compared to previous tests, things went quite fast. Time needed: about 2 hours instead of 5 hours.
2 types of file corruption were found:
The error message was "The item is not synced because of previous errors: The server did not acknowledge the last chunk. (No e-tag were present)". A retry sync did not solve the problem. With client 1.6.4 I might just have missed the solution for the blacklist retrial (https://github.com/owncloud/mirall/issues/2247). Anyhow, I have never had this kind of corruption before.
Non-completed file 1 is the most difficult to analyse. In the middle of the file path there is a folder named "Pré DVT". I can't get past that point via the DirectAdmin file browser, and my BlueZone FTP program even crashes on it! Via the ownCloud web interface I can find all files beyond that folder.
I have never seen this problem before. The path of the problem file is quite long, but other files, even those with longer names, do not show a problem for OC and were copied to the receiving clients.
Non-completed file 2, named "2.6 PQP template_ver1.doc", is easier to analyse. The original file size is 175 kB, and on the server I see chunks (.part files) of 1x 241 kB and 11x 8 kB.
I'll try to find things back in the log files. My current conclusion:
Latest idea:
Maybe I can use SQLite to remove the blacklisted files from the .csync_journal.db file and see if the upload then succeeds.
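A rough sketch of that idea in Python. Note the assumptions: the journal path is a placeholder, and it assumes the client stores blacklisted items in a table named "blacklist" (verify the schema first, and stop the client before touching the journal):

```python
import sqlite3

# Placeholder path; point this at the sync folder's journal file.
JOURNAL = "/path/to/syncfolder/.csync_journal.db"

conn = sqlite3.connect(JOURNAL)

# Inspect the schema first to confirm the table name before deleting anything.
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())

# Assumption: blacklisted items live in a table named "blacklist".
conn.execute("DELETE FROM blacklist")
conn.commit()
conn.close()
```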
@VincentvgNn Thanks! From your description it sounds like an unrelated issue though, could you make a new ticket for it?
I propose doing future corruption tests on a SLOW link rather than a gigabit LAN :-)
A realistic scenario: Our whole company has a 10 (down) / 1 (up) Mbit aggregated link, shared with a neighbouring company, permanently running two RDP terminal connections to a Windows Server, with 6 people having ownCloud installed and about 7 GB of data now shared this way. Some files are >500 MB videos, but almost half are AI, INDD and exported PDF files from Illustrator and InDesign. When I save a 10 MB AI file and make a PDF export of it, that is immediately ~20 MB of changed data to be re-uploaded, which takes a few minutes. If I quickly change something and save the AI again, the previous version of the file is almost certainly still uploading. So sometimes I cannot save the file at all (Illustrator does not allow it), or it could cause problems. Add to that the fact that at the same time I may be moving dozens of files in another subfolder of the ownCloud-synced directory structure...
@brozkeff moscicki says that this bug is unrelated to changing files while they are being uploaded. For chunked files the client detects local modifications when it begins a chunk and aborts the upload.
That said, there may be a bug with changes to small files during upload not being detected, I'll investigate.
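Conceptually, the per-chunk safeguard described above amounts to re-checking the file's metadata before sending each chunk. A minimal sketch of the idea (not the client's actual code; the names are illustrative):

```python
import os


def file_unchanged(path, expected_mtime, expected_size):
    """Re-stat the file and compare against the values recorded when the sync
    of this file started; if they differ, the chunked upload should be aborted
    and retried on the next sync run."""
    st = os.stat(path)
    return st.st_mtime == expected_mtime and st.st_size == expected_size
```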
@moscicki @chandon Do you have any hints for how to reproduce? I just synced a couple of gigabytes of small-file data with a 1.7 client on Windows 7 and didn't get any corruption...
@ckamm
I'm digging around in the log files.
As soon as I have good results for you and can repeat it, I will make a new ticket for the error message "The item is not synced because of previous errors: The server did not acknowledge the last chunk. (No e-tag were present)". I might post it as a core issue.
@ckamm: please let me also say that we know the corruption is not related to the ownCloud server backend, because I have seen it against our own WebDAV backend (with exactly the same 16KB block symptom as @chandon). As far as I can remember, I think I also saw it on 1.7.0 (but this is not 100% reliable information).
Steps to (possibly) reproduce (this is how I saw this problem): https://github.com/cernbox/smashbox/tree/master/corruption_test
I also have the impression that I saw places with 16K buffers in your code, e.g.: https://github.com/owncloud/mirall/blob/f25d175b5d4494f6c1cd5c81c46cf6627f55c51a/src/libsync/filesystem.cpp
@ckamm: please note this: https://github.com/cernbox/smashbox/blob/master/protocol/checksum.md
@moscicki Thanks for the note about the server. So you have a completely independent compatible backend that got the same bad upload from the client?
Reproducing: I'm somewhat confused: didn't the issue appear only on Windows? Can I get smashbox to work on Windows?
16k buffers: The only place I found was the one you pointed out too - but that's in a file compare function that only gets called for possible conflicts in the first place. That's not it.
Checksums: Sounds good. Since we have no idea about the cause of the corruption, let's at least detect and fix it. Why do you suggest sending the full checksum for chunked uploads instead of a partial one that could be used to validate each chunk?
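For reference, computing per-chunk checksums alongside the whole-file checksum is cheap. A rough sketch of the idea being discussed (the chunk size here is a placeholder, not necessarily what the client uses):

```python
import hashlib

CHUNK_SIZE = 10 * 1024 * 1024  # placeholder; see the chunk-size discussion above


def checksums(path, chunk_size=CHUNK_SIZE):
    """Return (whole-file MD5, list of per-chunk MD5s) for a local file."""
    whole = hashlib.md5()
    per_chunk = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            whole.update(chunk)
            per_chunk.append(hashlib.md5(chunk).hexdigest())
    return whole.hexdigest(), per_chunk
```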
Not sure if related, but on another thread about file corruption there seems to be a relationship with the PHP setting "output_buffering": https://github.com/owncloud/core/issues/10984#issuecomment-65405319
It would be good to know what value you guys used for that one?
@PVince81 Thank you for the heads up. This corruption issue seems to happen during upload though, so I doubt it's related to PHP's output_buffering.
@chandon @moscicki Are you by any chance using mod_fastcgi_handler? A corruption issue with similar symptoms was recently traced back to that: https://github.com/owncloud/core/issues/10984#issuecomment-65544867
@ckamm No, I'm not using mod_fastcgi_handler.
We don't use ownCloud's PHP server at all while seeing this issue. Hence I must conclude that this is a client problem (maybe in combination with the nginx proxy).
@moscicki Could you tell us which Qt version you are using on the problematic machines? Also whether the issue still exists if you use the OWNCLOUD_USE_LEGACY_JOBS export?
Expected behaviour
I create a file in a folder synced with ownCloud and it always uploads a correct (non-corrupted) file.
Actual behaviour
When I create a file in a folder synced with ownCloud, it sometimes uploads a corrupted file. The file has the same size as the original one, but is corrupted. If it can help, I've uploaded 2 corruption examples here: https://www.wetransfer.com/downloads/499390103af226851e2ea96237d5487620141029092303/efdb625623a454be881d013a02c10e3e20141029092303/b2bcb7
The original files were created in the ownCloud folder on purpose to test the bug. The corrupted files were downloaded directly from the server folder /var/www/owncloud/data/test/files.
Possibly the same bug as #1969?
Steps to reproduce
It's a random bug... I pasted 50 files at once into my ownCloud folder, then ran md5sum on the server to find the corrupted files among them.
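To spot the corrupted files after such a mass copy, one can compare checksums between the client folder and the server data directory. A rough sketch (the paths are placeholders, and it assumes both directories are reachable from the same machine; otherwise compare lists of md5sums instead):

```python
import hashlib
import os

# Placeholder paths; adjust to your client sync folder and server data directory.
LOCAL_DIR = "/Users/me/ownCloud/test"
SERVER_DIR = "/var/www/owncloud/data/test/files/test"


def md5_of(path):
    """MD5 checksum of a file's full content."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


for name in sorted(os.listdir(LOCAL_DIR)):
    local = os.path.join(LOCAL_DIR, name)
    remote = os.path.join(SERVER_DIR, name)
    if os.path.isfile(local) and os.path.isfile(remote):
        if md5_of(local) != md5_of(remote):
            print("corrupted on server:", name)
```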
Server configuration
Operating system: Ubuntu 14.04 LTS server
Web server: Apache 2.4.7-1
Database: MySQL 5.5.37-0ubuntu0.14.04.1
PHP version: 5.5.9-1ubuntu4.3
ownCloud version: 7.0.2 (stable)
List of activated apps: default apps (I don't use them)
Client configuration
Client version: 1.6.4 build 1195
Operating system: Mac OS X Mavericks (two of us have the same problem)