microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License
17.44k stars 822 forks source link

rsync hangs on particular file #2138

Open Wishmesh opened 7 years ago

Wishmesh commented 7 years ago

Also, strange thing is:

sunjoong commented 7 years ago

@Wishmesh - What name and path name of that particular file? I had a strange experience with file named "ext" and directory named "ext"; My anti-virus blocked that file and directory.

Wishmesh commented 7 years ago

@sunjoong - no. Anti-virus have these folders excluded. Path looks like:

X:\1-prj\some\simple\folders\here\Release_x64\appname.pch
or
X:\1-prj\some\simple\folders\here\Release_x64\appname.pdb

There were other extensions too, but I do not remember of what type they was...

therealkenc commented 7 years ago

Unfortunately I can't repro here. Strange files to be having trouble with. Their size isn't that special, nor should that be a problem anyway. TCP streams are very reliable on WSL these days. Any chance those files are open in something (anything) in Windows? Or special in any way you can think of, versus the other files in your tree?

One thing you might try is to do it old school and see what happens:

tar cpf - /mnt/u/somewhere | (ssh some_name@192.168.99.17 "cd remote_dir; tar xpf -")

That doesn't get you all the rsync goodness, but it will help eliminate filesystem and networking problems. If that works, try rsync without the --delete flag and see if that helps. Otherwise, shrug 🤷.

Wishmesh commented 7 years ago

Any chance those files are open in something (anything) Windows?

No. It stuck on the same file again, and again. Even after reboot. Antivirus disabled for that drive.

Ok. Will try out your suggestions next time it happens.

therealkenc commented 7 years ago

It's not antivirus; and even if it were, those files still aren't very special compared to all the object files you must have in there. I rsync around enormous trees (like chromium) somewhat regularly, but that is on VolFS (/home/me). I rsynced a random Visual Studio project directory DrvFS to DrvFS (granted localhost to localhost) trying to repro just before I posted but it went fine. Anyway, bonne chance.

Wishmesh commented 7 years ago

I have 2 folders (about 10GB): 1) 36193 files and 6142 folders 2) 33734 files and 5975 folders

Nothing extraordinary. And "file being used / not accessible" is less probable, because.... When it stuck, I can copy the same stuck file using the same rsync syntax, but specifying filename -- syncing by one file works... and then after that, again rsync works for some days or weeks without any problem.

therealkenc commented 7 years ago

...for some days or weeks without any problem

Yeah that's going to be a bitch to track. So obviously it's going to work for the one-off test I did. Sigh...

That the fail is sticky even after reboot is very relevant. That means it pretty much has to be a "filesystem thing", but I couldn't speculate what exactly. Appreciate you reporting though, so at least it is on the books.

Wishmesh commented 7 years ago

That means it pretty much has to be a "filesystem thing"

But why rsync then works if I sync the same files one by one? And both - source and destination FS are NTFS, if that matters.

therealkenc commented 7 years ago

But why rsync then works if I sync the same files one by one?

Dunno, but it pretty much has to be a "filesystem thing" by deductive reasoning. Because your DrvFS (NTFS) filesystem is the only thing that can hold the sticky failure state across a reboot. Your sync one-by-one fix is, for whatever reason, getting things back into a sane state. Next time it fails, try deleting the problematic files on the server that you are pushing to and doing a full sync, instead of syncing them one at a time.

Handwaving here, the whole process of deleting files on DrvFS is not straightforward, because delete semantics are different in Unix and Windows. That why the --delete option is a red flag. Or, at least for lack of a better working theory atm.

Wishmesh commented 7 years ago

Yeah that's going to be a bitch to track.

Let us try....

1) Today I tried hard to reproduce and I succeeded


Try 1
some/folders/sub/folders/src/Debug/rc.read.1.tlog
          2,896 100%   10.79kB/s    0:00:00 (xfr#24, ir-chk=1011/41352)
some/folders/sub/folders/src/Debug/rc.write.1.tlog
            250 100%    0.93kB/s    0:00:00 (xfr#25, ir-chk=1010/41352)
some/folders/sub/folders/src/Debug/vc100.idb
          1,200   0%    4.47kB/s    0:05:24

Try2
some/folders/sub/folders/src/Debug_x64/some-name.log
          3,331 100%   28.04kB/s    0:00:00 (xfr#25, ir-chk=1034/41390)
some/folders/sub/folders/src/Debug_x64/some-name.obj
      1,397,284 100%   11.01MB/s    0:00:00 (xfr#26, ir-chk=1033/41390)
some/folders/sub/folders/src/Debug_x64/some-name.pch
         10,640   0%   85.17kB/s    0:05:33

Try3
some/folders/sub/folders/src/Debug_x64/some-name.pch
         10,640   0%    0.00kB/s    0:00:00

2) The strange thing is, it sometimes shows 5+ minutes required transfer time almost immediately.

3) When it stuck at some-name.pch, I checked MD5 of file on client and server -- they differ.

4) When I said that state remains after reboot I somehow forgot about server state. I blamed client. So now I rebooted the server too -- both computers rebooted. So now it is confirmed -- it persists after reboot of both -- server and client.

sending incremental file list
deleting some/folders/sub/folders/src/some-name.opensdf
some/folders/sub/folders/src/
some/folders/sub/folders/src/some-name.sdf
     41,963,520 100%   26.62MB/s    0:00:01 (xfr#1, ir-chk=1025/41327)
some/folders/sub/folders/src/some-name.suo
         25,600 100%   51.02kB/s    0:00:00 (xfr#2, ir-chk=1023/41327)
some/folders/sub/folders/src/Debug_x64/
some/folders/sub/folders/src/Debug_x64/some-name.pch
         32,768   0%   56.04kB/s    0:08:25  ^C
rsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.1]
[sender] io timeout after 60 seconds -- exiting

5) Checked offending file on the server. The file some-name.pch is missing on the server.

6) Trying rsync again. It still stuck on the same file (even it is not on the server):

sending incremental file list
some/folders/sub/folders/src/
some/folders/sub/folders/src/some-name.sdf
     41,963,520 100%  303.08MB/s    0:00:00 (xfr#1, ir-chk=1025/41327)
some/folders/sub/folders/src/some-name.suo
         25,600 100%  187.97kB/s    0:00:00 (xfr#2, ir-chk=1023/41327)
some/folders/sub/folders/src/Debug_x64/some-name.pch
         32,768   0%  233.58kB/s    0:02:01  ^C
rsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.1]
[sender] io timeout after 60 seconds -- exiting

7) Notice that before the some-name.pch file it synced .sdf and .suo file again. I am not deeply familiar with rsync. It writes multiple files at once? Why send .sdf and .suo again, when they were synced last time successfully?

8) Deleting entire folder on the server: some/folders/sub/folders/src/Debug_x64

9) Trying rsync again:

sending incremental file list
some/folders/sub/folders/src/
some/folders/sub/folders/src/some-name.sdf
     41,963,520 100%  305.40MB/s    0:00:00 (xfr#1, ir-chk=1025/41327)
some/folders/sub/folders/src/some-name.suo
         25,600 100%  185.19kB/s    0:00:00 (xfr#2, ir-chk=1023/41327)
some/folders/sub/folders/src/Debug_x64/
some/folders/sub/folders/src/Debug_x64/Base64.obj
         10,358 100%   11.60kB/s    0:00:00 (xfr#3, ir-chk=1045/41389)
some/folders/sub/folders/src/Debug_x64/CL.read.1.tlog
         45,746 100%   12.38MB/s    0:00:00 (xfr#4, ir-chk=1044/41389)
some/folders/sub/folders/src/Debug_x64/CL.write.1.tlog
          3,618 100%  588.87kB/s    0:00:00 (xfr#5, ir-chk=1043/41389)
some/folders/sub/folders/src/Debug_x64/**********.obj
         70,915 100%    1.50MB/s    0:00:00 (xfr#6, ir-chk=1042/41389)
some/folders/sub/folders/src/Debug_x64/*********.obj
         15,243 100%  236.28kB/s    0:00:00 (xfr#7, ir-chk=1041/41389)
some/folders/sub/folders/src/Debug_x64/*********.obj
^Crsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.1]
[sender] io timeout after 60 seconds -- exiting

Now it stuck on the some.obj file.

10) Ctrl+C, try again... again stuck on the same.obj file...

11) Remembered about strace...

strace -o strace.txt -ff rsync with params

Got 2 strace files. Unfortunately I cannot share them here as is... But will provide you with parts I you will need.

First file seems to come from ssh. 178974 bytes long.

dup2(3, 0)                              = 0
close(4)                                = 0
close(5)                                = 0
dup2(6, 1)                              = 1
close(3)                                = 0
close(6)                                = 0
fcntl(0, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(0, F_SETFL, O_RDWR)               = 0
execve("/usr/local/sbin/ssh", ["ssh", "-l", ....
execve("/usr/bin/ssh", ["ssh", "-l", .....
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd49ca20000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
...
... skip here ...
...
clock_gettime(CLOCK_BOOTTIME, {2396, 116297000}) = 0
read(3, "\216k__\225G!\263\265G}\374T\256\253\346\233\373>\305z][a\rb\32\211T\20\270\300"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 116528000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 116644000}) = 0
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
clock_gettime(CLOCK_BOOTTIME, {2396, 116915000}) = 0
read(3, "\303\220\n\231\212\234\355+\277\34n5\\^\21O\326\366\324\372N\0\225\276n\313\355U\357\347`\335"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 117163000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 117278000}) = 0
select(7, [3 4], [5], NULL, NULL)       = 2 (in [3], out [5])
clock_gettime(CLOCK_BOOTTIME, {2396, 117552000}) = 0
write(5, "\374\377\0\7,K\252G\377\n\260\23[\24DW@\312\233#\244\216\241\333\2300e\334L\24\212_"..., 16384) = 16384
read(3, "2\35\202\326T\32$F\276i\324\327k[\312\256\362\224/\255/z\247\17\243\261\346)\3464\312\236"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 117901000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 118047000}) = 0
select(7, [3 4], [3], NULL, NULL)       = 2 (in [3], out [3])
clock_gettime(CLOCK_BOOTTIME, {2396, 118324000}) = 0
read(3, "\5\260\374\270\275\305\350\270\355E^|\243Ku\216\177\3\266&\317@\10\335t\362\261kP\360Ts"..., 8192) = 8192
write(3, "7\2571u\265?\347\4\212Ig]\t\21<JC\332.\243\206\257k\246\216\266\252\5\v\r'N"..., 36) = 36
clock_gettime(CLOCK_BOOTTIME, {2396, 118728000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 118843000}) = 0
select(7, [3 4], [5], NULL, NULL)       = 2 (in [3], out [5])
clock_gettime(CLOCK_BOOTTIME, {2396, 119116000}) = 0
write(5, ";aW|v\223Yu\26\253w0\334\330\212\344\205\273\246^\370\"x@13\327<\240.G<"..., 16384) = 16384
read(3, "\376i\255\251\245\361ft\321\325\351_\264\337\333\361\363W`\37\235\361\267\3709\216\335\4\357\313H\336"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 119464000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 119579000}) = 0
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
clock_gettime(CLOCK_BOOTTIME, {2396, 119841000}) = 0
read(3, "\340\305\217i\325f2\267\246\211?Q\2gm\33\246\325\f\35\365\241d\322\255\370\300\275$\312\275D"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 120085000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 120200000}) = 0
select(7, [3 4], [5], NULL, NULL)       = 2 (in [3], out [5])
clock_gettime(CLOCK_BOOTTIME, {2396, 120474000}) = 0
write(5, "#\3W+\246\246FS\23\330\253?\207\1u\323\253\351\2563h3\26\301\3171\317}\244\352\212-"..., 16384

2nd file from rsync. 9277601 bytes long:

execve("/usr/bin/rsync", ["rsync", "--archive", "--stats", "--progress", "--delete", ......
brk(NULL)                               = 0x7fffe73de000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f993b200000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=21423, ...}) = 0
mmap(NULL, 21423, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f993b206000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\20\0\0\0\0\0\0"..., 832) = 832
...
...
...
open("some/folders/sub/folders/src/Debug_x64/*********.obj", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=70915, ...}) = 0
write(1, "some/folders/sub/*******"..., 64) = 64
read(3, "d\206\265\0\225C%Yj\256\0\0\317\2\0\0\0\0\0\0.drectve\0\0\0\0"..., 70915) = 70915
gettimeofday({1495618771, 246185}, NULL) = 0
ioctl(1, TIOCGPGRP, [546])              = 0
write(1, "\r         32,768  46%   50.71kB/"..., 46) = 46
select(6, [5], [4], [5], {60, 0})       = 2 (in [5], out [4], left {59, 999999})
read(5, "\rT\v\320\16u\203&#0\25\352\353\313\204#\5x;", 19) = 19
write(4, "Q\216\0\7\377\376^\0S\0:\0\\\0001\0-\0P\0R\0J\0\\\0S\0O\0F\0"..., 36437) = 36437
gettimeofday({1495618771, 246814}, NULL) = 0
gettimeofday({1495618771, 246891}, NULL) = 0
gettimeofday({1495618771, 246961}, NULL) = 0
write(1, "\r         70,915 100%  109.58kB/"..., 72) = 72
close(3)                                = 0
open("some/folders/sub/folders/src/Debug_x64/*******.obj", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=15243, ...}) = 0
write(1, "some/folders/sub/*****"..., 65) = 65
read(3, "d\206\17\0\225C%Yf)\0\0f\0\0\0\0\0\0\0.drectve\0\0\0\0"..., 15243) = 15243
gettimeofday({1495618771, 247785}, NULL) = 0
ioctl(1, TIOCGPGRP, [546])              = 0
write(1, "\r         15,243 100%   23.52kB/"..., 46) = 46
gettimeofday({1495618771, 248074}, NULL) = 0
write(1, "\r         15,243 100%   23.52kB/"..., 72) = 72
close(3)                                = 0
open("some/folders/sub/folders/src/Debug_x64/some.obj", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=1071815, ...}) = 0
write(1, "some/folders/sub/********"..., 61) = 61
read(3, "d\206s\6\225C%Y\206\321\t\0\343\36\0\0\0\0\0\0.drectve\0\0\0\0"..., 262144) = 262144
select(6, [5], [4], [5], {60, 0})       = 2 (in [5], out [4], left {59, 999998})
read(5, "\32\371\177\2075\240'\347\247\223\245\2647\310\252\372\301\251\205\303T\324%R=\253\n\257\26\251\371\235"..., 38) = 38
write(4, "\350\320\0\7\0\0&\30\0\0\0\0\0\0\0\0\0ATL::CSimpleStr"..., 53484

12) The client and the server is now in the state, where I can reliably repeat the hang :) So if I can somehow help further, just give me instructions.

Wishmesh commented 7 years ago

And one more thing. I am starting to suspect that it is hanging only on the files, that are created with Visual Studio that is running as Elevated process under admin privileges. But it is not easy to reproduce, because in most time it sync these files easily.

sunjoong commented 7 years ago

@Wishmesh - I don't know why that happened (hmm.... perhaps timestamp or file closing problem??), but remember you said;

I am trying to rsync folders between two Windows

If so.... between two Windows.... using robocopy might be a meanwhile workaround, I think.

therealkenc commented 7 years ago

write(4, "\350\320\0\7\0\0&\30\0\0\0\0\0\0\0\0\0ATL::CSimpleStr"..., 53484

Those writes are likely blocking because the send buffers are full. Which means it is the server (receiving) side where you are going to need to do the strace. Which is going to be a [. . .]

Visual Studio that is running as Elevated process under admin privileges.

That could matter. If you insist on running VS as admin, then you are probably going to want to try running bash with elevated privileges too; assuming you are not already. That might not (probably won't) unstick a stuck rsync, but, spitballing, it might avoid whatever is getting you in the state in the first place. That goes double if you are accessing those files with elevated privilege on the server as well, as opposed to just pushing them there for backups.

  1. Checked offending file on the server. The file some-name.pch is missing on the server.

For giggles take a peek in %appdatalocal%/lxss/temp (look but don't touch) on the server and see if there is anything untoward in there. See #1940. You have a totally unrelated scenario, of course, but you do have a situation with deletes going on under failure conditions. Failure conditions, which are probably uncommon enough to explain why you go days or weeks without seeing a problem.

Hopefully the devs take a look. Even more hopefully the problem just gets fixed magically with ongoing improvements, which happens a lot around here.

If so.... between two Windows.... using robocopy might be a meanwhile workaround, I think.

Well yeah. Or Cygwin rsync. But where would be the fun in that.

Wishmesh commented 7 years ago

Robocopy isn't an option, because I need to sync files to non-Windows machines too. Cygwin sucks. Bash on Windows is king 👍 And of course I can workaround the problem, but fix of this issue would be better ;)

in %localappdata%/lxss/temp (not %appdatalocal%) nothing unusual: 0000000000000000_devzero 0000000000000001_unlink 0000000000000002_unlink 0000000000000003_unlink 0000000000000004_unlink 0000000000000005_tmpfs 0000000000000006_unlink 0000000000000007_tmpfs 0000000000000008_unlink 0000000000000009_tmpfs 000000000000000a_unlink 000000000000000b_tmpfs 000000000000000c_unlink 000000000000000d_tmpfs 000000000000000e_unlink 000000000000000f_unlink 0000000000000010_unlink

Also, tried to run Bash as Elevated admin. It still stops on the same file. Cannot say if running as elevated would avoid getting stuck in the first place.

On server files are not touched / accessed. Like in the backup.

sunjoong commented 7 years ago

@Wishmesh

Cygwin sucks. Bash on Windows is king 👍

Haha... Then.... Does Msys2 rsync make same problem too? Compairing is a good tool to clearify issue, I think. And... Is that sure all Visual Studio windows were closed before launching rsync?

therealkenc commented 7 years ago

in %localappdata%/lxss/temp (not %appdatalocal%) nothing unusual: 0000000000000001_unlink .... etc

What are the size of those files? Or more specifically, do they look familiar in the context of your errant files (pch etc).

On server files are not touched / accessed. Like in the backup.

Wish you wouldn't have said that. That negates most theories related to elevated privileges. You client strace looks clean. You don't have any of the usual fs related gotchas there. Since you are blocking on a network write, by definition either the server isn't reading or the bytes, or the bytes didn't get there in the first place.

If your use case is backups, and there aren't other circumstances (like not having space on C:), you might want to push to /home or /var on the server. Eliminates a variable, anyway.

sunjoong commented 7 years ago

@Wishmesh - I think @therealkenc points good one. You might sync files under /mnt/<dir> directory. What if under /home/<dir> or /var/<dir>?

Wishmesh commented 7 years ago

Haha... Then.... Does Msys2 rsync make same problem too? Compairing is a good tool to clearify issue, I think. And... Is that sure all Visual Studio windows were closed before launching rsync?

1) Unable to compare with Msys2. 2) Of course VS closed. As I said. Full reboot for both - client and server. Nothing running except services... Start Bash... run rsync... hung...

What are the size of those files?

These are folders in the temp directory... I said - nothing unusual, because they are identical to what I see in other unrelated to this problem PC.

About syncing to /home ... just tried the same folder and of course rsync works. Remember I said in the first posts... that if I try to rsync individual files (that are stuck), then it unstacks... and everything works again from this point.

So I currently have stuck the following folder: some/folders/sub/folders/src/Debug_x64/some-file.obj on the file some-file.obj

I just did rsync to /home/user/testing ... and it works I also tried to rsync to /mnt/drive/some-other-path ... and of course it works. Remember, it need many tries to get to the "stuck state" -- some days or weeks.

After testing these I run original rsync command: rsync /mnt/u/some/ ... And it stuck on the same file: some/folders/sub/folders/src/Debug_x64/some-file.obj

So I suspect that if I will run: rsync /mnt/u/some/folders/ ... or rsync /mnt/u/some/folders/sub/ ... ... it will work... and the stuck state would be gone. Because rsync on individual file unstacks...

Do you have other ideas to try, before I issue rsync with "deeper" path?

sunjoong commented 7 years ago

@Wishmesh - I could not figure it out why that happen but... think there are still many traps in using DrvFS, i.e., /mnt/<dir>; DrvFS does not fully compatible with VolFS, i.e., /home/<dir> or /var/<dir>, and rsync of WSL is a real linux binary and considered to be run within linux filesystem (in this case, VolFS), so it might not fully compatiable with DrvFS yet. However, Cygwin or Msys2 use (modified) windows program (and considered to be able to be run within Windows filesystem); They don't have the same concept of DrvFS and VolFS (but they use another approach.)

I thought, if you could fix the point of condition when that problem occurs, i.e., if you could reproduce it when you want it, you could compare it with non DrvFS case. But... you look like to say it's impossible.

And... I don't understand @therealkenc's saying enogh but... guess it might be about sort of limits during programs running. You know, there are some directories in %localappdata%\lxss\temp; These directoris are changing contents during programs running.

Wishmesh commented 7 years ago

I couldn't resist to try.... So I have one file stuck: /mnt/x/1/2/3/4/5/some-file.ext

1) running rsync /mnt/x/1 user@IP:/mnt/x/abc/ result -- hang 2) running rsync /mnt/x/1/2 user@IP:/mnt/x/abc/1/ result -- hang 3) running rsync /mnt/x/1/2/3 user@IP:/mnt/x/abc/1/2/ result -- hang 4) running rsync /mnt/x/1/2/3/4/5 user@IP:/mnt/x/abc/1/2/3/4/ result -- hang 5) running rsync /mnt/x/1/2/3/4/5/some-file.ext user@IP:/mnt/x/abc/1/2/3/4/5/ result -- success

6) running command from the step 1 again rsync /mnt/x/1 user@IP:/mnt/x/abc/ success

Issue is gone till the next time :(

Wishmesh commented 7 years ago

And if it is filesystem issue, then why syncing files one by one works?

sunjoong commented 7 years ago

@Wishmesh

And if it is filesystem issue, then why syncing files one by one works?

Good question; You know, people say good question when they don't know a quite answer.

I cannot be sure that's filesystem issue; I'm just guessing it might be or not and it could be timestamp related issue if it might be a filesystem issue. Of course, it might not be timestamp related, it might even not filesystem issue, it might be a issue I could not imagine. But I know there is a timestamp issue in DrvFS itself that make a strange thing, and found another timestamp issue of rsync (but it might not fit on your case) https://askubuntu.com/questions/112863/rsync-not-working-between-ntfs-fat-and-ext .

therealkenc commented 7 years ago

About syncing to /home ... just tried the same folder and of course rsync works.

Sorry my poor explanation. When I said "you might want to push to /home or /var on the server", I meant as operating procedure moving forward. Something is getting borked on the server side. We don't know what, because no server side strace. But I think you'll have a lower probability of borkage on the server if you push to VolFS, because smart money says this is a DrvFS "filesystem thing". [You'll keep using DrvFS on the client of course, because the whole point is having access in Visual Studio. There's no problem we can see on the client.]

Do you have other ideas to try, before I issue rsync with "deeper" path?

Once you push to any different folder without corruption you're golden. Go deeper, sideways, to a whole different server, whatever.

Hold on to the corrupt folder in its pristine corrupt state if you have the space. The devs are going to need diagnostics of some kind on that tree, because no one will be able to repro this locally. You've got a real bug. But note that the chances of anyone reading this far down the thread is rapidly decreasing.

sunilmut commented 7 years ago

Adding @SvenGroot to see if he can help out with this one. Also, marking this one as a bug.

SvenGroot commented 7 years ago

@Wishmesh Could you follow the steps from section 8 at https://github.com/Microsoft/BashOnWindows/blob/master/CONTRIBUTING.md, and provide us with the log files? That may provide some insight in to what is happening.

therealkenc commented 7 years ago

His problem is on the server-side of an WSL-to-WSL rsync push. There's gotta be a way to collect a server-side strace but it isn't immediately obvious to me. The client-side he supplied above. It hangs on:

write(4, "\350\320\0\7\0\0&\30\0\0\0\0\0\0\0\0\0ATL::CSimpleStr"..., 53484

...which is not of much use, because it is not failing, just blocking. The problem isn't network-related because the problem persists across a reboot and only once the DrvFS filesystem on the server-side has gotten into a corrupt state.

SvenGroot commented 7 years ago

Since the server is also WSL, logs of both sides would be helpful. Strace would be helpful but I can see why that would be difficult in this scenario. However, ETL logs should be easy if you have access to the server.

Thanks, Sven

Wishmesh commented 7 years ago

Thanks for looking into this!

1) I tried to get the server-side strace with the following command: sudo strace -ff -o mylog /usr/sbin/sshd I got two files - mylog.41 and mylog.42, however I do not see the relevant lines in contrast to client strace logs.

If you have idea how to strace rsync on server, let me know.

2) I got some log files using logman command. Particularly the lxcore_kernel.etl seems to contain something useful, but I am unable to read it using EventViewer... missing some decoder perhaps. As it seems that it contains private info, I do not feel comfortable to share it here. Maybe I can share it to some email with @ microsoft.com at the end?

jlahijani commented 7 years ago

I'm experiencing this issue as well. It occurs on this image. test

nvsystems commented 7 years ago

Same problem here: rsync on W10 randomly does stop transferring files even though it usually works fine! The files do not seem to be of a special content or suspicious filename. Rebooting both the source Windows 10 machine and the destination Linux server does not change it, still hangs.

Deleting the entire target directory and rsyncing the contents again as well as temporarily moving stuff around and back on the source Windows Machine works (not sure which one of the two thing did the trick today), but the bug will be back a couple of days/weeks later.

Very annoying! Major bug here.

therealkenc commented 7 years ago

Rebooting both the source Windows 10 machine and the destination Linux server does not change it, still hangs.

The nice thing is once it gets in an inconsistent state, it stays in the inconsistent state. We're still waiting (since June) on guidance on how to debug this on the server side. An strace from the client doesn't help, because the client just sits there waiting on the (blocked) server, in a perfectly normal way. The first step it getting a strace from the server.

Unfortunately I don't have an answer on how to do that. Any suggestion I could think of off the top of my head involves a lot of hand waving.

nvsystems commented 7 years ago

Got into an inconsistent state again. Same directory as before, different file - strange why so often this specific directory. So I did a lot of hand waiving and I wrote myself a small bash script to quickly grep 'ps aux' for the PIDs from the two 'rsync --server' processes running on the "real" linux server and made it start "strace -pPID -ff -tt -o rsync-log.txt &" twice.

Now I have two large (22M, 87K) strace logs from the server. I cannot share them completely because they contain a huge list of private filenames, but I can share the relevant bits and especially how it ends. Why are there 2 rsync processed on the server? I have no clue but this is normal during rsync AFAIK.

Output rsync on windows:

sending incremental file list PwrHaus/Bak/ PwrHaus/Bak/20170804.bak [now hangs indefinitely]

Output from rsync strace (Server Side, first PID): https://pastebin.com/FihPTpHE => in the first few lines, up to about line 17 you can see it going through files that required no syncing and where there was no problem (last one was "060303_CSI_Betriebs...". Then it arrives at the problematic PwrHaus/Bak/20170804.bak and starts do to something (?!?) for about 100 lines and then ends with "select(1, [0], [], [0], {tv_sec=60, tv_usec=0}) = 0 (Timeout)" repeating every minute while the transfer hangs.

Output from rsync strace (Server Side, second PID): https://pastebin.com/BN2zz512 The problematic file (20170804.bak) strangely does not appear at or near the bottom of this trace. I had to cut out a couple thousand lines in between because of private filename data, but those were pretty irrelevant. This process seems to skip over 20170804.bak just fine, just noting that it does not exist (yes, it does not - will require syncing since it is a new file). Then it continues on to check A LOT of other files. But this process also ends with a "select(4, [3], [], [3], {tv_sec=60, tv_usec=0}) = 0 (Timeout)" repeating every minute.

I hope this can help someone figuring out what is happening here. Personally, I do not have much of a clue what this strace log from the server reveals - first time I've ever used strace in my life. I hope someone does.

bvibber commented 7 years ago

Possible workaround for freezes on partial transfers: use the -W (--whole-file) option to force complete re-transfer of the problematic files. This avoids the freezing for me, at the cost of additional bandwidth for files that changed since last run.

(I've found that I see regular freezes like this rsyncing files in the 1-2 gig range when I'm doing a second pass after some files have changed: file is transferred at size X, then later transferred again after it's grown to size Y. This is rsyncing from WSL on Win10 Creator's Update to macOS.)

rib commented 7 years ago

Just to note that the above -W workaround suggestion seems to work for me too (thanks). In my case I'm hitting freezes syncing from WSL on Win10 Creator's update to another machine via ssh with the same WLS + Win10 Creator's update.

Here I'm not seeing freezes with large files that have changed, I'm seeing freezes in a first-pass sync to copy across files that are missing on the remote machine, not changed. Most of the files I'm currently trying to sync happen to be source code and ELF32/64, ARM/AArch64, shared library binaries built for Android, some in the order of 100MB but most files < 1M.

nvsystems commented 7 years ago

The -W parameter works for me too. Why it works is a bit strange because the files that are transferred are new files and not files on which any kind of delta resume would needed.

therealkenc commented 7 years ago

Why it works is a bit strange because the files that are transferred are new files and not files on which any kind of delta resume would needed.

-W takes a different code path entirely, so even without file deltas, rsync is hitting different syscall surface. I think, possibly, the delta algorithm might be banging on mmap() and causing problems. I haven't done any work to investigate that hypothesis, mind. Anyway, "something" is different in the delta code path.

If you do a grandpa-style tar pipe (mentioned in May), which is more-or-less the same as -W, it won't hang either. [edit: best evidence is -W is not a viable work-around.]

tar cpf - /mnt/u/some_path | (ssh some_name@somewhere "cd remote_path; tar xpf -")

You can even toss a z in there if you want compression. Not that there is any advantage to using tar over rsync -W. [edit] Except the part where -W isn't helping for some folks. (posts below)

geovex commented 7 years ago

Unfortunately -W did not help me. It fails if you have large number of files to transfer.

justyns commented 7 years ago

I'm running into the same issue. I was hoping to replace my cygwin+rsync backup scripts with wsl+bash+rsync, but so far rsync hangs forever after transferring maybe 30-50 files. I can ctrl+c the process and start it over for it to get hung on a different file (after transferring the one it hung on previously) so I'm not sure if the type of file or specific files matter.

Here's the last bit of an strace when using the -W rsync argument:

[pid  7423] gettimeofday({1510548955, 543152}, NULL) = 0
          2,973 100%    6.18kB/s    0:00:00 (xfr#86, to-chk=151302/152236)
) = 76
[pid  7423] time(NULL)                  = 1510548955
[pid  7423] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1072, ...}) = 0
[pid  7423] write(3, "2017/11/12 22:55:55 [7423] <f+++"..., 135) = 135
[pid  7423] close(4)                    = 0
[pid  7423] open("justyns/.vscode/extensions/donjayamanne.python-0.7.0/out/client/providers/jediHelpers.js", O_RDONLY) = 4
[pid  7423] fstat(4, {st_mode=S_IFREG|0777, st_size=5123, ...}) = 0
[pid  7423] write(1, "justyns/.vscode/extensions/donja"..., 89justyns/.vscode/extensions/donjayamanne.python-0.7.0/out/client/providers/jediHelpers.js
) = 89
[pid  7423] read(4, "\"use strict\";\nObject.definePrope"..., 5123) = 5123
[pid  7423] select(7, [6], [5], [6], {5, 0}) = 2 (in [6], out [5], left {4, 999999})
[pid  7423] read(6, "\0\0\0\0\0\0\0\2\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\240\0\0\0"..., 437) = 437
[pid  7423] time(NULL)                  = 1510548955
[pid  7423] write(5, "\366\362\0\7\"use strict\";\nObject.defineP"..., 62202

And without the -W:

[pid  7427] gettimeofday({1510550470, 869685}, NULL) = 0
             35 100%    0.07kB/s    0:00:00 (xfr#97, to-chk=151238/152236)
) = 76
[pid  7427] time(NULL)                  = 1510550470
[pid  7427] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1072, ...}) = 0
[pid  7427] write(3, "2017/11/12 23:21:10 [7427] <f+++"..., 119) = 119
[pid  7427] close(4)                    = 0
[pid  7427] open("justyns/.vscode/extensions/donjayamanne.python-0.7.0/pythonFiles/PythonTools/ipythonServer.py", O_RDONLY) = 4
[pid  7427] fstat(4, {st_mode=S_IFREG|0777, st_size=22951, ...}) = 0
[pid  7427] write(1, "justyns/.vscode/extensions/donja"..., 94justyns/.vscode/extensions/donjayamanne.python-0.7.0/pythonFiles/PythonTools/ipythonServer.py
) = 94
[pid  7427] read(4, "\nimport sys\nimport socket\nimport"..., 22951) = 22951
[pid  7427] select(7, [6], [5], [6], {5, 0}) = 2 (in [6], out [5], left {4, 999998})
[pid  7427] read(6, "\0\0\0\0\0\0\0\1\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\240\0\0\0"..., 76) = 76
[pid  7427] time(NULL)                  = 1510550470
[pid  7427] write(5, "K\342\0\7import os\nimport io\nimport r"..., 57935

Using a tar | ssh tar pipe doesn't have any issues and completes succesfully.

roboprint commented 6 years ago

The same bug for me, but -W doesn't help.

mihoe commented 6 years ago

Same thing here. rsync on WSL/Ubuntu uploading via ssh to a native Ubuntu Server always stalls forever after a few files. --whole-file option doesn't help.

So it's back to cygwin (which works flawlessly for the exact same set of files) until WSL gets a working rsync.

thedewi commented 6 years ago

My WSL rsync/ssh client also hangs after sending a few files even when they are small. It's not a particular file; it gets about 1MiB further each run. I wonder if pipes are prone to getting stuck; cygwin had a similar issue 10 years ago.

For now I've gone back to using git's msys2 environment with an rsync binary dropped in from the main msys2 repo.

jjriv commented 6 years ago

I am experiencing this same issue going over SSH from a Windows OS to CentOS 7. It would stall out every time on the same file, at exactly 56%.

Then I tried emptying the folder on the CentOS 7 server. The rsync would then stall out on a different file, but at 1%, every time.

GregDThomas commented 6 years ago

FWIW, I'm seeing the same problem (to Ubuntu 16.04.3), and the -W flag did solve it for me.

acmuller commented 6 years ago

I've had the same problem, ever since I began to use Win10 Ubuntu, one it's first release. Backing up with rsync to my server, it always hangs on Thunderbird data files. Usually they are a bit large, but not huge. If I run the same script on cygwin or Ubuntu in VMWare, it works fine. I've been looking around for a long time for someone else having this problem, and now, here it is. I have never been able to find a solution, so I have no advice--just another voice to say that it's a real problem, and unfortunate, since rsync is the most important thing I need to use.

therealkenc commented 6 years ago

it's a real problem, and unfortunate

Indeed, it is.

Keep in mind for anyone unfortunate enough land here, this is blocked on Sven's post last June:

Could you follow the steps from section 8 at https://github.com/Microsoft/BashOnWindows/blob/master/CONTRIBUTING.md,

This is a client-server scenario, so it is going to need an strace on both sides. I don't have a baby-step walkthough for doing that, and none have been offered. I imagine it is going to require launching rsycnd manually though strace. We have client-side examples, but they don't say much because the client just blocks on a write(); which in of itself isn't notable if the server is stuck. In the OP manifestation at least, once it gets stuck it stays stuck (even through a reboot), which is "good" in the sense that it is reproducible.

For my part, I rsync between a local VM and WSL all the time with ginormous trees using LxFs and it has always been fine. But that isn't saying much if you are over the wire and/or using DrvFs. Or maybe there are other variables. But since it works for me I also take as given it works for the folks at MSFT, so here this sits about to celebrate its first birthday in May.

programmylife commented 6 years ago

Posting to say that my HDD does not have this problem, but my SSD does and swapping the -P flag for the -W flag did fix it for me.

acmuller commented 6 years ago

Thank you for this. Swapping the -P flag for the -W flag works for me as well, both on HDD and SSD. I've been trying to resolve this for 2 years, so much appreciated!

mumbleskates commented 6 years ago

I'm having the same issue. It occurs with great reliability when syncing over ssh and attempting to push a diff-update for a folder with a large virtual disk file (120 gigabytes). Each time it is run it will make it through the similar portion, update a segment of the file, then hang... the server side's relevant rsync process continues to work very hard for I believe about 5 minutes, then goes away; meanwhile, the client side never updates again.

jietang1750 commented 6 years ago

Same issue here. It happens while backing up my music library. It always gets stuck on the same file until I hit ctrl-c. It will then copy a few more files and gets stuck again. Adding "--whole-file" flag fixed the problem for me.

acmuller commented 6 years ago

Yes, adding "--whole-file" is OK as a workaround, but that's not how Rsync is supposed work, optimally, right? Adding the flag does push the files through whole, but the process definitely runs slower. I think a proper fix still needs to be found for this.