hanwen / go-fuse

FUSE bindings for Go
Other
2.04k stars 328 forks source link

Need help debugging weird behavior on go-fuse + macOS + MacFuse 4.x #390

Closed jkowalski closed 1 year ago

jkowalski commented 3 years ago

I've been trying to rewrite filesystem mounting layer in https://github.com/kopia/kopia from bazil.org/fuse to go-fuse (which is excellent, especially v2 API)

The port went mostly smoothly, but I've noticed some really weird behavior with mounted filesystems and macOS Preview app. I don't know which layer the problem is in (my code, go-fuse, macfuse or macOS itself) to even investigate. Any suggestions are appreciated:

The symptoms are really weird:

  1. When I mount snapshots (using kopia mount all /some/mountpoint) everything appears fine - I can see the files, directories, they have the right file mode, attributes, owners and I can see their contents just fine using simple Unix tools (cat, wc, sha256sum, ...) and most GUI tools (like watching videos using VLC for example). I can also see photo thumbnails in Finder.

  2. However, when I try to open some image files from a FUSE mount using Preview it fails with:

The file AAA.jpg cannot be opened. It may be damaged or use a file format that Preview doesn’t recognize.

When this happens, go-fuse invokes the Read() of one byte (which succeeds and returns first byte of the JPEG which is 0xFF) and never reads anything again.

  1. If I first copy the file locally (using Finder Copy/Paste or cp), I'm able to open the file just fine.

Here's my pull request to kopia, seems straightforward, I think: https://github.com/kopia/kopia/pull/829/files

Am I missing something mac-specific I need to enable in go-fuse to get this to work? I don't remember bazil.org/fuse having this behavior, but it stopped working once I upgraded to macfuse 4.x.

jkowalski commented 3 years ago

I tried debugging this deeper. I modified hello example to return one image file and it worked. Then I mounted a directory containing exactly one image file in Kopia and it failed.

I emitted, cleaned up and compared debug logs from both sessions

The only difference between hello (working) and kopia (not working) was:

hello: OPEN n2 {O_RDONLY} => OK, {Fh 0 CACHE} kopia: OPEN n2 {O_RDONLY} => OK, {Fh 1 DIRECT}

And sure enough, disabling direct IO fixed the issue.

hanwen commented 3 years ago

great to hear that you have solved the problem. Do you still need anything here?

jkowalski commented 3 years ago

great to hear that you have solved the problem. Do you still need anything here?

Perhaps some help understanding what could be wrong with direct io here? Would this be a macfuse issue?

hanwen commented 3 years ago

it's hard to tell. You can probably tell from the getattr/open/read sequence what is going on. Normally, a program wouldn't just read 1 byte; is it really requesting 1 byte?

jkowalski commented 3 years ago

it's hard to tell. You can probably tell from the getattr/open/read sequence what is going on. Normally, a program wouldn't just read 1 byte; is it really requesting 1 byte?

This is the cleaned-up and matched sequence of operations as recorded by debug log:

INIT n0 {7.19 Ra 0x100000 VOL_RENAME,CASE_INSENSITIVE,EXPLICIT_INVAL_DATA,ATOMIC_O_TRUNC,XTIMES,0xc000000}  => OK, {7.8 Ra 0x100000  0/0 Wr 0x10000 Tg 0x0}

LOOKUP n1 [".hidden"] 8b => 2=no such file or directory, {n0 g0 tE=30s tA=0s {M00 SZ=0 L=0 0:0 0 0:0 A 0.000000 M 0.000000 C 0.000000}}
LOOKUP n1 [".DS_Store"] 10b => 2=no such file or directory, {n0 g0 tE=30s tA=0s {M00 SZ=0 L=0 0:0 0 0:0 A 0.000000 M 0.000000 C 0.000000}}
LOOKUP n1 ["foo.jpg"] 8b => OK, {n2 g0 tE=30s tA=30s {M0100744 SZ=2431902 L=0 0:0 0 0:3 A 0.000000 M 0.000000 C 0.000000}}

OPEN n2 {O_RDONLY}  => OK, {Fh 1 DIRECT}
FLUSH n2 {Fh 1}  => OK
RELEASE n2 {Fh 1   L0}  => OK

LOOKUP n1 ["foo.jpg"] 8b => OK, {n2 g0 tE=30s tA=30s {M0100744 SZ=2431902 L=0 0:0 0 0:3 A 0.000000 M 0.000000 C 0.000000}}

OPEN n2 {O_RDONLY}  => OK, {Fh 1 DIRECT}
FLUSH n2 {Fh 1}  => OK
RELEASE n2 {Fh 1   L0}  => OK

LOOKUP n1 ["foo.jpg"] 8b => OK, {n2 g0 tE=30s tA=30s {M0100744 SZ=2431902 L=0 0:0 0 0:3 A 0.000000 M 0.000000 C 0.000000}}
OPEN n2 {O_RDONLY}  => OK, {Fh 1 DIRECT}
FLUSH n2 {Fh 1}  => OK
RELEASE n2 {Fh 1   L0}  => OK

OPEN n2 {O_RDONLY}  => OK, {Fh 1 DIRECT}
READ n2 {Fh 1 [0 +1) }  => OK,  1b data "\xff"
FLUSH n2 {Fh 1}  => OK
RELEASE n2 {Fh 1   L0}  => OK

OPEN n2 {O_RDONLY}  => OK, {Fh 1 DIRECT}
FLUSH n2 {Fh 1}  => OK
RELEASE n2 {Fh 1   L0}  => OK
DESTROY n1  => OK
hanwen commented 3 years ago

weird; what does it look like without DIRECT ? On Linux, I'd use strace to see the system calls executed; it seems strange for a program to read 1 byte, but the whole trace looks suspect (why would it open a file to directly close it again?)

jkowalski commented 3 years ago

That's macOS - it's what it does :)

Also, here's the output from modified hello:

INIT n0 {7.19 Ra 0x100000 CASE_INSENSITIVE,EXPLICIT_INVAL_DATA,ATOMIC_O_TRUNC,XTIMES,VOL_RENAME,0xc000000}  => OK, {7.8 Ra 0x100000  0/0 Wr 0x10000 Tg 0x0}

LOOKUP n1 [".hidden"] 8b => 2=no such file or directory, {n0 g0 tE=0s tA=0s {M00 SZ=0 L=0 0:0 0 0:0 A 0.000000 M 0.000000 C 0.000000}}
LOOKUP n1 [".DS_Store"] 10b => 2=no such file or directory, {n0 g0 tE=0s tA=0s {M00 SZ=0 L=0 0:0 0 0:0 A 0.000000 M 0.000000 C 0.000000}}
LOOKUP n1 ["foo.jpg"] 8b => OK, {n2 g0 tE=0s tA=0s {M0100644 SZ=1827552 L=0 0:0 0 0:2 A 0.000000 M 0.000000 C 0.000000}}

OPEN n2 {O_RDONLY}  => OK, {Fh 0 CACHE}
FLUSH n2 {Fh 0}  => OK
RELEASE n2 {Fh 0   L0}  => OK

OPEN n2 {O_RDONLY}  => OK, {Fh 0 CACHE}
FLUSH n2 {Fh 0}  => OK
RELEASE n2 {Fh 0   L0}  => OK

OPEN n2 {O_RDONLY}  => OK, {Fh 0 CACHE}
FLUSH n2 {Fh 0}  => OK
RELEASE n2 {Fh 0   L0}  => OK

OPEN n2 {O_RDONLY}  => OK, {Fh 0 CACHE}
READ n2 {Fh 0 [0 +4096) }  => OK,  4096b data "\xff\xd8\xff\xe0\x00\x10JF"...
SETATTR n2 {atime 1613105922.268628721}  => OK, {tA=0s {M0100644 SZ=1827552 L=0 0:0 0 0:0 A 0.000000 M 0.000000 C 0.000000}}
FLUSH n2 {Fh 0}  => OK
READ n2 {Fh 0 [4096 +28672) }  => OK,  28672b data "        "...
RELEASE n2 {Fh 0   L0}  => OK
LISTXATTR n2 {sz 0}  => OK
jkowalski commented 3 years ago

(note it's slightly different JPEG, that's why SZ is different)

hanwen commented 3 years ago

did this work with DIRECT in bazil.org FUSE ? Why do you need DIRECT ?

the only difference I see is that your user-supplied inode number doesn't match the number we send to the kernel (n2 vs 0:3). I don't know where this difference comes from. I think it shouldn't matter.

jkowalski commented 3 years ago

I used DIRECT by accident, I don't think I need it at all. Just curious. BTW I found 2 issues in osxfuse repo that talk about this behavior, but they refer to old versions of macOS. I have seen this on 11.1.

jkowalski commented 3 years ago

one nit: would be great if there was an option of reporting single-line logs (buffer rx in some kind of map and emit together with tx), otherwise grepping is hard. I had to write a tool to match requests and responses to produce these logs.

hanwen commented 1 year ago

otherwise grepping is hard. I had to write a tool to match requests and responses to produce these logs.

A lot of bugs are due to timing issues, so it is really helpful for me to see some some kind of event ordering.