seaweedfs / seaweedfs

SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.
https://seaweedfs.com
Apache License 2.0

Mount crashed during rename #3104

Open thaines opened 2 years ago

thaines commented 2 years ago

As described by the title. This might be related to https://github.com/chrislusf/seaweedfs/issues/3039, but the log attached to that issue diverges from what happened here and crashes at a different point, so I would guess not.

Seaweed 3.06, 3 masters, 3 filers, 7 nodes of volumes, 8 mount points, Cassandra 4.0 for DB, all as secure as possible. Can share configuration details if helpful.

Here is what the mount process printed as it died:

May 25 08:44:15 bigchin weed[18019]: I0525 08:44:15 18019 weedfs_file_mkrm.go:88] mknod /homes/<username>/Research/fsom/.git/refs/heads/main.lock: CreateEntry : insert entry /homes/<username>/Research/fsom/.git/refs/heads/main.lock: insert /homes/<username>/Research/fsom/.git/refs/heads/main.lock: EOF
May 25 08:44:15 bigchin weed[18019]: I0525 08:44:15 18019 wfs_filer_client.go:29] WithFilerClient 0 138.38.108.118:18888: CreateEntry : insert entry /homes/<username>/Research/fsom/.git/refs/heads/main.lock: insert /homes/<username>/Research/fsom/.git/refs/heads/main.lock: EOF
May 25 08:44:15 bigchin weed[18019]: I0525 08:44:15 18019 weedfs_file_mkrm.go:88] mknod /homes/<username>/Research/fsom/.git/refs/heads/main.lock: CreateEntry : insert entry /homes/<username>/Research/fsom/.git/refs/heads/main.lock: insert /homes/<username>/Research/fsom/.git/refs/heads/main.lock: EOF
May 25 08:44:15 bigchin weed[18019]: I0525 08:44:15 18019 wfs_filer_client.go:29] WithFilerClient 1 138.38.108.121:18888: CreateEntry : insert entry /homes/<username>/Research/fsom/.git/refs/heads/main.lock: insert /homes/<username>/Research/fsom/.git/refs/heads/main.lock: EOF
May 25 08:45:17 bigchin weed[18019]: I0525 08:45:17 18019 wfs_filer_client.go:29] WithFilerClient 0 138.38.108.115:18888: dir Rename /homes/<username>/Research/fsom => /homes/<username>/Research/fsom1 receive: rpc error: code = Unknown desc = /homes/<username>/Research/fsom move error: fail to move /homes/<username>/Research/fsom => /homes/<username>/Research/fsom1: fail to move /homes/<username>/Research/fsom/.git => /homes/<username>/Research/fsom1/.git: fail to move /homes/<username>/Research/fsom/.git/hooks => /homes/<username>/Research/fsom1/.git/hooks: fail to move /homes/<username>/Research/fsom/.git/hooks/pre-rebase.sample => /homes/<username>/Research/fsom1/.git/hooks/pre-rebase.sample: filer: no entry is found in filer store
May 25 08:45:17 bigchin weed[18019]: I0525 08:45:17 18019 wfs_filer_client.go:29] WithFilerClient 1 138.38.108.118:18888: dir Rename /homes/<username>/Research/fsom => /homes/<username>/Research/fsom1 receive: rpc error: code = Unknown desc = /homes/<username>/Research/fsom1 is not empty
May 25 08:45:17 bigchin weed[18019]: I0525 08:45:17 18019 wfs_filer_client.go:29] WithFilerClient 2 138.38.108.121:18888: dir Rename /homes/<username>/Research/fsom => /homes/<username>/Research/fsom1 receive: rpc error: code = Unknown desc = /homes/<username>/Research/fsom1 is not empty
May 25 08:45:17 bigchin weed[18019]: I0525 08:45:17 18019 weedfs_rename.go:210] Link: dir Rename /homes/<username>/Research/fsom => /homes/<username>/Research/fsom1 receive: rpc error: code = Unknown desc = /homes/<username>/Research/fsom1 is not empty
May 25 08:45:17 bigchin weed[18019]: panic: runtime error: invalid memory address or nil pointer dereference
May 25 08:45:17 bigchin weed[18019]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1c0623d]
May 25 08:45:17 bigchin weed[18019]: goroutine 31 [running]:
May 25 08:45:17 bigchin weed[18019]: github.com/chrislusf/seaweedfs/weed/mount.(*InodeToPath).MovePath(0xc00070c030, {0xc0040b5f80, 0x52}, {0xc0006b60c0, 0x58})
May 25 08:45:17 bigchin weed[18019]:         /github/workspace/weed/mount/inode_to_path.go:168 +0x15d
May 25 08:45:17 bigchin weed[18019]: github.com/chrislusf/seaweedfs/weed/mount.(*WFS).handleRenameResponse(0xc000396c60, {0x27e84c0, 0xc00409a800}, 0xc003b6aa50)
May 25 08:45:17 bigchin weed[18019]:         /github/workspace/weed/mount/weedfs_rename.go:236 +0x4ba
May 25 08:45:17 bigchin weed[18019]: github.com/chrislusf/seaweedfs/weed/mount.(*WFS).Rename.func1({0x27fa898, 0xc00404f160})
May 25 08:45:17 bigchin weed[18019]:         /github/workspace/weed/mount/weedfs_rename.go:199 +0x3fc
May 25 08:45:17 bigchin weed[18019]: github.com/chrislusf/seaweedfs/weed/mount.(*WFS).WithFilerClient.func1.1(0xc000127880)
May 25 08:45:17 bigchin weed[18019]:         /github/workspace/weed/mount/wfs_filer_client.go:25 +0x6f
May 25 08:45:17 bigchin weed[18019]: github.com/chrislusf/seaweedfs/weed/pb.WithGrpcClient(0x1a?, 0xc00275fbb0, {0xc0021f8198, 0x14}, {0xc00071dba0?, 0x0?, 0x0?})
May 25 08:45:17 bigchin weed[18019]:         /github/workspace/weed/pb/grpc_client_server.go:145 +0x17f
May 25 08:45:17 bigchin weed[18019]: github.com/chrislusf/seaweedfs/weed/mount.(*WFS).WithFilerClient.func1()
May 25 08:45:17 bigchin weed[18019]:         /github/workspace/weed/mount/wfs_filer_client.go:23 +0x14d
May 25 08:45:17 bigchin weed[18019]: github.com/chrislusf/seaweedfs/weed/util.Retry({0x222d15e, 0xa}, 0xc00275fcc0)
May 25 08:45:17 bigchin weed[18019]:         /github/workspace/weed/util/retry.go:16 +0xb1
May 25 08:45:17 bigchin weed[18019]: github.com/chrislusf/seaweedfs/weed/mount.(*WFS).WithFilerClient(0xc003a72480?, 0x21?, 0x7fb17c7da090?)
May 25 08:45:17 bigchin weed[18019]:         /github/workspace/weed/mount/wfs_filer_client.go:16 +0x66
May 25 08:45:17 bigchin weed[18019]: github.com/chrislusf/seaweedfs/weed/mount.(*WFS).Rename(0xc000396c60, 0x1bef286?, 0xc000894618, {0xc004bb832c, 0x4}, {0xc004bb8340, 0x4})
May 25 08:45:17 bigchin weed[18019]:         /github/workspace/weed/mount/weedfs_rename.go:166 +0x4cd
May 25 08:45:17 bigchin weed[18019]: github.com/hanwen/go-fuse/v2/fuse.doRename2(0xc000894480?, 0xc000894480)
May 25 08:45:17 bigchin weed[18019]:         /go/pkg/mod/github.com/hanwen/go-fuse/v2@v2.1.0/fuse/opcode.go:422 +0x5c
May 25 08:45:17 bigchin weed[18019]: github.com/hanwen/go-fuse/v2/fuse.(*Server).handleRequest(0xc000584b00, 0xc000894480)
May 25 08:45:17 bigchin weed[18019]:         /go/pkg/mod/github.com/hanwen/go-fuse/v2@v2.1.0/fuse/server.go:483 +0x1f3
May 25 08:45:17 bigchin weed[18019]: github.com/hanwen/go-fuse/v2/fuse.(*Server).loop(0xc000584b00, 0x0?)
May 25 08:45:17 bigchin weed[18019]:         /go/pkg/mod/github.com/hanwen/go-fuse/v2@v2.1.0/fuse/server.go:456 +0x108
May 25 08:45:17 bigchin weed[18019]: created by github.com/hanwen/go-fuse/v2/fuse.(*Server).readRequest
May 25 08:45:17 bigchin weed[18019]:         /go/pkg/mod/github.com/hanwen/go-fuse/v2@v2.1.0/fuse/server.go:323 +0x52b
May 25 08:45:17 bigchin systemd[1]: seaweed-mount.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
May 25 08:45:17 bigchin systemd[1]: seaweed-mount.service: Failed with result 'exit-code'.

On the filer at that time (not sure what is up with Cassandra - seeing a lot of I/O timeout errors in the log, but Cassandra reports as healthy):

May 25 08:34:07 aching weed[11727]: I0525 08:34:07 11727 filer_notify.go:103] metadata log write failed /topics/.system/log/2022-05-25/08-33.0ebe7b23: mkdir /topics: insert /topics: gocql: no hosts available in the pool
May 25 08:47:59 aching weed[11727]: I0525 08:47:59 11727 cassandra_store.go:215] list iterator close: read tcp 127.0.0.1:45458->127.0.0.1:9042: i/o timeout
May 25 08:49:18 aching weed[11727]: I0525 08:49:18 11727 filer_delete_entry.go:45] delete directory /homes/<username>/Research/fsom/.git/objects/f5: filer store delete: delete /homes/<username>/Research/fsom/.git/objects/f5 : read tcp 127.0.0.1:46114->127.0.0.1:9042: i/o timeout
May 25 08:52:27 aching weed[11727]: E0525 08:52:27 11727 filer.go:182] insert entry /homes/<username>/Research/fsom/.git/index.lock: insert /homes/<username>/Research/fsom/.git/index.lock: EOF

(I've anonymised the paths)

I did grab the 4 logs from the log directory as well, but don't see any extra information in them. No memory dump was produced by the crash. I can provide extra logs, including those from other nodes, if they would be useful.

It should be noted that we finally upgraded from 2.65 yesterday - we hadn't been able to upgrade for a number of reasons, including finding a later version much too unstable to use, and just timing (this system has a couple of hundred users - downtime is tricky as there is always a conference deadline incoming). We were very much running on fumes at the end there: the nodes it runs on are also the nodes doing the computation, and they are often under extreme load for sustained periods, which causes issues. My working theory is that mount caches were getting out of sync and not catching up, but whatever the cause, there are a lot of inconsistencies left in the file system from several months of this. It is entirely possible that the above crash is caused by an assumption that doesn't hold because of one of them. I did, however, run all of the healing commands in the weed shell yesterday without issue.

chrislusf commented 2 years ago

Added code to address this. Please help verify it in the coming release.
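
The change itself is not shown in this thread. As a rough illustration only, the trace above (a nil pointer dereference at addr=0x8 inside MovePath) is consistent with a path-to-inode lookup succeeding while the matching inode-to-path entry is missing. The sketch below shows one way to guard against that. The types `inodeEntry` and `inodeToPath` and the method `movePath` are hypothetical, simplified stand-ins, not the SeaweedFS code.

```go
package main

import (
	"fmt"
	"sync"
)

// inodeEntry and inodeToPath are hypothetical stand-ins for the mount's
// inode bookkeeping; they are NOT the actual SeaweedFS types.
type inodeEntry struct {
	path string
}

type inodeToPath struct {
	mu         sync.Mutex
	path2inode map[string]uint64
	inode2path map[uint64]*inodeEntry
}

func newInodeToPath() *inodeToPath {
	return &inodeToPath{
		path2inode: make(map[string]uint64),
		inode2path: make(map[uint64]*inodeEntry),
	}
}

// movePath re-points the inode known under source to target.
// It returns false instead of panicking when the two maps disagree.
func (m *inodeToPath) movePath(source, target string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()

	inode, found := m.path2inode[source]
	if !found {
		// The source was never looked up through this mount; nothing to move.
		return false
	}
	delete(m.path2inode, source)

	entry, ok := m.inode2path[inode]
	if !ok || entry == nil {
		// The maps are out of sync: the stale path mapping has been dropped
		// above, and we bail out rather than dereference a nil entry.
		return false
	}

	// If the target was already known under a different inode, forget it.
	if old, exists := m.path2inode[target]; exists && old != inode {
		delete(m.inode2path, old)
	}
	m.path2inode[target] = inode
	entry.path = target
	return true
}

func main() {
	m := newInodeToPath()
	m.path2inode["/a"] = 1
	m.inode2path[1] = &inodeEntry{path: "/a"}
	m.path2inode["/stale"] = 2 // deliberately has no inode2path entry

	fmt.Println(m.movePath("/a", "/b"))     // consistent maps
	fmt.Println(m.movePath("/stale", "/c")) // inconsistent maps, no panic
}
```

Returning a boolean lets the caller log the inconsistency and carry on, so a single stale entry does not take down the whole mount process.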