cmolder opened this issue 3 years ago
The first thing to do is to attach gdb to both processes and get a backtrace plus a variable dump of the segfault on both sides. That will give you a much better indication of what's going on. You can also use it to falsify your hypothesis that the segfault occurs in acquire_lease.
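Something along these lines works for capturing that (a sketch: the exact PIDs come from the "Process pid is ..." lines each side prints at startup, and gdb must run with the same privileges as the server, i.e. under sudo here):

```gdb
# save as crash.gdb, then attach with:  sudo gdb -p <pid> -x crash.gdb
set pagination off
continue            # let the process run until SIGSEGV is delivered
backtrace full      # stack trace plus local variables in every frame
info registers      # register state at the faulting instruction
```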
On Wed, Nov 17, 2021 at 3:11 PM Carson Molder wrote:
We're trying to run Assise with leases enabled. We set the -DUSE_LEASE flags in the KernFS and LibFS Makefiles respectively, and set up our file system using emulated NVM as shown in the readme.
Unfortunately, it seems that a segmentation fault occurs in iotest when LibFS acquires a lease. This only happens when leases are active: when they are disabled, the program runs normally. Here is the output of KernFS and LibFS on the iotest from the readme:
KernFS:
```
sudo ./run.sh kernfs
initialize file system
dev-dax engine is initialized: dev_path /dev/dax0.0 size 4096 MB
Reading root inode with inum: 1
fetching node's IP address..
Process pid is 25584
ip address on interface 'lo' is 127.0.0.1
cluster settings:
--- node 0 - ip:127.0.0.1
MLFS cluster initialized
[Local-Server] Listening on port 12345 for connections. interrupt (^C) to exit.
Adding connection with sockfd: 0
Adding connection with sockfd: 1
Adding connection with sockfd: 2
RECV <-- MSG_INIT [pid 0]
RECV <-- MSG_INIT [pid 1]
RECV <-- MSG_INIT [pid 2]
[add_peer_socket():80] Peer connected (ip: 127.0.0.1, pid: 25597)
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:1 of type:1 and peer:0x7f8baa40f000
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:0 of type:0 and peer:0x7f8baa40f000
SEND --> MSG_SHM [paths: /shm_recv_0|/shm_send_0]
start shmem_poll_loop for sockfd 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:2 of type:2 and peer:0x7f8baa40f000
SEND --> MSG_SHM [paths: /shm_recv_2|/shm_send_2]
start shmem_poll_loop for sockfd 2
SEND --> MSG_SHM [paths: /shm_recv_1|/shm_send_1]
start shmem_poll_loop for sockfd 1
[discard_leases():933] discarding all leases for peer ID = 1
Connection terminated [sockfd:0]
Connection terminated [sockfd:1]
Exit server_thread
Exit server_thread
./run.sh: line 15: 25584 Segmentation fault      LD_LIBRARY_PATH=../build:../../libfs/lib/nvml/src/nondebug/ LD_PRELOAD=../../libfs/lib/jemalloc-4.5.0/lib/libjemalloc.so.2 MLFS_PROFILE=1 numactl -N0 -m0 $@
```
LibFS:

```
sudo ./run.sh iotest sw 2G 4K 1
[tid:25597][device_init():50] dev id 1
dev-dax engine is initialized: dev_path /dev/dax0.0 size 4096 MB
[tid:25597][device_init():50] dev id 2
[tid:25597][device_init():50] dev id 3
[tid:25597][cache_init():293] allocating 262144 blocks for DRAM read cache
[tid:25597][read_superblock():504] [dev 1] superblock: size 883712 nblocks 856598 ninodes 300000 inodestart 2 bmap start 27087 datablock_start 27114
fetching node's IP address..
Process pid is 25597
ip address on interface 'lo' is 127.0.0.1
cluster settings:
[tid:25597][register_peer_log():245] assigning peer (ip: 127.0.0.1 pid: 0) to log id 0
--- node 0 - ip:127.0.0.1
Connecting to KernFS instance 0 [ip: 127.0.0.1]
[Local-Client] Creating connection (pid:25597, app_type:0, status:pending) to 127.0.0.1:12345 on sockfd 0
In thread
[Local-Client] Creating connection (pid:25597, app_type:1, status:pending) to 127.0.0.1:12345 on sockfd 1
In thread
[Local-Client] Creating connection (pid:25597, app_type:2, status:pending) to 127.0.0.1:12345 on sockfd 2
[tid:25597][init_rpc():148] awaiting remote KernFS connections
In thread
SEND --> MSG_INIT [pid 0|25597]
SEND --> MSG_INIT [pid 2|25597]
SEND --> MSG_INIT [pid 1|25597]
RECV <-- MSG_SHM [paths: /shm_recv_1|/shm_send_1]
RECV <-- MSG_SHM [paths: /shm_recv_0|/shm_send_0]
[tid:25599][add_peer_socket():63] found socket 1
[tid:25599][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 0)
[tid:25599][_find_peer():206] peer[0]: ip 127.0.0.1 pid 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:1 of type:1 and peer:0x565200a47d30
start shmem_poll_loop for sockfd 1
RECV <-- MSG_SHM [paths: /shm_recv_2|/shm_send_2]
[tid:25598][add_peer_socket():63] found socket 0
[tid:25600][add_peer_socket():63] found socket 2
[tid:25598][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 1)
[tid:25598][_find_peer():191] sockfd[0]: ip 127.0.0.1 pid 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:0 of type:0 and peer:0x565200a47d30
start shmem_poll_loop for sockfd 0
[tid:25600][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 2)
[tid:25600][_find_peer():191] sockfd[0]: ip 127.0.0.1 pid 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:2 of type:2 and peer:0x565200a47d30
start shmem_poll_loop for sockfd 2
[tid:25597][rpc_bootstrap():909] peer send: |bootstrap |25597
[tid:25598][signal_callback():1357] received rpc with body: |bootstrap |1 on sockfd 0
[signal_callback():1370] Assigned LibFS ID=1
MLFS cluster initialized
[tid:25597][init_log():148] end of the log e7c00
init log dev 1 start_blk 916481 end 949248
[tid:25597][ialloc():647] get inode - inum 1
[tid:25597][init_fs():459] LibFS is initialized on dev 1
Total file size: 2147483648B io size: 4096B
# of thread: 1
[tid:25597][mlfs_posix_mkdir():419] [POSIX] mkdir(/mlfs/)
[tid:25597][namex():274] namex: path /mlfs/, parent 1, name mlfs
[acquire_lease():430] LIBFS ID= 1 trying to acquire lease of type 2 for inum 1
./run.sh: line 5: 25597 Segmentation fault      LD_PRELOAD=../build/libmlfs.so MLFS_PROFILE=1 ${@}
```
My best guess is that LibFS segfaults when it tries to access the lease cache in acquire_lease() (https://github.com/ut-osa/assise/blob/master/libfs/src/experimental/leases.c#L432). The function calls lcache_find() (https://github.com/ut-osa/assise/blob/master/libfs/src/experimental/leases.c#L114), which looks the lease up in a hash table. I have a feeling there is something incorrect about our setup (strata_access@kitten3-lom on the UT cluster), but it might not be, since the code works fine with leases off. Do you know where we could start to debug this? Thank you!
We're having trouble getting meaningful results from gdb. We've recompiled both sides with the -g flag, and are attaching gdb by pid after starting the process. However, our backtraces look something like this:
```
Thread 1 "iotest" received signal SIGSEGV, Segmentation fault.
0x00007f7844fd662c in ?? ()
(gdb) backtrace
#0  0x00007f7844fd662c in ?? ()
#1  0x00007fff75f01904 in ?? ()
#2  0x00007fff75f01a40 in ?? ()
#3  0x5366922b64c30614 in ?? ()
#4  0x0000000000000000 in ?? ()
```
Is there some other way that we need to import symbols or run gdb? I have not been able to set breakpoints by name either.
In the meantime, we tried some "printf debugging" to narrow down where the segfault is coming from, and it appears to be here:
https://github.com/ut-osa/assise/blob/master/libfs/src/experimental/leases.c#L120
Specifically, the hash_handle is not defined.
Waleed (cc'ed) might be able to help here.
I imagine lcache_find will return NULL, and instead of segfaulting, LibFS should at least assert that the return value is non-NULL. The next step would be to figure out why the return value is NULL and what should be done in that case.
-- Simon
On Thu, Nov 18, 2021 at 3:24 PM Jacqueline Deans wrote:
I believe I found the issue. There was a bug that resulted in the lease table not being properly initialized. I just pushed a fix for it. Please pull and let me know if you run into any other problems.
Your fix works great. Thank you!