Segmentation fault when leases are enabled

cmolder commented 3 years ago

We're trying to run Assise with leases enabled. We set the -DUSE_LEASE flags in the KernFS and LibFS Makefiles respectively, and set up our file system using emulated NVM as shown in the readme.

Unfortunately, it seems that there is a segmentation fault occuring in iotest when LibFS acquires a lease. This only occurs when leases are active: when they're disabled, the program runs normally. Here is the output of the KernFS and LibFS on the iotest from the readme:

KernFS:

sudo ./run.sh kernfs
initialize file system
dev-dax engine is initialized: dev_path /dev/dax0.0 size 4096 MB
Reading root inode with inum: 1fetching node's IP address..
Process pid is 25584
ip address on interface 'lo' is 127.0.0.1
cluster settings:
--- node 0 - ip:127.0.0.1
MLFS cluster initialized
[Local-Server] Listening on port 12345 for connections. interrupt (^C) to exit.
Adding connection with sockfd: 0
Adding connection with sockfd: 1
Adding connection with sockfd: 2
RECV <-- MSG_INIT [pid 0]
RECV <-- MSG_INIT [pid 1]
RECV <-- MSG_INIT [pid 2]
[add_peer_socket():80] Peer connected (ip: 127.0.0.1, pid: 25597)
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:1 of type:1 and peer:0x7f8baa40f000
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:0 of type:0 and peer:0x7f8baa40f000
SEND --> MSG_SHM [paths: /shm_recv_0|/shm_send_0]
start shmem_poll_loop for sockfd 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:2 of type:2 and peer:0x7f8baa40f000
SEND --> MSG_SHM [paths: /shm_recv_2|/shm_send_2]
start shmem_poll_loop for sockfd 2
SEND --> MSG_SHM [paths: /shm_recv_1|/shm_send_1]
start shmem_poll_loop for sockfd 1
[discard_leases():933] discarding all leases for peer ID = 1
Connection terminated [sockfd:0]
Connection terminated [sockfd:1]
Exit server_thread 
Exit server_thread 
./run.sh: line 15: 25584 Segmentation fault      LD_LIBRARY_PATH=../build:../../libfs/lib/nvml/src/nondebug/ LD_PRELOAD=../../libfs/lib/jemalloc-4.5.0/lib/libjemalloc.so.2 MLFS_PROFILE=1 numactl -N0 -m0 $@

LibFS

sudo ./run.sh iotest sw 2G 4K 1
[tid:25597][device_init():50] dev id 1
dev-dax engine is initialized: dev_path /dev/dax0.0 size 4096 MB
[tid:25597][device_init():50] dev id 2
[tid:25597][device_init():50] dev id 3
[tid:25597][cache_init():293] allocating 262144 blocks for DRAM read cache
[tid:25597][read_superblock():504] [dev 1] superblock: size 883712 nblocks 856598 ninodes 300000 inodestart 2 bmap start 27087 datablock_start 27114
fetching node's IP address..
Process pid is 25597
ip address on interface 'lo' is 127.0.0.1
cluster settings:
[tid:25597][register_peer_log():245] assigning peer (ip: 127.0.0.1 pid: 0) to log id 0
--- node 0 - ip:127.0.0.1
Connecting to KernFS instance 0 [ip: 127.0.0.1]
[Local-Client] Creating connection (pid:25597, app_type:0, status:pending) to 127.0.0.1:12345 on sockfd 0
In thread
[Local-Client] Creating connection (pid:25597, app_type:1, status:pending) to 127.0.0.1:12345 on sockfd 1
In thread
[Local-Client] Creating connection (pid:25597, app_type:2, status:pending) to 127.0.0.1:12345 on sockfd 2
[tid:25597][init_rpc():148] awaiting remote KernFS connections
In thread
SEND --> MSG_INIT [pid 0|25597]
SEND --> MSG_INIT [pid 2|25597]
SEND --> MSG_INIT [pid 1|25597]
RECV <-- MSG_SHM [paths: /shm_recv_1|/shm_send_1]
RECV <-- MSG_SHM [paths: /shm_recv_0|/shm_send_0]
[tid:25599][add_peer_socket():63] found socket 1
[tid:25599][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 0)
[tid:25599][_find_peer():206] peer[0]: ip 127.0.0.1 pid 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:1 of type:1 and peer:0x565200a47d30
start shmem_poll_loop for sockfd 1
RECV <-- MSG_SHM [paths: /shm_recv_2|/shm_send_2]
[tid:25598][add_peer_socket():63] found socket 0
[tid:25600][add_peer_socket():63] found socket 2
[tid:25598][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 1)
[tid:25598][_find_peer():191] sockfd[0]: ip 127.0.0.1 pid 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:0 of type:0 and peer:0x565200a47d30
start shmem_poll_loop for sockfd 0
[tid:25600][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 2)
[tid:25600][_find_peer():191] sockfd[0]: ip 127.0.0.1 pid 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:2 of type:2 and peer:0x565200a47d30
start shmem_poll_loop for sockfd 2
[tid:25597][rpc_bootstrap():909] peer send: |bootstrap |25597
[tid:25598][signal_callback():1357] received rpc with body: |bootstrap |1 on sockfd 0
[signal_callback():1370] Assigned LibFS ID=1
MLFS cluster initialized
[tid:25597][init_log():148] end of the log e7c00
init log dev 1 start_blk 916481 end 949248
[tid:25597][ialloc():647] get inode - inum 1
[tid:25597][init_fs():459] LibFS is initialized on dev 1
Total file size: 2147483648B
io size: 4096B
# of thread: 1
[tid:25597][mlfs_posix_mkdir():419] [POSIX] mkdir(/mlfs/)
[tid:25597][namex():274] namex: path /mlfs/, parent 1, name mlfs
[acquire_lease():430] LIBFS ID= 1 trying to acquire lease of type 2 for inum 1
./run.sh: line 5: 25597 Segmentation fault      LD_PRELOAD=../build/libmlfs.so MLFS_PROFILE=1 ${@}

My best guess is that LibFS is having a segfault when it tries to access the lease cache in acquire_lease(). The function calls lcache_find() and looks inside a hash table to find the lease. I have a feeling there is something incorrect about our setup (strata_access@kitten3-lom on the UT cluster), but it might not be since the code works fine with leases off. Do you know where we could start to debug this? Thank you!

simpeter commented 3 years ago

First thing to do is to attach gdb to both processes and get a backtrace plus variable dump of the segfault on both sides. That will give you a lot more indication as to what's going on. You can use it to falsify your hypothesis that the segfault occurs in acquire_lease.

On Wed, Nov 17, 2021 at 3:11 PM Carson Molder @.***> wrote:

We're trying to run Assise with leases enabled. We set the -DUSE_LEASE flags in the KernFS and LibFS Makefiles respectively, and set up our file system using emulated NVM as shown in the readme.

Unfortunately, it seems that there is a segmentation fault occuring in iotest when LibFS acquires a lease. This only occurs when leases are active: when they're disabled, the program runs normally. Here is the output of the KernFS and LibFS on the iotest from the readme:

KernFS:

sudo ./run.sh kernfs initialize file system dev-dax engine is initialized: dev_path /dev/dax0.0 size 4096 MB Reading root inode with inum: 1fetching node's IP address.. Process pid is 25584 ip address on interface 'lo' is 127.0.0.1 cluster settings: --- node 0 - ip:127.0.0.1 MLFS cluster initialized [Local-Server] Listening on port 12345 for connections. interrupt (^C) to exit. Adding connection with sockfd: 0 Adding connection with sockfd: 1 Adding connection with sockfd: 2 RECV <-- MSG_INIT [pid 0] RECV <-- MSG_INIT [pid 1] RECV <-- MSG_INIT [pid 2] [add_peer_socket():80] Peer connected (ip: 127.0.0.1, pid: 25597) [add_peer_socket():97] Established connection with 127.0.0.1 on sock:1 of type:1 and peer:0x7f8baa40f000 [add_peer_socket():97] Established connection with 127.0.0.1 on sock:0 of type:0 and peer:0x7f8baa40f000 SEND --> MSG_SHM [paths: /shm_recv_0|/shm_send_0] start shmem_poll_loop for sockfd 0 [add_peer_socket():97] Established connection with 127.0.0.1 on sock:2 of type:2 and peer:0x7f8baa40f000 SEND --> MSG_SHM [paths: /shm_recv_2|/shm_send_2] start shmem_poll_loop for sockfd 2 SEND --> MSG_SHM [paths: /shm_recv_1|/shm_send_1] start shmem_poll_loop for sockfd 1 [discard_leases():933] discarding all leases for peer ID = 1 Connection terminated [sockfd:0] Connection terminated [sockfd:1] Exit server_thread Exit server_thread ./run.sh: line 15: 25584 Segmentation fault LD_LIBRARY_PATH=../build:../../libfs/lib/nvml/src/nondebug/ LD_PRELOAD=../../libfs/lib/jemalloc-4.5.0/lib/libjemalloc.so.2 MLFS_PROFILE=1 numactl -N0 -m0 $@

LibFS

sudo ./run.sh iotest sw 2G 4K 1 [tid:25597][device_init():50] dev id 1 dev-dax engine is initialized: dev_path /dev/dax0.0 size 4096 MB [tid:25597][device_init():50] dev id 2 [tid:25597][device_init():50] dev id 3 [tid:25597][cache_init():293] allocating 262144 blocks for DRAM read cache [tid:25597][read_superblock():504] [dev 1] superblock: size 883712 nblocks 856598 ninodes 300000 inodestart 2 bmap start 27087 datablock_start 27114 fetching node's IP address.. Process pid is 25597 ip address on interface 'lo' is 127.0.0.1 cluster settings: [tid:25597][register_peer_log():245] assigning peer (ip: 127.0.0.1 pid: 0) to log id 0 --- node 0 - ip:127.0.0.1 Connecting to KernFS instance 0 [ip: 127.0.0.1] [Local-Client] Creating connection (pid:25597, app_type:0, status:pending) to 127.0.0.1:12345 on sockfd 0 In thread [Local-Client] Creating connection (pid:25597, app_type:1, status:pending) to 127.0.0.1:12345 on sockfd 1 In thread [Local-Client] Creating connection (pid:25597, app_type:2, status:pending) to 127.0.0.1:12345 on sockfd 2 [tid:25597][init_rpc():148] awaiting remote KernFS connections In thread SEND --> MSG_INIT [pid 0|25597] SEND --> MSG_INIT [pid 2|25597] SEND --> MSG_INIT [pid 1|25597] RECV <-- MSG_SHM [paths: /shm_recv_1|/shm_send_1] RECV <-- MSG_SHM [paths: /shm_recv_0|/shm_send_0] [tid:25599][add_peer_socket():63] found socket 1 [tid:25599][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 0) [tid:25599][_find_peer():206] peer[0]: ip 127.0.0.1 pid 0 [add_peer_socket():97] Established connection with 127.0.0.1 on sock:1 of type:1 and peer:0x565200a47d30 start shmem_poll_loop for sockfd 1 RECV <-- MSG_SHM [paths: /shm_recv_2|/shm_send_2] [tid:25598][add_peer_socket():63] found socket 0 [tid:25600][add_peer_socket():63] found socket 2 [tid:25598][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 1) [tid:25598][_find_peer():191] sockfd[0]: ip 127.0.0.1 pid 0 [add_peer_socket():97] Established connection with 127.0.0.1 on sock:0 of type:0 and peer:0x565200a47d30 start shmem_poll_loop for sockfd 0 [tid:25600][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 2) [tid:25600][_find_peer():191] sockfd[0]: ip 127.0.0.1 pid 0 [add_peer_socket():97] Established connection with 127.0.0.1 on sock:2 of type:2 and peer:0x565200a47d30 start shmem_poll_loop for sockfd 2 [tid:25597][rpc_bootstrap():909] peer send: |bootstrap |25597 [tid:25598][signal_callback():1357] received rpc with body: |bootstrap |1 on sockfd 0 [signal_callback():1370] Assigned LibFS ID=1 MLFS cluster initialized [tid:25597][init_log():148] end of the log e7c00 init log dev 1 start_blk 916481 end 949248 [tid:25597][ialloc():647] get inode - inum 1 [tid:25597][init_fs():459] LibFS is initialized on dev 1 Total file size: 2147483648B io size: 4096B

of thread: 1

[tid:25597][mlfs_posix_mkdir():419] [POSIX] mkdir(/mlfs/) [tid:25597][namex():274] namex: path /mlfs/, parent 1, name mlfs [acquire_lease():430] LIBFS ID= 1 trying to acquire lease of type 2 for inum 1 ./run.sh: line 5: 25597 Segmentation fault LD_PRELOAD=../build/libmlfs.so MLFS_PROFILE=1 ${@}

My best guess is that LibFS is having a segfault when it tries to access the lease cache in acquire_lease() https://github.com/ut-osa/assise/blob/master/libfs/src/experimental/leases.c#L432. The function calls lcache_find() https://github.com/ut-osa/assise/blob/master/libfs/src/experimental/leases.c#L114 and looks inside a hash table to find the lease. I have a feeling there is something incorrect about our setup @.*** on the UT cluster), but it might not be since the code works fine with leases off. Do you know where we could start to debug this? Thank you!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ut-osa/assise/issues/18, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABHQBMT7XMTQKQMHZCKUTI3UMQK6ZANCNFSM5IH767QQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

jdeans289 commented 3 years ago

We're having trouble getting meaningful results from gdb - we've recompiled both sides with the -g flag, and are attaching gdb by pid after starting the process. However, our backtraces look something like

Thread 1 "iotest" received signal SIGSEGV, Segmentation fault.
0x00007f7844fd662c in ?? ()
(gdb) backtrace
#0  0x00007f7844fd662c in ?? ()
#1  0x00007fff75f01904 in ?? ()
#2  0x00007fff75f01a40 in ?? ()
#3  0x5366922b64c30614 in ?? ()
#4  0x0000000000000000 in ?? ()

Is there some other way that we need to import symbols or run gdb? I have not been able to set breakpoints by name either.

In the meantime, we tried some "printf debugging" to narrow down where segfault is coming from, and it appears to be coming from:

https://github.com/ut-osa/assise/blob/master/libfs/src/experimental/leases.c#L120

Specifically that the hash_handle is not defined.

simpeter commented 3 years ago

Waleed (cc'ed) might be able to help here.

I imagine lcache_find will return NULL and instead of segfaulting, libfs should at least assert that the return value is non-NULL. The next step would be to figure out why the return value is NULL and what should be done in that case.

-- Simon

On Thu, Nov 18, 2021 at 3:24 PM Jacqueline Deans @.***> wrote:

We're having trouble getting meaningful results from gdb - we've recompiled both sides with the -g flag, and are attaching gdb by pid after starting the process. However, our backtraces look something like

Thread 1 "iotest" received signal SIGSEGV, Segmentation fault. 0x00007f7844fd662c in ?? () (gdb) backtrace

0 0x00007f7844fd662c in ?? ()

1 0x00007fff75f01904 in ?? ()

2 0x00007fff75f01a40 in ?? ()

3 0x5366922b64c30614 in ?? ()

4 0x0000000000000000 in ?? ()

Is there some other way that we need to import symbols or run gdb? I have not been able to set breakpoints by name either.

In the meantime, we tried some "printf debugging" to narrow down where segfault is coming from, and it appears to be coming from:

https://github.com/ut-osa/assise/blob/master/libfs/src/experimental/leases.c#L120

Specifically that the hash_handle is not defined.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ut-osa/assise/issues/18#issuecomment-973282870, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABHQBMQZSE63MMIVU765YBTUMVVGFANCNFSM5IH767QQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

wreda commented 2 years ago

I believe I found the issue. There was a bug that resulted in the lease table not being properly initialized. I just pushed a fix for it. Please pull and let me know if you run into any other problems.

cmolder commented 2 years ago

Your fix works great. Thank you!

ut-osa / assise

Segmentation fault when leases are enabled #18

of thread: 1

0 0x00007f7844fd662c in ?? ()

1 0x00007fff75f01904 in ?? ()

2 0x00007fff75f01a40 in ?? ()

3 0x5366922b64c30614 in ?? ()

4 0x0000000000000000 in ?? ()