ut-osa / assise

GNU General Public License v2.0

Setting up Cluster with Multiple Nodes - Segmentation Fault #25

Open agnesnatasya opened 2 years ago

agnesnatasya commented 2 years ago

Hi,

Setup

I am trying to set up a simple cluster with 2 nodes. These are the network interfaces of each node:

  1. Node 1

     eno33: 128.110.219.19, enp65s0f0: 10.10.1.2

  2. Node 2

     eno33: 128.110.219.27, enp65s0f0: 10.10.1.3

In each of these nodes, I set g_n_hot_rep to 2 and the RPC interface to

static struct peer_id hot_replicas[g_n_hot_rep] = {                                                         
 { .ip = "10.10.1.2", .role = HOT_REPLICA, .type = KERNFS_PEER},                                   
 { .ip = "10.10.1.3", .role = HOT_REPLICA, .type = KERNFS_PEER},
};

I run KernFS starting from the node that has 10.10.1.3 as its interface.
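For completeness, I launch KernFS on each node roughly like this (assuming the stock run.sh wrapper under kernfs/tests; the exact paths may differ in your checkout):

cd kernfs/tests
sudo ./run.sh kernfs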

Result

I received a segmentation fault

initialize file system
dev-dax engine is initialized: dev_path /dev/dax0.0 size 8192 MB
Reading root inode with inum: 1fetching node's IP address..
Process pid is 4013
ip address on interface 'ib0' is 10.10.1.2
cluster settings:
--- node 0 - ip:10.10.1.2
--- node 1 - ip:10.10.1.3
Connecting to KernFS instance 1 [ip: 10.10.1.3]
./run.sh: line 15:  4013 Segmentation fault      LD_LIBRARY_PATH=../build:../../libfs/lib/nvml/src/nondebug/ LD_PRELOAD=../../libfs/lib/jemalloc-4.5.0/lib/libjemalloc.so.2 MLFS_PROFILE=1 numactl -N0 -m0 $@

Debugging

After debugging, it looks like the segmentation fault comes from libfs/lib/rdma/agent.c at line 96 and line 130; the rdma_cm_id struct returned by rdma_create_id is NULL. I also ran the filesystem as a local file system, with g_n_hot_rep = 1 and the RPC interface set to localhost, and it works.

Do you mind helping me with this problem? Thank you very much!

wreda commented 2 years ago

I think this is likely due to Assise not finding the proper interface. Can you change rdma_intf at rpc_interface.h#L24 to your RDMA network interface name and rebuild? I presume in your case that should be enp65s0f0.
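For example, assuming rdma_intf is defined as a plain interface-name string in rpc_interface.h (the exact declaration may differ), the change would look roughly like:

static char *rdma_intf = "enp65s0f0";   /* was "ib0"; use your RDMA-capable interface */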

agnesnatasya commented 2 years ago

Hi Waleed,

Result

Thank you very much for your help. I set rdma_intf = enp65s0f0 on both nodes, and I also changed ib0 to enp65s0f0 in utils/rdma_setup.sh, but it still segfaults.
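(The rdma_setup.sh edit was just a search-and-replace of the interface name, roughly equivalent to:)

sed -i 's/ib0/enp65s0f0/g' utils/rdma_setup.sh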

The error message is a little bit different. On the 10.10.1.3 node, it says:

initialize file system
dev-dax engine is initialized: dev_path /dev/dax0.0 size 8192 MB
Reading root inode with inum: 1fetching node's IP address..
Process pid is 19046
ip address on interface 'enp65s0f0' is 10.10.1.3
cluster settings:
--- node 0 - ip:10.10.1.2
--- node 1 - ip:10.10.1.3
./run.sh: line 15: 19046 Segmentation fault      LD_LIBRARY_PATH=../build:../../libfs/lib/nvml/src/nondebug/ LD_PRELOAD=../../libfs/lib/jemalloc-4.5.0/lib/libjemalloc.so.2 MLFS_PROFILE=1 numactl -N0 -m0 $@

On the 10.10.1.2 node, it says:

initialize file system
dev-dax engine is initialized: dev_path /dev/dax0.0 size 8192 MB
Reading root inode with inum: 1fetching node's IP address..
Process pid is 9886
ip address on interface 'enp65s0f0' is 10.10.1.2
cluster settings:
--- node 0 - ip:10.10.1.2
--- node 1 - ip:10.10.1.3
Connecting to KernFS instance 1 [ip: 10.10.1.3]
./run.sh: line 15:  9886 Segmentation fault      LD_LIBRARY_PATH=../build:../../libfs/lib/nvml/src/nondebug/ LD_PRELOAD=../../libfs/lib/jemalloc-4.5.0/lib/libjemalloc.so.2 MLFS_PROFILE=1 numactl -N0 -m0 $@

The only difference is the additional line Connecting to KernFS instance 1 [ip: 10.10.1.3].

Debugging

Through GDB, it also looks like the rdma_cm_id struct is still NULL when rdma_bind_addr or rdma_resolve_addr is called. The values of the other visible variables are as follows:

add_connection (ip=0x7ffff5335124 "10.10.1.3", port=0x7ffff521f010 "12345", app_type=0, pid=0, ch_type=<optimized out>, polling_loop=1)

addr = {sin6_family = 10, sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}

Do you happen to know the cause of this problem? Does it have something to do with connecting to the port on the other node? I have allowed port 12345 on both nodes.

Thank you very much for your help!

wreda commented 2 years ago

Thanks for the debugging effort! I suspect this is likely a firewall issue.

To test connectivity, you can try running the RPC application in lib/rdma/tests/ and see if it also produces an error. You can use the following commands: ./rpc_client <ip> <port> <iters> and ./rpc_server <port>. I've added additional checks to libfs/lib/rdma/agent.c to avoid segfaults; the error codes might help indicate the issue.
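For example, with the addresses and port from your setup (the iteration count below is just an arbitrary example value):

./rpc_server 12345                  # on the 10.10.1.3 node
./rpc_client 10.10.1.3 12345 100    # on the 10.10.1.2 node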

agnesnatasya commented 2 years ago

Hi Waleed,

Thank you very much for the checks in libfs/lib/rdma/agent.c! After running the new version, I received error code 19 (ENODEV); it looks like Assise is unable to find the device.

Debugging

Here are some of my debugging efforts:

  1. I traced again using GDB and found that the rdma_event_channel ec is NULL when rdma_create_id() is called, which I suspect might be the reason why rdma_create_id() fails. After that call, the returned result is -1, the error code is 19, and rdma_cm_id = NULL.

    • I tried to change libfs/lib/rdma-core/librdmacm/cma.c's rdma_create_event_channel() function
      • I changed the device name from /dev/infiniband/rdma_cm to /dev/dax0.0 (the name of the DAX device on my machine)
      • I added some print statements, but nothing is printed. I think some of the binaries might not be removed during make clean and hence not rebuilt during cd deps; ./install_deps.sh; cd ... However, I did check libfs/lib/rdma-core/build and it is properly rebuilt, so I'm not sure why my newest change to the code does not show up.
    • I am also a little unsure about the LD_PRELOAD variable. Is it supposed to be LD_PRELOAD=../../libfs/lib/jemalloc-4.5.0/lib/libjemalloc.so.2 or LD_PRELOAD="../../libfs/lib/jemalloc-4.5.0/lib/libjemalloc.so.2 ../../libfs/build/libmlfs.so"?
  2. I also considered another potential point of failure: sockaddr_in6 addr is an IPv6 socket address, while the IPs I provide in rpc_interface.h are IPv4. However, I don't think this is what causes rdma_create_id() to fail, because that function does not use the addr variable.

Changes

Regarding your previous suggestion about the firewall, it was a great suggestion, thank you! I realised the firewall was enabled on a different network interface. I've allowed incoming and outgoing traffic on port 12345 for both nodes on the network interface used by RDMA (enp65s0f0 in my case), but I still receive the above error.
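(Concretely, what I did amounts to roughly the following, assuming an iptables-based firewall; the exact commands depend on the firewall tool in use:)

sudo iptables -A INPUT -i enp65s0f0 -p tcp --dport 12345 -j ACCEPT
sudo iptables -A OUTPUT -o enp65s0f0 -p tcp --sport 12345 -j ACCEPT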

Further information

I am also using NVM emulation instead of actual NVM.

Do you have any idea regarding the above error? Thank you very much for your help!

wreda commented 2 years ago

I assume you weren't able to run the RPC test. If so, then the error is not Assise-related. The LD_PRELOAD or use of emulated NVM shouldn't be a factor here.

I haven't encountered this particular error myself but, if I had to guess, it could simply be a driver issue. It might make sense to first check whether the MLNX_OFED drivers are properly installed and that the required modules are loaded in your kernel (e.g. libmlx5, libmlx4). That could be the culprit. If that doesn't help, you can try posting this on the Mellanox community forums.
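For example, a quick sanity check would be something along these lines (assuming the rdma-core/OFED user-space tools are installed):

ibv_devices                    # should list the Mellanox HCA
ibv_devinfo                    # port state should be PORT_ACTIVE
lsmod | grep -E 'mlx4|mlx5'    # RDMA kernel modules should be loaded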

agnesnatasya commented 2 years ago

Hi Waleed,

Thank you very much, that was indeed the error. I did not have RDMA set up yet; I was not aware of it during the setup. Do you mind if I add a sentence or two mentioning that a properly configured RDMA device and interfaces are a prerequisite?

wreda commented 2 years ago

Thanks for confirming.

Do you mind if I add a sentence or two mentioning that a properly configured RDMA device and interfaces are a prerequisite?

Absolutely! The README can definitely benefit from this. Feel free to do a pull request and I'll merge.

agnesnatasya commented 2 years ago

Thank you Waleed for that!

Do you mind if I clarify some things with regard to Assise, to help me write proper additional setup instructions?

  1. I assume that the KernFS in this repository is equivalent to the SharedFS in the original paper. Is this correct?
  2. I am a little bit confused about why there isn't a cluster manager in this GitHub setup. Is it because this prototype only supports hot replicas, and every node defined in rpc_interface.h's hot_replicas[] is a hot replica, hence there is no need to set up a separate cluster manager?
  3. Are all nodes part of all the other nodes' replication chains in the general workload setup, or is this supposed to be determined by the cluster manager's policy? If my assumption in question 2 is correct, are all nodes in hot_replicas[] part of all the other nodes' replication chains, since there is no cluster manager?

Thank you very much Waleed for your kind help in clarifying about this!

wreda commented 2 years ago

Sorry for the delayed reply! Last few weeks were hectic.

  1. I assume that the KernFS in this repository is equivalent to the SharedFS in the original paper. Is this correct?

Yes, that's correct.

  2. I am a little bit confused about why there isn't a cluster manager in this GitHub setup. Is it because this prototype only supports hot replicas, and every node defined in rpc_interface.h's hot_replicas[] is a hot replica, hence there is no need to set up a separate cluster manager?

Our prototype currently doesn't come with an interface to the cluster manager (ZooKeeper). Only hot replicas, as you noted, are supported as of now.

  3. Are all nodes part of all the other nodes' replication chains in the general workload setup, or is this supposed to be determined by the cluster manager's policy? If my assumption in question 2 is correct, are all nodes in hot_replicas[] part of all the other nodes' replication chains, since there is no cluster manager?

Correct, all nodes defined in hot_replicas are part of the same replica group.

agnesnatasya commented 2 years ago

Thanks a lot Waleed for the clarification!

caposerenity commented 2 years ago

Hi Waleed,

Thank you very much, that was indeed the error. I did not have RDMA set up yet; I was not aware of it during the setup. Do you mind if I add a sentence or two mentioning that a properly configured RDMA device and interfaces are a prerequisite?

@agnesnatasya Hi, I ran into the same segmentation fault, and I found that it seems to be caused by rdma_cm_id = NULL. Could you please share more details about how you set up RDMA? Thanks a lot~

agnesnatasya commented 2 years ago

Hi @caposerenity! Sure! In my case, I have a lab cluster with Mellanox adapters and the InfiniBand drivers installed, and I use that to establish the RDMA connection between the nodes. If you have machines with a Mellanox adapter but without the drivers, you can try installing the drivers by following an online guide for your device version; one of the documents is here: https://network.nvidia.com/related-docs/prod_software/Mellanox_IB_OFED_Driver_for_VMware_vSphere_User_Manual_Rev_1_8_1.pdf, but you can also find more casual tutorials online. If you do not have machines with a Mellanox adapter, I am not sure there is a workaround. You can definitely run single-node Assise, which is similar to Strata (a local filesystem).
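Once the drivers are installed, a simple way to check that RDMA works between the two nodes (assuming the rdma-core utilities, e.g. rping, are available) is roughly:

rping -s -a 10.10.1.3 -v         # server, on the 10.10.1.3 node
rping -c -a 10.10.1.3 -v -C 10   # client, on the 10.10.1.2 node

If rping succeeds, the basic RDMA path between the nodes is functional.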