v6d-io / v6d

vineyard (v6d): an in-memory immutable data manager. (Project under CNCF, TAG-Storage)
https://v6d.io
Apache License 2.0

RDMA net device is present in the container, but "Init RDMA failed!Create rdma server failed!" Why? #2004

Open hsh258 opened 1 month ago

hsh258 commented 1 month ago

rpc_server.cc:112] Init RDMA failed!Create rdma server failed!

dashanji commented 1 month ago

Hi @hsh258, could you please use something like ib_write_bw or libfabric to check whether the RDMA device works?

hsh258 commented 1 month ago

Hi, here are some details. Scenario: inside a container, fi_getinfo returns -FI_ENODATA.

find / -name 'librdmacm*' 2>/dev/null
/var/lib/dpkg/info/librdmacm1:amd64.shlibs
/var/lib/dpkg/info/librdmacm1:amd64.triggers
/var/lib/dpkg/info/librdmacm1:amd64.symbols
/var/lib/dpkg/info/librdmacm1:amd64.md5sums
/var/lib/dpkg/info/librdmacm1:amd64.list
/var/cache/apt/archives/librdmacm1_28.0-1ubuntu1_amd64.deb
/usr/lib/x86_64-linux-gnu/librdmacm.so
/usr/lib/x86_64-linux-gnu/librdmacm.so.1.2.28.0
/usr/lib/x86_64-linux-gnu/librdmacm.so.1
/usr/share/doc/librdmacm1

find / -name 'libibverbs*' 2>/dev/null
/etc/libibverbs.d
/var/lib/dpkg/info/libibverbs1:amd64.md5sums
/var/lib/dpkg/info/libibverbs1:amd64.shlibs
/var/lib/dpkg/info/libibverbs-dev:amd64.list
/var/lib/dpkg/info/libibverbs1:amd64.list
/var/lib/dpkg/info/libibverbs-dev:amd64.md5sums
/var/lib/dpkg/info/libibverbs1:amd64.postinst
/var/lib/dpkg/info/libibverbs1:amd64.symbols
/var/lib/dpkg/info/libibverbs1:amd64.triggers
/var/cache/apt/archives/libibverbs1_28.0-1ubuntu1_amd64.deb
/var/cache/apt/archives/libibverbs-dev_28.0-1ubuntu1_amd64.deb
/usr/lib/x86_64-linux-gnu/pkgconfig/libibverbs.pc
/usr/lib/x86_64-linux-gnu/libibverbs.so
/usr/lib/x86_64-linux-gnu/libibverbs.a
/usr/lib/x86_64-linux-gnu/libibverbs
/usr/lib/x86_64-linux-gnu/libibverbs.so.1.8.28.0
/usr/lib/x86_64-linux-gnu/libibverbs.so.1
/usr/share/doc/libibverbs1
/usr/share/doc/libibverbs-dev

apt-cache search libfabric
libfabric-bin - Diagnosis programs for the libfabric communication library
libfabric-dev - Development files for libfabric1
libfabric1 - libfabric communication library

dpkg -l | grep libfabric
ii libfabric1 1.6.2-3ubuntu0.1 amd64 libfabric communication library

However, the container has no fi_info tool, so I can't run "fi_info -p verbs".

As for "whether the rdma dev can work": yes, the RDMA device definitely works.

fi_info
/usr/local/bin/.libs/fi_info: /lib/x86_64-linux-gnu/libfabric.so.1: version `FABRIC_1.4' not found (required by /usr/local/bin/.libs/fi_info)
/usr/local/bin/.libs/fi_info: /lib/x86_64-linux-gnu/libfabric.so.1: version `FABRIC_1.7' not found (required by /usr/local/bin/.libs/fi_info)
root@d0cf4f0fd8bb:/usr/local/bin# find / -name 'libfabric.so' 2>/dev/null
/usr/lib/x86_64-linux-gnu/libfabric.so.1
/usr/lib/x86_64-linux-gnu/libfabric.so.1.9.15
root@d0cf4f0fd8bb:/usr/local/bin# ls -l /usr/lib/x86_64-linux-gnu/libfabric.so.1
lrwxrwxrwx 1 root root 19 Nov 30 2022 /usr/lib/x86_64-linux-gnu/libfabric.so.1 -> libfabric.so.1.9.15

vegetableysm commented 1 month ago

Hi! Could you give me more details? For example, specific error messages like this: [image]

And could you share the command you use to run vineyardd? Thanks.

By the way, you can install fabtests to run fi_info. Please make sure the fabtests version is compatible with your libfabric.

hsh258 commented 1 month ago

Hi, here is the command:

./vineyardd --rdma_endpoint fd00:80:2200:3205::1207:b02

fi_info -p verbs
fi_getinfo: -61 (No data available)

lrwxrwxrwx. 1 root root 19 Oct 26 09:17 libfabric.so -> libfabric.so.1.24.0
lrwxrwxrwx. 1 root root 19 Oct 26 09:17 libfabric.so.1 -> libfabric.so.1.24.0
-rwxr-xr-x. 1 root root 1187520 Oct 26 09:17 libfabric.so.1.24.0

vegetableysm commented 1 month ago

Is "fd00:80:2200:3205::1207:b02" an IPv6 address? Currently vineyard does not support IPv6 address resolution, so please try again with an IPv4 address. Additionally, RDMA devices require root privileges. Are you running this as root?

By the way, the RDMA module of vineyard is based on libfabric, so if the libfabric tool "fi_info" can't see the RDMA device, vineyard can't see it either.

vegetableysm commented 1 month ago

In addition, the "--rdma_endpoint" parameter must include a port along with the address, for example: ./vineyardd --rdma_endpoint=ipv4_addr:port

hsh258 commented 1 month ago

Hi, it is an IPv6 address, and I am logged in as root. Detailed error info:

libfabric:2795:1730110628::core:core:fi_paramget():372 variable perf_cntr=
libfabric:2795:1730110628::core:core:fi_paramget():372 variable hook=
libfabric:2795:1730110628::core:core:fi_paramget():372 variable hmem=
libfabric:2795:1730110628::core:core:ofi_hmem_init():658 Hmem iface FI_HMEM_CUDA not supported
libfabric:2795:1730110628::core:core:ofi_hmem_init():658 Hmem iface FI_HMEM_ROCR not supported
libfabric:2795:1730110628::core:core:ofi_hmem_init():658 Hmem iface FI_HMEM_ZE not supported
libfabric:2795:1730110628::core:core:ofi_hmem_init():658 Hmem iface FI_HMEM_NEURON not supported
libfabric:2795:1730110628::core:core:ofi_hmem_init():658 Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:2795:1730110628::core:core:fi_paramget():372 variable hmem_disable_p2p=
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor uffd
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor memhooks
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor cuda
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor cuda_ipc
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor rocr
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor rocr_ipc
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor xpmem
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor ze
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor import
libfabric:2795:1730110628::core:core:fi_paramget():372 variable mr_cache_max_size=
libfabric:2795:1730110628::core:core:fi_paramget():372 variable mr_cache_max_count=
libfabric:2795:1730110628::core:core:fi_paramget():372 variable mr_cache_monitor=
libfabric:2795:1730110628::core:core:fi_paramget():372 variable mr_cuda_cache_monitor_enabled=
libfabric:2795:1730110628::core:core:fi_paramget():372 variable mr_rocr_cache_monitor_enabled=
libfabric:2795:1730110628::core:core:fi_paramget():372 variable mr_ze_cache_monitor_enabled=
libfabric:2795:1730110628::core:mr:ofi_default_cache_size():83 default cache size=5633248768
libfabric:2795:1730110628::core:mr:ofi_monitors_init():306 Default memory monitor is: memhooks
libfabric:2795:1730110628::core:core:fi_paramget():372 variable provider=
libfabric:2795:1730110628::core:core:fi_paramget():372 variable universe_size=
libfabric:2795:1730110628::core:core:fi_paramget():372 variable av_remove_cleanup=
libfabric:2795:1730110628::core:core:fi_paramget():372 variable offload_coll_provider=
libfabric:2795:1730110628::core:core:fi_paramget():372 variable provider_path=
libfabric:2795:1730110628::core:core:ofi_register_provider():518 registering provider: udp (121.0)
libfabric:2795:1730110628::core:core:ofi_register_provider():518 registering provider: sockets (121.0)
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable prov_name=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable port_high_range=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable port_low_range=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable tx_size=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable rx_size=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable max_inject=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable max_saved=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable max_saved_size=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable max_rx_size=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable nodelay=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable staging_sbuf_size=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable prefetch_rbuf_size=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable zerocopy_size=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable trace_msg=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable disable_auto_progress=
libfabric:2795:1730110628::tcp:core:fi_paramget():372 variable io_uring=
libfabric:2795:1730110628::core:core:ofi_register_provider():518 registering provider: tcp (121.0)
libfabric:2795:1730110628::core:core:ofi_register_provider():518 registering provider: ofi_hook_noop (121.0)
libfabric:2795:1730110628::core:core:ofi_register_provider():518 registering provider: off_coll (121.0)

By the way, when will IPv6 be supported?

vegetableysm commented 1 month ago

I think one of the reasons vineyard was unable to create an RDMA server was because of the ipv6 address. But I don't know why fi_info can't get device information. If fi_info does not get device information, vineyard theoretically cannot get device information even if it is using ipv4.

And ipv6 support is not in our short-term plans at the moment. You can open a new issue about the ipv6 support and we may support ipv6 in the future. Thanks.

hsh258 commented 1 month ago

Hi, in a container, do I need to install any other packages besides librdmacm.so and libibverbs.so? For example OFED, and so on.

About fi_info: I use it by copying libfabric/util/fi_info and libfabric/util/.libs/ into the container. Is this method okay?

With IPv4 the behaviour is the same as with IPv6: fi_getinfo returns -FI_ENODATA too. In the container (IPv4 or IPv6), the rdma link command lists the RDMA devices, but fi_info -p verbs shows nothing.

rdma link
link mlx5_2/1 state DOWN physical_state DISABLED
link mlx5_3/1 state DOWN physical_state DISABLED
link mlx5_4/1 state ACTIVE physical_state LINK_UP
link mlx5_5/1 state ACTIVE physical_state LINK_UP
link mlx5_6/1 state DOWN physical_state DISABLED
link mlx5_7/1 state DOWN physical_state DISABLED
link mlx5_8/1 state ACTIVE physical_state LINK_UP
link mlx5_9/1 state ACTIVE physical_state LINK_UP
link mlx5_10/1 state DOWN physical_state DISABLED
link mlx5_11/1 state DOWN physical_state DISABLED
link mlx5_12/1 state ACTIVE physical_state LINK_UP
link mlx5_13/1 state ACTIVE physical_state LINK_UP
link mlx5_14/1 state DOWN physical_state DISABLED
link mlx5_15/1 state DOWN physical_state DISABLED
link mlx5_16/1 state ACTIVE physical_state LINK_UP
link mlx5_17/1 state ACTIVE physical_state LINK_UP
link mlx5_bond_1/1 state ACTIVE physical_state LINK_UP
link mlx5_0/1 state ACTIVE physical_state LINK_UP
link mlx5_1/1 state ACTIVE physical_state LINK_UP
link mlx5_18/1 state DOWN physical_state DISABLED
link mlx5_19/1 state DOWN physical_state DISABLED
link mlx5_20/1 state DOWN physical_state DISABLED
link mlx5_21/1 state DOWN physical_state DISABLED
link mlx5_22/1 state DOWN physical_state DISABLED
link mlx5_23/1 state DOWN physical_state DISABLED
link mlx5_24/1 state DOWN physical_state DISABLED
link mlx5_25/1 state DOWN physical_state DISABLED
link mlx5_26/1 state DOWN physical_state DISABLED
link mlx5_27/1 state DOWN physical_state DISABLED
link mlx5_28/1 state DOWN physical_state DISABLED
link mlx5_29/1 state DOWN physical_state DISABLED
link mlx5_30/1 state DOWN physical_state DISABLED
link mlx5_31/1 state DOWN physical_state DISABLED
link mlx5_32/1 state DOWN physical_state DISABLED
link mlx5_33/1 state DOWN physical_state DISABLED
link mlx5_34/1 state DOWN physical_state DISABLED
link mlx5_35/1 state DOWN physical_state DISABLED
link mlx5_36/1 state DOWN physical_state DISABLED
link mlx5_37/1 state DOWN physical_state DISABLED
link mlx5_38/1 state DOWN physical_state DISABLED
link mlx5_39/1 state DOWN physical_state DISABLED
link mlx5_40/1 state DOWN physical_state DISABLED
link mlx5_41/1 state DOWN physical_state DISABLED
link mlx5_42/1 state DOWN physical_state DISABLED
link mlx5_43/1 state DOWN physical_state DISABLED
link mlx5_44/1 state DOWN physical_state DISABLED
link mlx5_45/1 state DOWN physical_state DISABLED
link mlx5_46/1 state DOWN physical_state DISABLED
link mlx5_47/1 state DOWN physical_state DISABLED
link mlx5_48/1 state ACTIVE physical_state LINK_UP netdev net_101
link mlx5_49/1 state DOWN physical_state DISABLED
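
With this many devices, it can help to filter the `rdma link` listing down to the ports that are actually up. The following is a minimal Python sketch, not part of vineyard or rdma-core: `active_rdma_links` is a hypothetical helper, and it assumes the "link DEV/PORT state ... physical_state ..." line format shown in this comment.

```python
import re

# Matches entries such as: "link mlx5_4/1 state ACTIVE physical_state LINK_UP"
LINK_RE = re.compile(
    r"link\s+(?P<dev>\S+)/(?P<port>\d+)\s+state\s+(?P<state>\S+)"
    r"\s+physical_state\s+(?P<phys>\S+)"
)


def active_rdma_links(rdma_link_output: str):
    """Hypothetical helper: return (device, port) pairs that are ACTIVE and LINK_UP."""
    links = []
    for m in LINK_RE.finditer(rdma_link_output):
        if m["state"] == "ACTIVE" and m["phys"] == "LINK_UP":
            links.append((m["dev"], int(m["port"])))
    return links


# A few entries copied from the listing above.
sample = """\
link mlx5_2/1 state DOWN physical_state DISABLED
link mlx5_4/1 state ACTIVE physical_state LINK_UP
link mlx5_48/1 state ACTIVE physical_state LINK_UP netdev net_101
"""
print(active_rdma_links(sample))
```

Because the pattern is applied with `finditer`, it also works on output that has been pasted as one long line.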

vegetableysm commented 1 month ago

libfabric depends on libibverbs, so libibverbs is necessary.

I suggest installing libfabric and fabtests in order to use fi_info. Refer to the script below:

Install the libfabric dependencies (CentOS):

yum -y install rdma-core libibverbs libibverbs-devel

Install fabric and fabtests

cd /tmp
wget https://github.com/ofiwg/libfabric/releases/download/v1.22.0/libfabric-1.22.0.tar.bz2
tar xf ./libfabric-1.22.0.tar.bz2
cd libfabric-1.22.0/
./configure --disable-usnic \
            --disable-psm3 \
            --disable-opx \
            --disable-dmabuf_peer_mem \
            --disable-hook_hmem \
            --disable-hook_debug \
            --disable-trace \
            --disable-rxm \
            --disable-psm2 \
            --disable-xpmem \
            --disable-shm \
            --disable-rxd \
            --disable-perf \
            --disable-efa \
            --disable-mrail \
            --enable-verbs \
            --with-cuda=no
make -j
make install

cd /tmp
wget https://github.com/ofiwg/libfabric/releases/download/v1.22.0/fabtests-1.22.0.tar.bz2
tar xf ./fabtests-1.22.0.tar.bz2
cd fabtests-1.22.0
./configure
make -j
make install

Again, vineyard compiles libfabric itself from a submodule, so in theory you only need the ibverbs library to use vineyard's RDMA support (provided that libfabric also works on its own). You can install fabtests with the script above and check whether its tools work (such as fi_rma_bw or fi_info).

hsh258 commented 1 month ago

Hi, regarding ./vineyardd --rdma_endpoint=ipv4_addr:port: does ipv4_addr include a netmask? Is the format 1.2.3.4/24:1234 or 1.2.3.4:1234?

vegetableysm commented 1 month ago

Without mask. It should be the ipv4 address of RDMA device.

vegetableysm commented 1 month ago

The "ipv4:port" format only affects how the port is parsed. That is also why IPv6 cannot be used: everything after the first ":" is parsed as the port. The IPv4 address currently given has no effect; instead, vineyard automatically looks for the first suitable RDMA device.
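
To make the parsing problem concrete, here is a minimal Python sketch. It is not vineyard's actual code (vineyard's server is written in C++); `split_at_first_colon` is a hypothetical helper that only mimics the behaviour described above, where everything after the first ":" is treated as the port.

```python
def split_at_first_colon(endpoint: str):
    """Hypothetical helper: treat everything after the first ':' as the port."""
    host, _, port = endpoint.partition(":")
    return host, port


# An IPv4 endpoint splits cleanly into host and port.
ipv4_host, ipv4_port = split_at_first_colon("1.2.3.4:1234")

# An IPv6 address contains colons itself, so the "port" is not a number.
ipv6_host, ipv6_port = split_at_first_colon("fd00:80:2200:3205::1207:b02")

print(ipv4_host, ipv4_port)        # 1.2.3.4 1234
print(ipv6_host, repr(ipv6_port))  # fd00 '80:2200:3205::1207:b02'
print(ipv4_port.isdigit(), ipv6_port.isdigit())  # True False
```

The IPv6 address from this thread ends up with "80:2200:3205::1207:b02" as its port string, which cannot be converted to a port number.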

Refer to https://github.com/v6d-io/v6d/issues/2005 https://github.com/v6d-io/v6d/pull/2006

This WIP PR supports specifying a particular RDMA device by indicating its IPv4 address. However, it cannot be merged into the main branch for now because the CI failed. Refer to: https://github.com/v6d-io/v6d/issues/2008

vegetableysm commented 1 month ago

Therefore, I suggest that it is a priority to ensure that fi_info can get the RDMA device information. If fi_info can retrieve the RDMA device information, then vineyard should initialize successfully. If fi_info cannot get the RDMA device information, then vineyard will not be able to initialize either.
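
The check suggested here can be scripted. The sketch below is only an illustration under stated assumptions: `verbs_provider_visible` is a hypothetical helper, it shells out to the `fi_info -p verbs` command used in this thread, and it assumes that a successful run prints "provider:" lines, as recent libfabric releases do.

```python
import shutil
import subprocess


def verbs_provider_visible(fi_info_path: str = "fi_info"):
    """Hypothetical helper: True/False for verbs visibility, None if fi_info is missing."""
    if shutil.which(fi_info_path) is None:
        return None  # fi_info is not installed or not on PATH
    result = subprocess.run(
        [fi_info_path, "-p", "verbs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # In this thread a failing run printed "fi_getinfo: -61 (No data available)".
    return result.returncode == 0 and "provider:" in result.stdout


print(verbs_provider_visible())
```

If this prints False inside the container, vineyard's RDMA initialization is expected to fail for the same reason.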

hsh258 commented 1 month ago

Hi, as I only have an IPv6 environment, I will try IPv6. Is it OK if I parse ipv6:port to get the port? Thanks.

By the way, the RDMA client calls read_env("VINEYARD_RDMA_ENDPOINT"). Who sets VINEYARD_RDMA_ENDPOINT?

If the RPCServer runs a TCP server and an RDMA server on different ports at the same time, will they conflict? Thanks.

vegetableysm commented 1 month ago

Is it OK if I parse ipv6:port to get the port?

No. But you can give a fake IPv4 address, because vineyard will automatically look for the first suitable RDMA device. As I said above, specifying the NIC by address will be supported in the next PR.

By the way, the RDMA client calls read_env("VINEYARD_RDMA_ENDPOINT"). Who sets VINEYARD_RDMA_ENDPOINT?

If you provide rdma_endpoint when connecting to vineyardd, this environment variable is not read. If you don't provide it, the client tries to read the variable. The environment variable is set by the user.

If the RPCServer runs a TCP server and an RDMA server on different ports at the same time, will they conflict?

They won't conflict.
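
The lookup order described above (an explicit endpoint wins, otherwise the environment variable is consulted) can be sketched in Python as follows. `resolve_rdma_endpoint` is a hypothetical helper written for illustration, not a vineyard API, and the address 5.6.7.8:9601 is made up.

```python
import os


def resolve_rdma_endpoint(explicit=None, env=os.environ):
    """Hypothetical helper: an explicit endpoint wins, else fall back to the env var."""
    if explicit:
        return explicit
    # Returns None when the user has not set the variable.
    return env.get("VINEYARD_RDMA_ENDPOINT")


fake_env = {"VINEYARD_RDMA_ENDPOINT": "5.6.7.8:9601"}  # made-up value

print(resolve_rdma_endpoint("1.2.3.4:1234", env=fake_env))  # explicit value is used
print(resolve_rdma_endpoint(env=fake_env))                  # env var is used
print(resolve_rdma_endpoint(env={}))                        # nothing configured
```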

hsh258 commented 1 month ago

Hi, when the client sets VINEYARD_RDMA_ENDPOINT, for example to 1.2.3.4:1234, is that the client's own address or the server's address? Thanks.

vegetableysm commented 1 month ago

The client should use the exact IPv4 address of the vineyard server. Suppose the server's NIC is at address 1.2.3.4 and the vineyard RDMA server uses port 1234. Then the client should use 1.2.3.4:1234 as VINEYARD_RDMA_ENDPOINT. The client currently cannot specify which NIC it sends data from, so this field means the address of the server.

As I said above, specifying the NIC by address will be supported in the next PR. This covers both the NIC used by the server to receive data and the NIC used by the client to send data.

To summarize: the server currently cannot specify the NIC and automatically selects a suitable NIC to listen for RDMA messages. The RDMA endpoint on the client side is the address of the server. The client also cannot currently specify the NIC used for sending data. Specifying the NIC is implemented in the PR mentioned above, which cannot currently be merged into the main branch.

hsh258 commented 1 month ago

Hi, the server has started. It is configured with its own container address, --rdma_endpoint 60.30.10.50:9600:

I20241031 15:04:05.911379 7 rpc_server.cc:105] Vineyard will listen on 0.0.0.0:9600 for RPC
I20241031 15:04:08.535791 7 rpc_server.cc:109] Vineyard will listen on 60.30.10.2:9600 for RDMA
I20241031 15:04:08.537207 7 meta_service.cc:1195] Instance join: 0

Above, 60.30.10.50 is the RDMA net device address.

However, the service details show the rpc service as TCP, not RDMA:

kubectl get services -n admin |grep vineyard
vineyard-controller-manager-metrics-service ClusterIP 8443/TCP 21m
vineyard-webhook-service ClusterIP 443/TCP 21m
vineyardd-sample-etcd-service ClusterIP 2379/TCP 8m18s
vineyardd-sample-redis-0 ClusterIP 6379/TCP 8m18s
vineyardd-sample-redis-service ClusterIP 6379/TCP 8m18s
vineyardd-sample-rpc ClusterIP 102.11.82.204 9600/TCP 8m17s

At the same time, on the client: if VINEYARD_RDMA_ENDPOINT is set to the server address 60.30.10.50:9600, it shows

Connect RDMA server failed! Fall back to RPC mode. Error:fi_getinfo failed.

Why? Can't the client find the server?

If VINEYARD_RDMA_ENDPOINT is set to the client's own address 60.30.10.51:9600, it shows

rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)
Connect rdma server failed! retry: 1 times.

This suggests the client can find the driver. The server and client above are on the same node.

vegetableysm commented 4 weeks ago

I20241031 15:04:05.911379 7 rpc_server.cc:105] Vineyard will listen on 0.0.0.0:9600 for RPC
I20241031 15:04:08.535791 7 rpc_server.cc:109] Vineyard will listen on 60.30.10.2:9600 for RDMA

Do not make RDMA and RPC work on the same port if they use the same NIC.

vegetableysm commented 4 weeks ago

Are the client and server in the same container? If not, can the client's container connect to the server?

You can start the server and client in the same container to test if the vineyard RDMA works.

hsh258 commented 4 weeks ago

Hi, now there is an issue when the client puts data. Why? On the client:

import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)
Connected to RPC server: vineyardd-sample-rpc.admin:9600, RDMA server: 10.11.228.2:9600
objid = rpc_client.put(np.zeros(8))
mlx5: vineyard-python-client-847f8b8b86-s9t55: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 08000241 0001fdd2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 823, in put
    return put(self, value, builder, persist, name, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/builder.py", line 197, in put
    meta = get_current_builders().run(client, value, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/builder.py", line 100, in run
    return self._factory[ty](client, value, **kw)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/data/tensor.py", line 90, in numpy_ndarray_builder
    meta.addmember('buffer', build_numpy_buffer(client, value))
  File "/usr/local/lib/python3.8/dist-packages/vineyard/data/utils.py", line 178, in build_numpy_buffer
    return build_buffer(client, address, array.nbytes)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/data/utils.py", line 157, in build_buffer
    return client.create_remote_blob(buffer)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 575, in create_remote_blob
    return self.rpc_client.create_remote_blob(blob_builder)
vineyard._C.InvalidException: Invalid: GetTXCompletion failed:-5

On the server:

E20241103 02:35:23.640290 175 rpc_server.cc:203] Receive vineyard request mem!
E20241103 02:35:23.640353 175 rpc_server.cc:208] Receive remote request address: 0x7f04ebbfe040 size: 64
E20241103 02:35:23.640825 175 rpc_server.cc:238] Failed to register mem.
E20241103 02:35:23.645750 273 rpc_server.cc:389] Connection error!Client crashed.

E20241103 02:53:37.765411 178 rpc_server.cc:203] Receive vineyard request mem!
E20241103 02:53:37.765699 178 rpc_server.cc:208] Receive remote request address: 0x7f04ebffe100 size: 4194304
E20241103 02:53:37.765759 178 rpc_server.cc:238] Failed to register mem.
E20241103 02:53:37.771457 273 rpc_server.cc:389] Connection error!Client crashed.

E20241103 03:03:53.214725 181 rpc_server.cc:203] Receive vineyard request mem!
E20241103 03:03:53.214792 181 rpc_server.cc:208] Receive remote request address: 0x7f04ec3fe140 size: 8192
E20241103 03:03:53.214845 181 rpc_server.cc:238] Failed to register mem.
E20241103 02:53:37.771457 273 rpc_server.cc:389] Connection error!Client crashed.

With the same client and no changes at all, put sometimes succeeds and sometimes fails. So far I have only narrowed it down to the function fi_mr_regattr. But in the same environment, when does it succeed and when does it fail?

vegetableysm commented 3 weeks ago

Hi. Could you please show me the complete instructions to start vineyardd and the code of putting object on the client side? Let me test it locally.

hsh258 commented 3 weeks ago

Hi, the instructions are as follows. Log in and use the Python client:

export VINEYARD_RDMA_ENDPOINT=10.13.228.2:9600   # 10.13.228.2 is the rdma server address
python3

import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)  # takes about 20 s to succeed
objid = rpc_client.put(np.zeros(8))  # sometimes succeeds, sometimes fails

vegetableysm commented 3 weeks ago

And the command used to start vineyardd?

hsh258 commented 3 weeks ago

On the server side, I set:

json RpcSpecResolver::resolve() const {
  json spec;
  spec["rpc"] = FLAGS_rpc;
  spec["port"] = FLAGS_rpc_socket_port;
  spec["rdma_endpoint"] = "10.13.228.2:9600";
  return spec;
}

Then I deploy vineyard with the Helm chart.

vegetableysm commented 3 weeks ago

Also, if registering memory fails, try increasing vineyard's available memory.

hsh258 commented 3 weeks ago

Hi, with 2Gi of memory it is the same: put sometimes succeeds and sometimes fails.

cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
  name: vineyardd-sample
  namespace: admin
spec:
  replicas: 1
  metaServiceReplicas: 1
  service:
    type: ClusterIP
    port: 9600
  vineyard:
    image: test:test/admin/vineyardd:v1
    imagePullPolicy: IfNotPresent
    cpu: "2"
    memory: "2Gi"
  securityContext:
    privileged: true
EOF

dashanji commented 3 weeks ago

Hi @hsh258. Port 9600 is the default port for RPC; you should define a different, unique port for the RDMA endpoint, such as "10.13.228.2:9601".
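
As a small illustration of the advice above, the following sketch checks that an RDMA endpoint does not reuse the RPC port before the client is started. The helper names `parse_endpoint` and `check_rdma_port` are hypothetical and not part of vineyard; the addresses and ports are the ones quoted in this thread.

```python
def parse_endpoint(endpoint):
    """Split a 'host:port' string into (host, port-as-int)."""
    host, _, port = endpoint.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {endpoint!r}")
    return host, int(port)


def check_rdma_port(rpc_port, rdma_endpoint):
    """Return the RDMA (host, port); raise if it reuses the RPC port."""
    host, rdma_port = parse_endpoint(rdma_endpoint)
    if rdma_port == rpc_port:
        raise ValueError(
            f"RDMA endpoint {rdma_endpoint} reuses RPC port {rpc_port}; "
            "pick a different port for RDMA"
        )
    return host, rdma_port


# 9600 is the RPC port from this thread; 9601 is the suggested RDMA port.
print(check_rdma_port(9600, "10.13.228.2:9601"))
```

The value passed as `rdma_endpoint` would be the same string that is exported as VINEYARD_RDMA_ENDPOINT on the client.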

hsh258 commented 3 weeks ago

Hi, if I set another port, the connection fails. It shows:

export VINEYARD_RDMA_ENDPOINT=10.13.228.2:19601

python3

Python 3.8.10 (default, Sep 11 2024, 16:02:53) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',19601)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/vineyard/__init__.py", line 418, in connect
    return Client(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 296, in __init__
    raise ConnectionError(
ConnectionError: Failed to connect to vineyard via both IPC and RPC connection. Arguments, environment variables VINEYARD_IPC_SOCKET and VINEYARD_RPC_ENDPOINT, as well as the configuration file, are all unavailable.
rpc_client = vineyard.connect('10.13.228.2',19601)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/vineyard/__init__.py", line 418, in connect
    return Client(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 296, in __init__
    raise ConnectionError(
ConnectionError: Failed to connect to vineyard via both IPC and RPC connection. Arguments, environment variables VINEYARD_IPC_SOCKET and VINEYARD_RPC_ENDPOINT, as well as the configuration file, are all unavailable.

dashanji commented 3 weeks ago

The RPC connection must be established first when using RDMA; you can try the following code.

export VINEYARD_RDMA_ENDPOINT=10.13.228.2:19601
import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)
hsh258 commented 3 weeks ago

Hi, following the order above, the issue is the same, especially when the server and client are on different nodes. On the server:

rpc_server.cc:105] Vineyard will listen on 0.0.0.0:9600 for RPC
rpc_server.cc:109] Vineyard will listen on 13.13.229.3:19800 for RDMA
rpc_server.cc:203] Receive vineyard request mem!
rpc_server.cc:208] Receive remote request address: 0x7fe39bfff040 size: 31457280
rpc_server.cc:241] Failed to register mem. size 31457280
rpc_server.cc:392] Connection error!Client crashed.
rpc_server.cc:334] Receive close msg!
rpc_server.cc:400] Get RX completion failed! Error:Client crashed.

On the client:

mlx5: vineyard-python-client-65dfffb656-ghc72: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 080004bc 000106d2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 8, in process_blocks
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 823, in put
    return put(self, value, builder, persist, name, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/builder.py", line 197, in put
    meta = get_current_builders().run(client, value, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/builder.py", line 100, in run
    return self._factory[ty](client, value, **kw)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/data/tensor.py", line 90, in numpy_ndarray_builder
    meta.addmember('buffer', build_numpy_buffer(client, value))
  File "/usr/local/lib/python3.8/dist-packages/vineyard/data/utils.py", line 178, in build_numpy_buffer
    return build_buffer(client, address, array.nbytes)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/data/utils.py", line 157, in build_buffer
    return client.create_remote_blob(buffer)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 575, in create_remote_blob
    return self.rpc_client.create_remote_blob(blob_builder)
vineyard._C.InvalidException: Invalid: GetTXCompletion failed:-5
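
To put the failing registration in perspective, the sketch below extracts the size from the "Failed to register mem" line of the server log above and converts it to MiB, so it can be compared with the memory configured for vineyardd. The regular expression and the function name `failed_registrations_mib` are illustrative assumptions; only the log lines come from this thread.

```python
import re

# Two lines copied from the server log quoted in this thread.
SERVER_LOG = """\
rpc_server.cc:208] Receive remote request address: 0x7fe39bfff040 size: 31457280
rpc_server.cc:241] Failed to register mem. size 31457280
"""

# Matches the failure line and captures the size in bytes.
FAILED = re.compile(r"Failed to register mem\. size (\d+)")


def failed_registrations_mib(log_text):
    """Return the size in MiB of every failed memory registration."""
    return [int(m.group(1)) / (1024 * 1024) for m in FAILED.finditer(log_text)]


print(failed_registrations_mib(SERVER_LOG))  # [30.0]
```

The failed request is 30 MiB, which is the number to keep in mind when adjusting the memory settings discussed in the following comments.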

dashanji commented 3 weeks ago

Could you please add an option (size:1024Mi) to the vineyard yaml as follows and try again?

Hi, with 2Gi of memory the result is the same: the put sometimes succeeds and sometimes fails.

cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
name: vineyardd-sample
namespace: admin
spec:
replicas: 1
metaServiceReplicas: 1
service:
type: ClusterIP
port: 9600
vineyard: size:1024Mi image: test:test/admin/vineyardd:v1
imagePullPolicy: IfNotPresent
cpu: "2"
memory: "2Gi"
securityContext:
privileged: true
EOF

hsh258 commented 3 weeks ago

size:1024Mi

Hi, the deployment fails on "size".

With size:1024Mi:

error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context

With size:"1024Mi":

error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context

dashanji commented 3 weeks ago

Sorry for the misleading indentation. You could try the following command.

cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
  name: vineyardd-sample
  namespace: admin
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  vineyard:
    size: 1024Mi
    image: test:test/admin/vineyardd:v1
    imagePullPolicy: IfNotPresent
    cpu: "2"
    memory: "2Gi"
  securityContext:
    privileged: true
EOF
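
Since the parse errors in this thread come from YAML whitespace, one way to rule that out is to generate the manifest as JSON, which kubectl also accepts on standard input. The sketch below builds the same Vineyardd object as the command above; the function name `vineyardd_manifest` is hypothetical, and the field layout is copied from this thread rather than checked against the operator's CRD.

```python
import json


def vineyardd_manifest(name, namespace, size="1024Mi"):
    """Build the Vineyardd custom resource from this thread as a dict."""
    return {
        "apiVersion": "k8s.v6d.io/v1alpha1",
        "kind": "Vineyardd",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": 1,
            "service": {"type": "ClusterIP", "port": 9600},
            "vineyard": {
                "size": size,
                "image": "test:test/admin/vineyardd:v1",
                "imagePullPolicy": "IfNotPresent",
                "cpu": "2",
                "memory": "2Gi",
            },
            "securityContext": {"privileged": True},
        },
    }


# JSON has no significant indentation, so copy/paste cannot break nesting.
print(json.dumps(vineyardd_manifest("vineyardd-sample", "admin"), indent=2))
```

If this is saved as, say, manifest.py (an assumed file name), the output can be piped straight into `kubectl apply -f -`.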
hsh258 commented 3 weeks ago

Hi, the issue above still occurs:

error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context

This is what I deployed:

cat <<EOF | kubectl apply -f - apiVersion: k8s.v6d.io/v1alpha1 kind: Vineyardd metadata: name: vineyardd-sample namespace: admin spec: replicas: 1 service: type: ClusterIP port: 9600 vineyard: size: 1024Mi image: test:test/admin/vineyardd:v1 imagePullPolicy: IfNotPresent cpu: "2" memory: "2Gi" securityContext: privileged: true EOF

hsh258 commented 3 weeks ago

Hi, executing "rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)" takes a long time, from a few seconds to over 20 seconds. After adding debug output, the time is spent in this call:

CHECK_ERROR(!fi_getinfo(VINEYARD_FIVERSION, server_address.c_str(), std::to_string(port).c_str(), 0, hints, reinterpret_cast<fi_info**>(&(info.fi)))

Could you tell me how to shorten the time? Thanks.

dashanji commented 3 weeks ago

Does it work now? I think it is probably caused by an indentation problem.

dashanji commented 3 weeks ago

It shouldn't be very slow. What is your k8s environment (ack/aws/...) and machine environment?
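
To report the delay with concrete numbers, a small timing wrapper can be placed around the connect call. This is only a sketch: `timed` is a hypothetical helper, and the commented usage assumes the `vineyard.connect` call shown earlier in this thread.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs without a vineyard deployment.
_, seconds = timed(time.sleep, 0.01)
print(f"call took {seconds:.3f} s")

# With the real client (assumption, needs a reachable vineyardd):
#   import vineyard
#   client, seconds = timed(vineyard.connect, 'vineyardd-sample-rpc.admin', 9600)
```

Timing the call once with the service name and once with the pod IP would show whether the delay depends on how the server address is given.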

hsh258 commented 3 weeks ago

Hi, it always has the issue, so for now I delete "size: 1024Mi" when deploying it.

hsh258 commented 3 weeks ago

Hi, kubectl Version: v1.28.3

dashanji commented 3 weeks ago

How do you install the vineyard operator? Besides, can you copy the code to the shell and try again? It would be better to show a screenshot of the failure so that we can check what is wrong.

hsh258 commented 3 weeks ago

Hi, I used the Helm chart to install the vineyard operator.

dashanji commented 3 weeks ago

error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context

This error shouldn't happen. In my test environment, it works fine as follows.

$ helm repo update
$ kubectl create namespace vineyard-system
$ helm install vineyard-operator vineyard/vineyard-operator -n vineyard-system
$ cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
  name: vineyardd-sample
  namespace: admin
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  vineyard:
    size: 1024Mi
    image: test:test/admin/vineyardd:v1
    imagePullPolicy: IfNotPresent
    cpu: "2"
    memory: "2Gi"
  securityContext:
    privileged: true
EOF

hsh258 commented 3 weeks ago

Hi, I tried it and it deploys now, thanks. By the way, connecting still takes a long time.

dashanji commented 3 weeks ago

I think it is likely to be caused by your environmental factors. Can RDMA work now?

hsh258 commented 3 weeks ago

Hi, RDMA doesn't work normally now. It connects very slowly, and putting and getting data is very slow, too.

hsh258 commented 3 weeks ago

Hi, I want to try to set the server IP through fi_getinfo. Is this feasible? If so, how do I set it? Thanks.

vegetableysm commented 3 weeks ago

Refer to src/common/rdma/rdma_client.cc, src/common/rdma/rdma_server.cc and https://ofiwg.github.io/libfabric/

The vineyard client gets the server's RDMA device info by calling fi_getinfo with the server IP address as a parameter.

hsh258 commented 3 weeks ago

Hi, regarding the slow RDMA write and read speed: I find that the transmission stalls for over 100 milliseconds after every few hundred to 1000 packets. Two consecutive "RC send only qp=0x000274" packets are separated by over 100 milliseconds. I want to try changing tx_ctx_cnt and rx_ctx_cnt of verbs_domain_attr. Is this feasible, or is there another way? Thanks.
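
The stall pattern described above can be located automatically once the packet timestamps are exported from the capture. The sketch below flags every gap longer than 100 ms between consecutive packets; `find_stalls` is a hypothetical helper and the sample timestamps are made up for illustration, not taken from the real trace.

```python
def find_stalls(timestamps, threshold=0.1):
    """Return (index, gap) for each packet that arrives more than
    `threshold` seconds after the previous one."""
    stalls = []
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        if gap > threshold:
            stalls.append((i, gap))
    return stalls


# Made-up capture times in seconds: a burst, a pause, then another burst.
sample = [0.000, 0.001, 0.002, 0.130, 0.131]
for index, gap in find_stalls(sample):
    print(f"packet {index} arrived {gap * 1000:.0f} ms after the previous one")
```

Counting how many packets lie between consecutive stalls would show whether the pause follows a fixed number of sends, which is the kind of detail that helps when asking about queue or context settings.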