Open hsh258 opened 1 month ago
Hi @hsh258, could you please use a tool such as ib_write_bw or libfabric to check whether the RDMA device works?
Hi, here are some details. Scenario: inside a container, fi_getinfo returns -FI_ENODATA.
find / -name 'librdmacm*' 2>/dev/null
/var/lib/dpkg/info/librdmacm1:amd64.shlibs
/var/lib/dpkg/info/librdmacm1:amd64.triggers
/var/lib/dpkg/info/librdmacm1:amd64.symbols
/var/lib/dpkg/info/librdmacm1:amd64.md5sums
/var/lib/dpkg/info/librdmacm1:amd64.list
/var/cache/apt/archives/librdmacm1_28.0-1ubuntu1_amd64.deb
/usr/lib/x86_64-linux-gnu/librdmacm.so
/usr/lib/x86_64-linux-gnu/librdmacm.so.1.2.28.0
/usr/lib/x86_64-linux-gnu/librdmacm.so.1
/usr/share/doc/librdmacm1
find / -name 'libibverbs*' 2>/dev/null
/etc/libibverbs.d
/var/lib/dpkg/info/libibverbs1:amd64.md5sums
/var/lib/dpkg/info/libibverbs1:amd64.shlibs
/var/lib/dpkg/info/libibverbs-dev:amd64.list
/var/lib/dpkg/info/libibverbs1:amd64.list
/var/lib/dpkg/info/libibverbs-dev:amd64.md5sums
/var/lib/dpkg/info/libibverbs1:amd64.postinst
/var/lib/dpkg/info/libibverbs1:amd64.symbols
/var/lib/dpkg/info/libibverbs1:amd64.triggers
/var/cache/apt/archives/libibverbs1_28.0-1ubuntu1_amd64.deb
/var/cache/apt/archives/libibverbs-dev_28.0-1ubuntu1_amd64.deb
/usr/lib/x86_64-linux-gnu/pkgconfig/libibverbs.pc
/usr/lib/x86_64-linux-gnu/libibverbs.so
/usr/lib/x86_64-linux-gnu/libibverbs.a
/usr/lib/x86_64-linux-gnu/libibverbs
/usr/lib/x86_64-linux-gnu/libibverbs.so.1.8.28.0
/usr/lib/x86_64-linux-gnu/libibverbs.so.1
/usr/share/doc/libibverbs1
/usr/share/doc/libibverbs-dev
apt-cache search libfabric
libfabric-bin - Diagnosis programs for the libfabric communication library
libfabric-dev - Development files for libfabric1
libfabric1 - libfabric communication library
dpkg -l | grep libfabric
ii libfabric1 1.6.2-3ubuntu0.1 amd64 libfabric communication library
However, the container has no fi_info tool, so I can't run "fi_info -p verbs".
As for "whether the rdma dev can work": the RDMA device definitely works.
fi_info
/usr/local/bin/.libs/fi_info: /lib/x86_64-linux-gnu/libfabric.so.1: version `FABRIC_1.4' not found (required by /usr/local/bin/.libs/fi_info)
/usr/local/bin/.libs/fi_info: /lib/x86_64-linux-gnu/libfabric.so.1: version `FABRIC_1.7' not found (required by /usr/local/bin/.libs/fi_info)
root@d0cf4f0fd8bb:/usr/local/bin# find / -name 'libfabric.so*' 2>/dev/null
/usr/lib/x86_64-linux-gnu/libfabric.so.1
/usr/lib/x86_64-linux-gnu/libfabric.so.1.9.15
root@d0cf4f0fd8bb:/usr/local/bin# ls -l /usr/lib/x86_64-linux-gnu/libfabric.so.1
lrwxrwxrwx 1 root root 19 Nov 30 2022 /usr/lib/x86_64-linux-gnu/libfabric.so.1 -> libfabric.so.1.9.15
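The loader messages above usually mean that the copied fi_info binary was linked against a newer libfabric than the libfabric.so.1 the loader found. As a rough sketch (the helper name `missing_symbol_versions` is made up for illustration), the missing symbol versions can be pulled out of such output like this:

```python
import re
from typing import List

# Matches glibc loader messages such as:
#   ...: version `FABRIC_1.7' not found (required by ...)
# The opening backtick is optional because it is often lost when pasting.
_MISSING = re.compile(r"version\s+`?(?P<ver>[A-Za-z0-9_.]+)'\s+not found")


def missing_symbol_versions(loader_output: str) -> List[str]:
    """Return the distinct symbol versions reported as missing, in order."""
    seen: List[str] = []
    for match in _MISSING.finditer(loader_output):
        version = match.group("ver")
        if version not in seen:
            seen.append(version)
    return seen


sample = (
    "/usr/local/bin/.libs/fi_info: /lib/x86_64-linux-gnu/libfabric.so.1: "
    "version `FABRIC_1.4' not found (required by /usr/local/bin/.libs/fi_info)\n"
    "/usr/local/bin/.libs/fi_info: /lib/x86_64-linux-gnu/libfabric.so.1: "
    "version FABRIC_1.7' not found (required by /usr/local/bin/.libs/fi_info)\n"
)
print(missing_symbol_versions(sample))
```

If the list is non-empty, the fi_info binary and the installed libfabric most likely come from different releases, so building both from the same source tree is the safer route.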
apt-cache search libfabric
Hi! Could you give me more details, such as the specific error messages?
Could you also share the command you use to run vineyardd? Thanks.
By the way, you can install fabtests to run fi_info. Please make sure the fabtests version is compatible with libfabric.
Hi, here is the command:
./vineyardd --rdma_endpoint fd00:80:2200:3205::1207:b02
fi_info -p verbs
fi_getinfo: -61 (No data available)
lrwxrwxrwx. 1 root root 19 Oct 26 09:17 libfabric.so -> libfabric.so.1.24.0
lrwxrwxrwx. 1 root root 19 Oct 26 09:17 libfabric.so.1 -> libfabric.so.1.24.0
-rwxr-xr-x. 1 root root 1187520 Oct 26 09:17 libfabric.so.1.24.0
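For reference, the -61 printed by fi_info is the negated errno value: libfabric defines FI_ENODATA as the system ENODATA, which is 61 on Linux. A small sketch for decoding such return codes (the helper name `describe_fi_error` is only illustrative):

```python
import errno
import os


def describe_fi_error(ret: int) -> str:
    """Translate a negative libfabric-style return code into its errno name."""
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the platform's human-readable text for the code.
    return f"{ret}: {name} ({os.strerror(code)})"


# On Linux, errno.ENODATA is 61, matching "fi_getinfo: -61" above.
print(describe_fi_error(-errno.ENODATA))
```

So "fi_getinfo: -61 (No data available)" and "-FI_ENODATA" describe the same condition: no provider matched the request.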
Is "fd00:80:2200:3205::1207:b02" an IPv6 address? Vineyard does not currently support IPv6 address resolution, so please try again with an IPv4 address. Additionally, RDMA devices require root privileges. Are you running this as root?
By the way, the RDMA module of vineyard is based on libfabric, so if the libfabric tool "fi_info" can't see the RDMA device information, vineyard can't get it either.
In addition, the "--rdma_endpoint" parameter must include the port for the address, for example: ./vineyardd --rdma_endpoint=ipv4_addr:port
Hi, yes, it is an IPv6 address, and I am logged in as root.
Detailed error info:
libfabric:2795:1730110628::core:core:fi_paramget():372
By the way, when will IPv6 be supported?
I think one of the reasons vineyard was unable to create an RDMA server was because of the ipv6 address. But I don't know why fi_info can't get device information. If fi_info does not get device information, vineyard theoretically cannot get device information even if it is using ipv4.
And ipv6 support is not in our short-term plans at the moment. You can open a new issue about the ipv6 support and we may support ipv6 in the future. Thanks.
Hi, do I need to install any other packages besides librdmacm.so and libibverbs.so in the container scenario, for example OFED?
As for fi_info, I use it by copying libfabric/util/fi_info and libfabric/util/.libs/ into the container. Is this method okay?
With IPv4 the behavior is the same as with IPv6: fi_getinfo returns -FI_ENODATA too.
In the container (IPv4 or IPv6), the rdma link command can see the RDMA devices, but fi_info -p verbs shows nothing.
rdma link
link mlx5_2/1 state DOWN physical_state DISABLED
link mlx5_3/1 state DOWN physical_state DISABLED
link mlx5_4/1 state ACTIVE physical_state LINK_UP
link mlx5_5/1 state ACTIVE physical_state LINK_UP
link mlx5_6/1 state DOWN physical_state DISABLED
link mlx5_7/1 state DOWN physical_state DISABLED
link mlx5_8/1 state ACTIVE physical_state LINK_UP
link mlx5_9/1 state ACTIVE physical_state LINK_UP
link mlx5_10/1 state DOWN physical_state DISABLED
link mlx5_11/1 state DOWN physical_state DISABLED
link mlx5_12/1 state ACTIVE physical_state LINK_UP
link mlx5_13/1 state ACTIVE physical_state LINK_UP
link mlx5_14/1 state DOWN physical_state DISABLED
link mlx5_15/1 state DOWN physical_state DISABLED
link mlx5_16/1 state ACTIVE physical_state LINK_UP
link mlx5_17/1 state ACTIVE physical_state LINK_UP
link mlx5_bond_1/1 state ACTIVE physical_state LINK_UP
link mlx5_0/1 state ACTIVE physical_state LINK_UP
link mlx5_1/1 state ACTIVE physical_state LINK_UP
link mlx5_18/1 state DOWN physical_state DISABLED
link mlx5_19/1 state DOWN physical_state DISABLED
link mlx5_20/1 state DOWN physical_state DISABLED
link mlx5_21/1 state DOWN physical_state DISABLED
link mlx5_22/1 state DOWN physical_state DISABLED
link mlx5_23/1 state DOWN physical_state DISABLED
link mlx5_24/1 state DOWN physical_state DISABLED
link mlx5_25/1 state DOWN physical_state DISABLED
link mlx5_26/1 state DOWN physical_state DISABLED
link mlx5_27/1 state DOWN physical_state DISABLED
link mlx5_28/1 state DOWN physical_state DISABLED
link mlx5_29/1 state DOWN physical_state DISABLED
link mlx5_30/1 state DOWN physical_state DISABLED
link mlx5_31/1 state DOWN physical_state DISABLED
link mlx5_32/1 state DOWN physical_state DISABLED
link mlx5_33/1 state DOWN physical_state DISABLED
link mlx5_34/1 state DOWN physical_state DISABLED
link mlx5_35/1 state DOWN physical_state DISABLED
link mlx5_36/1 state DOWN physical_state DISABLED
link mlx5_37/1 state DOWN physical_state DISABLED
link mlx5_38/1 state DOWN physical_state DISABLED
link mlx5_39/1 state DOWN physical_state DISABLED
link mlx5_40/1 state DOWN physical_state DISABLED
link mlx5_41/1 state DOWN physical_state DISABLED
link mlx5_42/1 state DOWN physical_state DISABLED
link mlx5_43/1 state DOWN physical_state DISABLED
link mlx5_44/1 state DOWN physical_state DISABLED
link mlx5_45/1 state DOWN physical_state DISABLED
link mlx5_46/1 state DOWN physical_state DISABLED
link mlx5_47/1 state DOWN physical_state DISABLED
link mlx5_48/1 state ACTIVE physical_state LINK_UP netdev net_101
link mlx5_49/1 state DOWN physical_state DISABLED
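With this many ports, it can help to filter the rdma link output down to the ACTIVE ones and see which of them have a netdev attached. The following is only a sketch; `active_ports` is a hypothetical helper and the regular expression assumes the output format shown above.

```python
import re
from typing import List, Optional, Tuple

# One entry looks like:
#   link mlx5_48/1 state ACTIVE physical_state LINK_UP netdev net_101
# The trailing "netdev <name>" part is optional.
_LINK = re.compile(
    r"link\s+(?P<dev>\S+)/(?P<port>\d+)\s+state\s+(?P<state>\S+)"
    r"\s+physical_state\s+(?P<phys>\S+)(?:\s+netdev\s+(?P<netdev>\S+))?"
)


def active_ports(rdma_link_output: str) -> List[Tuple[str, int, Optional[str]]]:
    """Return (device, port, netdev-or-None) for every ACTIVE port."""
    return [
        (m["dev"], int(m["port"]), m["netdev"])
        for m in _LINK.finditer(rdma_link_output)
        if m["state"] == "ACTIVE"
    ]


sample = """\
link mlx5_2/1 state DOWN physical_state DISABLED
link mlx5_4/1 state ACTIVE physical_state LINK_UP
link mlx5_48/1 state ACTIVE physical_state LINK_UP netdev net_101
link mlx5_49/1 state DOWN physical_state DISABLED
"""
for dev, port, netdev in active_ports(sample):
    print(dev, port, netdev or "(no netdev)")
```

In the listing above, only mlx5_48/1 reports a netdev. As far as I understand the libfabric verbs provider, it resolves addresses through an IP interface associated with the RDMA device, so ACTIVE ports without a netdev may be worth a closer look; treat this as a hint, not a diagnosis.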
libfabric depends on libibverbs, so libibverbs is necessary.
I suggest installing libfabric and fabtests so that you can use fi_info. Refer to the script below:
# For fabric dependencies (CentOS):
yum -y install rdma-core libibverbs libibverbs-devel
# Install fabric and fabtests
cd /tmp
wget https://github.com/ofiwg/libfabric/releases/download/v1.22.0/libfabric-1.22.0.tar.bz2
tar xf ./libfabric-1.22.0.tar.bz2
cd libfabric-1.22.0/
./configure --disable-usnic \
--disable-psm3 \
--disable-opx \
--disable-dmabuf_peer_mem \
--disable-hook_hmem \
--disable-hook_debug \
--disable-trace \
--disable-rxm \
--disable-psm2 \
--disable-xpmem \
--disable-shm \
--disable-rxd \
--disable-perf \
--disable-efa \
--disable-mrail \
--enable-verbs \
--with-cuda=no
make -j
make install
cd /tmp
wget https://github.com/ofiwg/libfabric/releases/download/v1.22.0/fabtests-1.22.0.tar.bz2
tar xf ./fabtests-1.22.0.tar.bz2
cd fabtests-1.22.0
./configure
make -j
make install
Again, vineyard compiles libfabric from its own submodule, so in theory you only need the ibverbs library to use vineyard's RDMA support (provided that libfabric also works on its own). You can install fabtests with the script above and check whether its tools work (for example fi_rma_bw or fi_info).
Hi, in the command ./vineyardd --rdma_endpoint=ipv4_addr:port, does ipv4_addr include a netmask? Is the format 1.2.3.4/24:1234 or 1.2.3.4:1234?
Without a mask. It should be the IPv4 address of the RDMA device.
The format of "ipv4:port" only affects the parsing of the port. The reason that it cannot use IPv6 is the same, as it parses the content after the first ":" as the port. The currently specified RDMA IPv4 address will not take effect; instead, it will automatically look for the first suitable RDMA device.
Refer to https://github.com/v6d-io/v6d/issues/2005 https://github.com/v6d-io/v6d/pull/2006
This WIP PR supports specifying a particular RDMA device by indicating its IPv4 address. However, it cannot be merged into the main branch for now because the CI failed. Refer to: https://github.com/v6d-io/v6d/issues/2008
Therefore, I suggest that it is a priority to ensure that fi_info can get the RDMA device information. If fi_info can retrieve the RDMA device information, then vineyard should initialize successfully. If fi_info cannot get the RDMA device information, then vineyard will not be able to initialize either.
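To illustrate why an IPv6 endpoint (or one with a netmask) does not survive the parsing described above, here is a minimal sketch. It is not vineyard's actual code: `parse_endpoint_first_colon` is a hypothetical helper that only mimics "everything after the first ':' is the port" and then validates the host part as IPv4.

```python
import ipaddress
from typing import Tuple


def parse_endpoint_first_colon(endpoint: str) -> Tuple[str, int]:
    """Split at the FIRST ':' and treat the remainder as the port."""
    host, sep, port_text = endpoint.partition(":")
    if not sep:
        raise ValueError(f"missing ':port' in {endpoint!r}")
    # Raises ipaddress.AddressValueError (a ValueError) for non-IPv4 hosts,
    # including hosts that carry a netmask such as "1.2.3.4/24".
    ipaddress.IPv4Address(host)
    port = int(port_text)  # raises ValueError if the remainder is not a number
    if not 0 < port < 65536:
        raise ValueError(f"port out of range in {endpoint!r}")
    return host, port


for candidate in (
    "1.2.3.4:1234",
    "1.2.3.4/24:1234",
    "fd00:80:2200:3205::1207:b02",
):
    try:
        print(candidate, "->", parse_endpoint_first_colon(candidate))
    except ValueError as exc:
        print(candidate, "-> rejected:", exc)
```

For the IPv6 address, the split yields the host "fd00" and the "port" "80:2200:3205::1207:b02", neither of which is usable, which matches the explanation above.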
Hi, as I only have an IPv6 environment, I will try IPv6. Is it OK if I parse ipv6:port to get the port? Thanks. By the way, the RDMA client calls read_env("VINEYARD_RDMA_ENDPOINT"); who sets VINEYARD_RDMA_ENDPOINT? If the RPCServer runs a TCP server and an RDMAServer on different ports at the same time, will they conflict? Thanks.
Is it OK if I parse ipv6:port to get the port?
No, but you can pass a placeholder IPv4 address, because vineyard automatically looks for the first suitable RDMA device. As I said above, specifying the NIC by address will be supported in the next PR.
By the way, the RDMA client calls read_env("VINEYARD_RDMA_ENDPOINT"); who sets VINEYARD_RDMA_ENDPOINT?
If you provide rdma_endpoint when connecting to vineyardd, this environment variable is not read. If you don't provide it, the client tries to read the variable. The environment variable is set by the user.
If the RPCServer runs a TCP server and an RDMAServer on different ports at the same time, will they conflict?
They won't conflict.
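The lookup order described above (explicit rdma_endpoint first, then the environment variable) can be sketched as follows. This is an illustration only, not the client's real implementation; `resolve_rdma_endpoint` is a hypothetical helper, while the variable name VINEYARD_RDMA_ENDPOINT comes from this thread.

```python
import os
from typing import Mapping, Optional


def resolve_rdma_endpoint(
    explicit: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer an explicitly passed endpoint; otherwise fall back to the env var."""
    if explicit:
        return explicit
    # An unset or empty variable means "no RDMA endpoint configured".
    return environ.get("VINEYARD_RDMA_ENDPOINT") or None


env = {"VINEYARD_RDMA_ENDPOINT": "1.2.3.4:1234"}
print(resolve_rdma_endpoint("5.6.7.8:4321", env))  # explicit value wins
print(resolve_rdma_endpoint(None, env))            # falls back to the variable
print(resolve_rdma_endpoint(None, {}))             # nothing configured
```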
Hi, when a user sets VINEYARD_RDMA_ENDPOINT on the client, for example 1.2.3.4:1234, is that the client's own address or the server's address? Thanks.
The client should use the exact IPv4 address of the vineyard server. Suppose the server's NIC has the address 1.2.3.4 and the vineyard RDMA server uses port 1234. Then you should use 1.2.3.4:1234 as the client's VINEYARD_RDMA_ENDPOINT. The client currently cannot specify the NIC used to send the data either, so this field means the address of the server.
As I said above, specifying NIC by address will be supported in the next pr. These include the NIC used by the server to receive data and the NIC used by the client to send data.
To summarize, there is currently no way for the server to specify the NIC, and the server will automatically select the appropriate NIC to listen to RDMA messages. The rdma endpoint on the client side is the address of the server. It is also currently not possible for client to specify the NIC used for sending data. The feature to specify NIC will be supported in the pr mentioned above, but cannot currently be merged into the main branch.
Hi, the server has started. The server is configured with its own container address: --rdma_endpoint 60.30.10.50:9600
I20241031 15:04:05.911379 7 rpc_server.cc:105] Vineyard will listen on 0.0.0.0:9600 for RPC
I20241031 15:04:08.535791 7 rpc_server.cc:109] Vineyard will listen on 60.30.10.2:9600 for RDMA
I20241031 15:04:08.537207 7 meta_service.cc:1195] Instance join: 0
Above, 60.30.10.50 is the address of the RDMA network device.
However, the service details show that the RPC service is TCP, not RDMA:
kubectl get services -n admin |grep vineyard
vineyard-controller-manager-metrics-service ClusterIP 8443/TCP 21m
vineyard-webhook-service ClusterIP 443/TCP 21m
vineyardd-sample-etcd-service ClusterIP 2379/TCP 8m18s
vineyardd-sample-redis-0 ClusterIP 6379/TCP 8m18s
vineyardd-sample-redis-service ClusterIP 6379/TCP 8m18s
vineyardd-sample-rpc ClusterIP 102.11.82.204 9600/TCP 8m17s
At the same time, on the client, if VINEYARD_RDMA_ENDPOINT is set to the server address 60.30.10.50:9600, it shows:
Connect RDMA server failed! Fall back to RPC mode. Error:fi_getinfo failed.
Why? Can the client not find the server?
If VINEYARD_RDMA_ENDPOINT is set to the client's own address 60.30.10.51:9600, it shows:
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)
Connect rdma server failed! retry: 1 times.
This shows that the client can find the driver. The server and client above are on the same node.
I20241031 15:04:05.911379 7 rpc_server.cc:105] Vineyard will listen on 0.0.0.0:9600 for RPC
I20241031 15:04:08.535791 7 rpc_server.cc:109] Vineyard will listen on 60.30.10.2:9600 for RDMA
Do not make RDMA and RPC work on the same port if they use the same NIC.
On the client, if VINEYARD_RDMA_ENDPOINT is set to the server address 60.30.10.50:9600, it shows: Connect RDMA server failed! Fall back to RPC mode. Error:fi_getinfo failed. Why? Can the client not find the server? If VINEYARD_RDMA_ENDPOINT is set to the client's own address 60.30.10.51:9600, it shows: rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)
Are the client and server in the same container? If not, can the client's container connect to the server?
You can start the server and client in the same container to test if the vineyard RDMA works.
Hi, now there is an issue when the client puts data. Why? On the client:
import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)
Connected to RPC server: vineyardd-sample-rpc.admin:9600, RDMA server: 10.11.228.2:9600
objid = rpc_client.put(np.zeros(8))
mlx5: vineyard-python-client-847f8b8b86-s9t55: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 08000241 0001fdd2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 823, in put
    return put(self, value, builder, persist, name, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/builder.py", line 197, in put
    meta = get_current_builders().run(client, value, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/builder.py", line 100, in run
    return self._factory[ty](client, value, **kw)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/data/tensor.py", line 90, in numpy_ndarray_builder
    meta.addmember('buffer', build_numpy_buffer(client, value))
  File "/usr/local/lib/python3.8/dist-packages/vineyard/data/utils.py", line 178, in build_numpy_buffer
    return build_buffer(client, address, array.nbytes)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/data/utils.py", line 157, in build_buffer
    return client.create_remote_blob(buffer)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 575, in create_remote_blob
    return self.rpc_client.create_remote_blob(blob_builder)
vineyard._C.InvalidException: Invalid: GetTXCompletion failed:-5
On the server:
E20241103 02:35:23.640290 175 rpc_server.cc:203] Receive vineyard request mem!
E20241103 02:35:23.640353 175 rpc_server.cc:208] Receive remote request address: 0x7f04ebbfe040 size: 64
E20241103 02:35:23.640825 175 rpc_server.cc:238] Failed to register mem.
E20241103 02:35:23.645750 273 rpc_server.cc:389] Connection error!Client crashed.
E20241103 02:53:37.765411 178 rpc_server.cc:203] Receive vineyard request mem!
E20241103 02:53:37.765699 178 rpc_server.cc:208] Receive remote request address: 0x7f04ebffe100 size: 4194304
E20241103 02:53:37.765759 178 rpc_server.cc:238] Failed to register mem.
E20241103 02:53:37.771457 273 rpc_server.cc:389] Connection error!Client crashed.
E20241103 03:03:53.214725 181 rpc_server.cc:203] Receive vineyard request mem!
E20241103 03:03:53.214792 181 rpc_server.cc:208] Receive remote request address: 0x7f04ec3fe140 size: 8192
E20241103 03:03:53.214845 181 rpc_server.cc:238] Failed to register mem.
E20241103 02:53:37.771457 273 rpc_server.cc:389] Connection error!Client crashed.
With the same client and no changes at all, put sometimes succeeds and sometimes fails. So far I have only narrowed it down to the function fi_mr_regattr. But why does it sometimes succeed and sometimes fail in the same environment?
Hi. Could you please show me the complete instructions to start vineyardd and the code of putting object on the client side? Let me test it locally.
Hi, the instructions are as follows. Using the Python client, log in and run:
export VINEYARD_RDMA_ENDPOINT=10.13.228.2:9600 # 10.13.228.2 is the RDMA server address
python3
import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600) # takes about 20 s to succeed
objid = rpc_client.put(np.zeros(8)) # sometimes succeeds, sometimes fails
And what is the command used to start vineyardd?
On the server side, I set:
json RpcSpecResolver::resolve() const {
  json spec;
  spec["rpc"] = FLAGS_rpc;
  spec["port"] = FLAGS_rpc_socket_port;
  spec["rdma_endpoint"] = "10.13.228.2:9600";
  return spec;
}
Then I deployed vineyard using the Helm chart.
Also, if registering memory fails, try increasing vineyard's available memory.
Hi, with 2Gi of memory the result is the same: put sometimes succeeds and sometimes fails.
cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
  name: vineyardd-sample
  namespace: admin
spec:
  replicas: 1
  metaServiceReplicas: 1
  service:
    type: ClusterIP
    port: 9600
  vineyard:
    image: test:test/admin/vineyardd:v1
    imagePullPolicy: IfNotPresent
    cpu: "2"
    memory: "2Gi"
  securityContext:
    privileged: true
EOF
Hi @hsh258. Port 9600 is the default port for RPC, so you should choose a different, unique port for the RDMA endpoint, such as "10.13.228.2:9601".
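A quick way to catch this misconfiguration before deploying is to compare the two ports up front. The sketch below is only an illustration; `rdma_port_checked` is a hypothetical helper, and 9600 is the RPC default mentioned above.

```python
def rdma_port_checked(rdma_endpoint: str, rpc_port: int = 9600) -> int:
    """Return the RDMA port, refusing an endpoint that reuses the RPC port."""
    host, sep, port_text = rdma_endpoint.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected 'ipv4_addr:port', got {rdma_endpoint!r}")
    port = int(port_text)
    if port == rpc_port:
        raise ValueError(
            f"RDMA port {port} is the same as the RPC port; pick another one"
        )
    return port


print(rdma_port_checked("10.13.228.2:9601"))
try:
    rdma_port_checked("10.13.228.2:9600")
except ValueError as exc:
    print("rejected:", exc)
```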
Hi, if I set another port, the connection fails and shows:
Python 3.8.10 (default, Sep 11 2024, 16:02:53) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',19601)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/vineyard/__init__.py", line 418, in connect
    return Client(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 296, in __init__
    raise ConnectionError(
ConnectionError: Failed to connect to vineyard via both IPC and RPC connection. Arguments, environment variables VINEYARD_IPC_SOCKET and VINEYARD_RPC_ENDPOINT, as well as the configuration file, are all unavailable.
rpc_client = vineyard.connect('10.13.228.2',19601)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/vineyard/__init__.py", line 418, in connect
    return Client(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 296, in __init__
    raise ConnectionError(
ConnectionError: Failed to connect to vineyard via both IPC and RPC connection. Arguments, environment variables VINEYARD_IPC_SOCKET and VINEYARD_RPC_ENDPOINT, as well as the configuration file, are all unavailable.
The RPC connection must be established first when using RDMA. You can try the following code.
export VINEYARD_RDMA_ENDPOINT=10.13.228.2:19601
import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)
Hi, following the steps above, the issue is the same, especially when the server and client are on different nodes. On the server:
rpc_server.cc:105] Vineyard will listen on 0.0.0.0:9600 for RPC
rpc_server.cc:109] Vineyard will listen on 13.13.229.3:19800 for RDMA
rpc_server.cc:203] Receive vineyard request mem!
rpc_server.cc:208] Receive remote request address: 0x7fe39bfff040 size: 31457280
rpc_server.cc:241] Failed to register mem. size 31457280
rpc_server.cc:392] Connection error!Client crashed.
rpc_server.cc:334] Receive close msg!
rpc_server.cc:400] Get RX completion failed! Error:Client crashed.
In client:
mlx5: vineyard-python-client-65dfffb656-ghc72: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 080004bc 000106d2
Traceback (most recent call last):
File "
Could you please add an option (size:1024Mi) to the vineyard yaml as follows and try again?
cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
name: vineyardd-sample
namespace: admin
spec:
replicas: 1
metaServiceReplicas: 1
service:
type: ClusterIP port: 9600
vineyard: size:1024Mi image: test:test/admin/vineyardd:v1
imagePullPolicy: IfNotPresent cpu: "2" memory: "2Gi"
securityContext:
privileged: true
EOF
Hi, the deployment fails on "size".
With size:1024Mi:
error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context
With size:"1024Mi":
error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context
Sorry for the misleading indentation. You could try the following command.
cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
  name: vineyardd-sample
  namespace: admin
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  vineyard:
    size: 1024Mi
    image: test:test/admin/vineyardd:v1
    imagePullPolicy: IfNotPresent
    cpu: "2"
    memory: "2Gi"
  securityContext:
    privileged: true
EOF
Hi, the issue above still occurs:
error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context
Deployed as:
cat <<EOF | kubectl apply -f - apiVersion: k8s.v6d.io/v1alpha1 kind: Vineyardd metadata: name: vineyardd-sample namespace: admin spec: replicas: 1 service: type: ClusterIP port: 9600 vineyard: size: 1024Mi image: test:test/admin/vineyardd:v1 imagePullPolicy: IfNotPresent cpu: "2" memory: "2Gi" securityContext: privileged: true EOF
Hi, executing "rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)" takes a long time, from a few seconds to over 20 seconds. After adding debug output, the time is spent in this call: CHECK_ERROR(!fi_getinfo(VINEYARD_FIVERSION, server_address.c_str(), std::to_string(port).c_str(), 0, hints, reinterpret_cast<fi_info**>(&(info.fi))) Could you tell me how to shorten the time? Thanks.
Does it work now? I think the error was caused by the indentation problem.
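One way to look for this kind of indentation problem before running kubectl apply is a small heuristic check over the manifest text. The sketch below is an assumption-laden illustration, not a YAML parser: find_suspect_lines is a hypothetical helper that only understands plain block mappings (no lists, no multi-line scalars), and it flags entries indented deeper than a preceding "key: value" line, which is one common trigger for "mapping values are not allowed in this context".

```python
from typing import List, Tuple


def find_suspect_lines(manifest: str) -> List[Tuple[int, str]]:
    """Return (line_number, text) for entries nested under a line that
    already carries a scalar value. Heuristic only; block mappings only."""
    suspects = []
    prev_indent = 0
    prev_has_value = False
    for number, raw in enumerate(manifest.splitlines(), start=1):
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments do not affect nesting
        indent = len(raw) - len(raw.lstrip(" "))
        key, sep, value = stripped.partition(":")
        # "key:" or "key: value" counts as a mapping entry; "a:b" does not.
        is_entry = bool(sep) and (value == "" or value.startswith(" "))
        if is_entry and prev_has_value and indent > prev_indent:
            suspects.append((number, stripped))
        prev_indent = indent
        prev_has_value = is_entry and value.strip() != ""
    return suspects
```

Passing the text between the two EOF markers to find_suspect_lines returns an empty list when no entry is nested under a scalar value. A real YAML parser remains the authoritative check.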
It shouldn't be that slow. What is your k8s environment (ACK/AWS/...) and your machine environment?
Hi, the issue still occurs every time, so for now I delete "size: 1024Mi" when deploying it.
Hi, kubectl Version: v1.28.3
How did you install the vineyard operator? Also, could you copy the command into the shell and try again? A screenshot of the failure would help us see what is going wrong.
Hi, I used the Helm chart to install the vineyard operator.
error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context
This error shouldn't happen. In my test environment, it works fine as follows.
$ helm repo update
$ kubectl create namespace vineyard-system
$ helm install vineyard-operator vineyard/vineyard-operator -n vineyard-system
$ cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
  name: vineyardd-sample
  namespace: admin
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  vineyard:
    size: 1024Mi
    image: test:test/admin/vineyardd:v1
    imagePullPolicy: IfNotPresent
    cpu: "2"
    memory: "2Gi"
  securityContext:
    privileged: true
EOF
Hi, I tried it and the deployment works now, thanks. By the way, connecting still takes a long time.
I think it is likely caused by your environment. Does RDMA work now?
Hi, RDMA still does not work normally. It connects very slowly, and putting and getting data is very slow too.
Hi, I want to try setting the server IP through fi_getinfo. Is that feasible? If so, how should I set it? Thanks.
Refer to src/common/rdma/rdma_client.cc, src/common/rdma/rdma_server.cc and https://ofiwg.github.io/libfabric/
The vineyard client obtains the server's RDMA device info by calling fi_getinfo with the server IP address as a parameter.
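Because the client passes the server address to fi_getinfo, one thing that may be worth ruling out is slow resolution of that address inside the container. This is an assumption, not something confirmed in this thread. The sketch below only times the standard resolver from Python; time_resolution is a hypothetical helper and does not call libfabric.

```python
import socket
import time
from typing import List, Tuple


def time_resolution(host: str, port: int) -> Tuple[float, List[str]]:
    """Resolve host:port with getaddrinfo and return (seconds, addresses)."""
    start = time.perf_counter()
    results = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    elapsed = time.perf_counter() - start
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({item[4][0] for item in results})
    return elapsed, addresses
```

Comparing time_resolution('vineyardd-sample-rpc.admin', 9600) with the same call on the numeric server IP would indicate whether name resolution contributes to the delay.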
Hi, regarding the slow RDMA write and read speed: I found that the transmission stalls for over 100 milliseconds after every few hundred to 1000 packets. Two consecutive "RC send only qp=0x000274" packets are separated by over 100 milliseconds. I want to try changing tx_ctx_cnt and rx_ctx_cnt of verbs_domain_attr. Is that feasible, or is there another way? Thanks.
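The stalls described above can be located systematically once packet timestamps are exported from the capture. The following is a minimal sketch under stated assumptions: the timestamps are taken to be a list of seconds in capture order, and find_stalls is a hypothetical helper, not part of vineyard or libfabric.

```python
from typing import List, Tuple


def find_stalls(timestamps: List[float], threshold: float = 0.1) -> List[Tuple[int, float]]:
    """Return (index, gap_seconds) for every packet that arrives more than
    `threshold` seconds after the previous packet."""
    stalls = []
    for index in range(1, len(timestamps)):
        gap = timestamps[index] - timestamps[index - 1]
        if gap > threshold:
            stalls.append((index, gap))
    return stalls
```

The differences between consecutive returned indices give the number of packets sent between stalls, which can be compared against the "every few hundred to 1000 packets" observation.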
rpc_server.cc:112] Init RDMA failed!Create rdma server failed!