hhu-bsinfo / hadroNIO

Transparent acceleration for Java NIO applications via UCX
GNU General Public License v3.0

[BUG] Some data is lost during transmission. #1

Open XLzed opened 1 year ago

XLzed commented 1 year ago

Describe the bug: Some data is lost during transmission. This causes an exception in the gRPC HTTP/2 deframer, and the Netty benchmark example hangs because it keeps waiting for all of the data.

Steps to Reproduce

Additional info

fruhland commented 1 year ago

Can you please provide some information on your test system? Especially, which type of network interconnect are you using (Ethernet, InfiniBand, etc.)? The only error I recognize is "Stream x does not exist" from gRPC, but for me, it only occurs on a specific system and the benchmarks work fine on other systems.

XLzed commented 1 year ago

I tested it locally on a machine that has no RDMA device, so the examples run over TCP only (I also set UCX_TLS=tcp).

System Info

Sequence Number Test

I also added a sequence number (seqNumber) to the head of each message for debugging, and found that some messages are lost or not retrieved correctly. Example log line: [WRN][HadronioSocketChannel] recv sequence number error, required [159], but get [290]
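A check of this kind could look roughly like the sketch below. It is not the code used in the test: the class name `SequenceTagger`, the 8-byte header layout, and both method names are hypothetical, and only the wording of the warning is borrowed from the log line above.

```java
import java.nio.ByteBuffer;

/** Hypothetical debugging helper: prepends and verifies per-message sequence numbers. */
final class SequenceTagger {
    private long nextToSend = 0;      // sender side: number for the next outgoing message
    private long nextExpected = 0;    // receiver side: number the next incoming message must carry

    /** Returns a new buffer holding an 8-byte sequence number followed by the payload. */
    ByteBuffer frame(byte[] payload) {
        ByteBuffer message = ByteBuffer.allocate(Long.BYTES + payload.length);
        message.putLong(nextToSend++);
        message.put(payload);
        message.flip();
        return message;
    }

    /** Consumes the header and reports whether the message arrived in the expected position. */
    boolean verify(ByteBuffer message) {
        long received = message.getLong();
        if (received != nextExpected) {
            System.err.printf("[WRN] recv sequence number error, required [%d], but get [%d]%n",
                    nextExpected, received);
            return false;             // the expected number is left unchanged on a mismatch
        }
        nextExpected++;
        return true;
    }
}
```

The sender would call `frame` before handing the buffer to the channel, and the receiver would call `verify` before parsing the payload. A gap such as "required [159], but get [290]" then shows up as soon as a message is skipped or parsed in the wrong position.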

I also tested between two machines that support RoCEv2, and the exception occurred there as well. Some information about the RDMA test environment:

I can use UCX and ibverbs to communicate directly, so maybe the tag_send/recv logic or the RingBuffer logic causes this problem?

XLzed commented 1 year ago

If I force sendTaggedMessage to be blocking, the examples work fine.

    // Original call:
    // final boolean completed = endpoint.sendTaggedMessage(sendBuffer.memoryAddress() + index, messageLength, tag, true, blocking);
    // Workaround: pass true instead of the blocking variable as the last argument.
    final boolean completed = endpoint.sendTaggedMessage(sendBuffer.memoryAddress() + index, messageLength, tag, true, true);

fruhland commented 1 year ago

Thanks for the detailed report. I will try to reproduce the issue and look into what's going wrong.

XLzed commented 1 year ago

It seems that operations with tag-matching semantics do not strictly complete in order. Maybe we have to handle out-of-order completion, or use different UCX semantics? I don't know whether the data is still received in the order in which the receive buffers were submitted when the tasks can't complete in order.

fruhland commented 1 year ago

According to this (https://github.com/openucx/ucx/issues/6370), tag matching messages will be received in order.

  Q: If I invoke ucp_tag_send_nb twice on the same ep, one after the other, will the two send requests be completed in the order they were invoked? Does it matter whether I use RC or not?

  A: They may be completed in a different order, but they will be matched in the same order on the receiver.

Yangfisher1 commented 2 weeks ago

We encountered the same problem, with the error "Frame of type 0 must be associated with a stream". It happened when testing a gRPC demo with hadroNIO swapped in, both over the TCP transport (on my local Mac) and over RDMA in a RoCEv2 environment. However, when we switched to an InfiniBand cluster, everything worked well. Has the problem been solved?

Yangfisher1 commented 2 weeks ago

The above is not correct. I found that the problem might be due to the size of the RingBuffer: when I reduce the size of the data transferred by gRPC, it works well.

Yangfisher1 commented 2 weeks ago

I don't know how to explain this, but when I set DEFAULT_BUFFER_SLICE_LENGTH to 16K, which is the default maximum frame size in HTTP/2, the problem disappeared.

XLzed commented 2 weeks ago

There are some bugs: the transport protocol the library implements does not guarantee that received data is processed in order. UCX's tag-matching semantics switch between the eager and rendezvous (rndv) protocols based on the data size, reducing latency by replacing multiple sends with a single RDMA read. This can delay the callbacks of messages that use the rndv protocol, but it does not affect the order in which buffers are received. However, the library uses the execution order of the callback functions as the parsing order of the received buffers, so the data ends up being parsed out of order.

So I switched to using JUCX directly for my use case, which also avoids the RingBuffer's data copy, but it requires more development work.
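To illustrate the ordering issue described above, here is a minimal sketch of one way to decouple parsing order from callback order: buffers are released for parsing in the order their receives were posted, however late an individual callback fires. It assumes the behaviour stated in this thread (receives are matched in posting order, only completion is reordered). The class `OrderedReceiveQueue` and its methods are hypothetical, it is not thread-safe, and it is neither hadroNIO's code nor a confirmed fix.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: releases receive buffers in posting order, not callback order. */
final class OrderedReceiveQueue {
    /** One posted receive; 'completed' is set when its completion callback has run. */
    static final class Slot {
        final ByteBuffer buffer;
        boolean completed;

        Slot(ByteBuffer buffer) {
            this.buffer = buffer;
        }
    }

    // Slots in the order their receives were posted.
    private final ArrayDeque<Slot> posted = new ArrayDeque<>();

    /** Call when a receive is posted, so the posting order is remembered. */
    Slot post(ByteBuffer buffer) {
        Slot slot = new Slot(buffer);
        posted.addLast(slot);
        return slot;
    }

    /**
     * Call from the completion callback of the given slot.
     * Returns the buffers that are now safe to parse, in posting order.
     * The list is empty while an earlier receive is still outstanding.
     */
    List<ByteBuffer> complete(Slot slot) {
        slot.completed = true;
        List<ByteBuffer> ready = new ArrayList<>();
        while (!posted.isEmpty() && posted.peekFirst().completed) {
            ready.add(posted.pollFirst().buffer);
        }
        return ready;
    }
}
```

With three posted receives a, b, and c, a callback for b that fires first releases nothing. The later callback for a then releases a and b together, so the parser never sees b before a.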

Yangfisher1 commented 2 weeks ago

@XLzed Thanks! The problem seems a little tricky.

Actually, we developed a version that uses JUCX directly to transmit data in gRPC. However, it requires modifying the RPC handler code, and we want a transparent solution, which is what this project seems to offer. But it looks like it is still far from being usable as-is :joy:

fruhland commented 2 weeks ago

We tested on different setups with ConnectX-3 and ConnectX-5 cards and never encountered this problem. It seems like InfiniBand cards are not affected by this.

Yangfisher1 commented 2 weeks ago

@fruhland I tested the demo on an IB cluster and a RoCEv2 cluster. I think it is not the "IB cards" but the "IB switch" that prevents the problem, because the RoCEv2 cluster also used CX6 cards, while its underlying transport was based on UDP rather than IB.