Open dennis-tra opened 9 months ago
Just out of interest can you reproduce it when every host is given a unique and deterministic port, e.g. by using the loop count in newHost
Just tried it in this branch: https://github.com/dennis-tra/newstream-block/tree/port
No change - it still blocks sometimes.
I'm having trouble reproducing this locally. Could you try using an older libp2p version? We made some changes to these code paths recently.
cc @sukunrt
I tried it with v0.30.0, and it's also happening with that version. Haven't tried older versions yet.
Note again, that I also cannot reproduce it on my Mac. I can only semi-reliably reproduce it on my linux box.
Some racy behaviour with tcp simultaneous connect and reuseport with an Accept on the same socket as the reused port.
In the event that the dial times out the sequence of tcp conn establishment packets is:
1. A -> B: SYN
2. BListener -> A: SYN-ACK
3. BDialer -> A: SYN
4. A -> B: ACK
5. A -> B: Duplicate ACK for the SYN from B Dialer
The connection from A to B is established and is received in the Accept loop after step 4. But the BDialer keeps on sending SYN packets because it expects the remote to send a SYN which it never does only ACKing the received SYN packet.
Here the failure case happens for addresses /ip4/127.0.0.1/tcp/40797
/ip4/127.0.0.1/tcp/41331
I'll debug more how mac behaviour defers here. This is on go-libp2p v0.26, before all the swarm changes. Note: Most of these NewStream calls would be erroring on master as we do not support tcp simultaneous connect any more.
I wish I could tell you to use quic. But looks like there's a problem with simultaneous udp messages on linux :( I'm getting this error sometimes:
=== RUN TestBlock/[16]
fail: failed to dial: failed to dial 12D3KooWM8LoGTwRMZtq5zXpRVxUX9smFLEs9zfir5sqGSgjJec2: all dials failed
* [/ip4/127.0.0.1/udp/33017/quic-v1] INTERNAL_ERROR (local): write udp4 127.0.0.1:46972->127.0.0.1:33017: sendmsg: operation not permitted
If I add a 10ms delay to one of the goroutines this error goes away. Tried this on two different linux machines.
@marten-seemann any idea why this happens?
The patch if you want to test this:
diff --git a/main_test.go b/main_test.go
index 42df802..2b1e29f 100644
--- a/main_test.go
+++ b/main_test.go
@@ -12,7 +12,7 @@ import (
)
func newHost(t *testing.T) host.Host {
- listenAddr := libp2p.ListenAddrStrings("/ip4/127.0.0.1/tcp/0")
+ listenAddr := libp2p.ListenAddrStrings("/ip4/127.0.0.1/udp/0/quic-v1")
h, err := libp2p.New(listenAddr)
require.NoError(t, err)
@@ -38,12 +38,18 @@ func TestBlock(t *testing.T) {
wg.Add(2)
go func() {
defer wg.Done()
- _, _ = h1.NewStream(context.Background(), h2.ID(), "/any/protocol")
+ _, err := h1.NewStream(context.Background(), h2.ID(), "/any/protocol")
+ if err != nil {
+ fmt.Println("fail:", err)
+ }
}()
go func() {
defer wg.Done()
- _, _ = h2.NewStream(context.Background(), h1.ID(), "/any/protocol")
+ _, err := h2.NewStream(context.Background(), h1.ID(), "/any/protocol")
+ if err != nil {
+ fmt.Println("fail:", err)
+ }
}()
wg.Wait()
If I add a 10ms delay to one of the goroutines this error goes away. Tried this on two different linux machines.
@marten-seemann any idea why this happens?
This seems to be a standard library bug. I opened https://github.com/golang/go/issues/63322 with a minimal example.
In the TCP case, is there anything we can do in the B->A case when B waits for SYN packets from A? I would have expected that B will stop waiting for SYNs if the connection from A->B succeeds. Btw, the NewStream
calls succeed on my machine. They just take 5s longer. I guess further up the stack, the connection from A->B is identified and then used for the new stream. Haven't looked deeply into that.
I would have expected that B will stop waiting for SYNs if the connection from A->B succeeds.
Same. I'm not sure what's happening in the kernel here.
Btw, the NewStream calls succeed on my machine.
The call succeeds because of this line: https://github.com/libp2p/go-libp2p/blob/master/p2p/net/swarm/dial_worker.go#L376
Here the dial_worker_loop before exiting checks if somehow a connection has appeared, which in this case it has. This check exists because a concurrent dial might add different addresses to the peerstore and trigger the dial, this helps check if any of the newer addresses have succeeded.
For the solution, we need to ensure that a dial call also considers any incoming connections that have been established in the meantime.
I'd like to implement some variant of this logic: https://github.com/libp2p/go-libp2p/issues/1603#issuecomment-1160780017
The dial worker loop might be simplified by uncoupling the dial result from the specific dial call. Maybe by having an Add(ma.Multiaddr) function that joins a new dial job, and then Get() <-chan network.Conn function that returns a channel with all successfully established functions (maybe pass a filter function to Get)
either this or to setup a notification of incoming connections here: https://github.com/libp2p/go-libp2p/blob/master/p2p/net/swarm/dial_sync.go#L61 making it something like.
select {
case res := <-resch:
return res.conn, res.err
case conn := <-incommingConns[peer]:
return conn, nil
case <-ctx.Done():
return nil, ctx.Err()
}
Simultaneously calling
NewStream
blocks on one side until the dial timeout was reached. Minimum repro example:I cannot reproduce it on my Mac, but on my Linux server. I need to run the test a few times because it's only sometimes happening. This is an example output:
Here's a repository with the minimum repro example: https://github.com/dennis-tra/newstream-block If you go back a few commits, you can see my debugging attempts.
In the blocking case I saw that both hosts are listening on a TCP port but one host stalls at
maDial
. At the same time that stalled host accepts the dial attempt from the other host. So, there is a successful connection, but theNewStream
/Connect
call blocks until the dial attempt has timed out. When the dial has timed out,NewStream
seems to find the successful connection that the other host established, open a stream, and return it as everything worked fine.