nochowderforyou / clams

Clam Project
MIT License
62 stars 58 forks source link

Low network connections in 1.4.5, did not have problem in 1.4.3 #137

Closed arcoso closed 9 years ago

arcoso commented 9 years ago

Only getting 4 network connections.

I did not have this problem in the previous "stable" version 1.4.3

I hope you please look into it, and thank you for all your work!!

creativecuriosity commented 9 years ago

Awesome arcoso, thank you for taking the time to get involved :) Will look into it!

dooglus commented 9 years ago

I have two instances of v1.4.5 running:

$ clamd getpeerinfo | grep inbound | sort | uniq -c
     12         "inbound" : false,

$ clamd getpeerinfo | grep inbound | sort | uniq -c
      4         "inbound" : false,

It looks like neither has its p2p port forwarded to allow inbound connections, and so all the connections are outbound. This proves that v1.4.5 is capable of making more than 4 connections.

I'll try forwarding the port and seeing how it goes then.

...

Oh, this is interesting. The one with 12 outbound connections already had its port forwarded. I shut down the one with 4 connections and restarted it with -connect=w.x.y.z and it was able to connect.

The other one then showed:

$ clamd getpeerinfo | grep inbound | sort | uniq -c
     11         "inbound" : false,
      1         "inbound" : true,

I stopped the "-connect=" client, restarted it without arguments, and it didn't reconnect to the one with the 12 peers again.

So perhaps v1.4.5 clients are somehow failing to advertise themselves properly?

12 hours later, I still see no inbound connections even though if a peer tried to connect it would be accepted.

dooglus commented 9 years ago

I tried using -connect=... to connect to two different nodes.

When I connect to a node running an old version of the CLAM client, my local log shows:

receive version message: version 60014, blocks=329196, us=<myip>:<randomport>,
  them=<theirip>:31174, peer=<theirip>:31174

but when I connect to a peer running v1.4.5 I see:

receive version message: version 60014, blocks=329196, us=<myip>:<randomport>,
  them=0.0.0.0:31174, peer=<theirip>:31174

I always see them=0.0.0.0 on the peer receiving the connection, but on old versions of the client 'them' was non-zero on the peer initiating the connection.

I wonder if that's significant.

dooglus commented 9 years ago

I've hit a bit of a dead end.

It seems like the reason we're not getting incoming connections is because we can't figure out our own external IP address.

If you supply -externalip= then it all works fine.

4d8f0a58 chanced how we discover our external IP address. It's based on https://github.com/bitcoin/bitcoin/pull/5161

According to the guys in #bitcoin-dev, a node behind a NAT router with a port forwarded will eventually start getting incoming connections without us having to manually provide the public IP address.

I'm currently syncing the Bitcoin blockchain to see that actually happen. Then I'll be able to find out how it happens. And then I'll see why it doesn't happen for CLAM...

dooglus commented 9 years ago

I think I found the problem.

4d8f0a58 writes:

    pfrom->addrLocal = addrMe;
    if (pfrom->fInbound && addrMe.IsRoutable())
    {
        SeenLocal(addrMe);
    }

then c8f8b500 removes that code completely. There's no mention of SeenLocal() after that commit. This is probably the cause of the problem. That commit is removing 'addrSeenByPeer' and accidentally removes code around 'SeenLocal()', which is unrelated.

Then 66493570 puts the code back in, but uses the old version from before 4d8f0a58, with the assignment to addrLocal inside the condition:

    if (pfrom->fInbound && addrMe.IsRoutable())
    {
        pfrom->addrLocal = addrMe;
        SeenLocal(addrMe);
    }

That means that for outbound connections we don't have a record of what the peer thinks our IP address is, and so IsPeerAddrLocalGood(pnode) in AdvertizeLocal() fails, and so we never get to advertize our IP address, it never gets onto the network, and so nobody attempts to make a connection to us.

The simple fix is to move the:

        pfrom->addrLocal = addrMe;

up 3 lines.

But a more serious problem is understanding how these kinds of errors happen. How are big chucks of code changing, disappearing, then changing back to how they started?