ntop / n2n

Peer-to-peer VPN
GNU General Public License v3.0
6.05k stars 925 forks source link

Edge not purging disconnected peers. #1146

Closed sduensin closed 6 months ago

sduensin commented 10 months ago

My test setup:

Starting and stopping the edge on "remote2" several times results in this from the "winchesterhouse" edge management console:

image

Every time I restart edge on "remote2" I end up with another entry.

What am I doing wrong?

hamishcoleman commented 10 months ago

The old entries should be purged. However, I see that the oldest entry on your list is 80 seconds already - did you happen to check again after this? (I cannot remember what the defined expiry in the code is, but it should happen within 2 minutes)

sduensin commented 10 months ago

That's just a quick/short example to show what's going on. There's also this one:

image

(And, no, I'm not an egotistical Linux God. The machine is named "scott-linux-godot".)

hamishcoleman commented 10 months ago

Yeah, that's definitely unexpected!

From your original screenshot, it looks like you are running v3.1.1, but that version number is a little unexpected as the dev branch has moved on from that point. Can you confirm what version you are running and how you compiled it?

Can you share the config you are using for both the edge and supernode?

With all that in hand, we might be able to reproduce the issue

sduensin commented 10 months ago

I'm trying to do some experiments now to see exactly what I've compiled and where it went sideways. I think I compiled the head of 3.1.1 from a week ago, but dumb me didn't keep the git metadata around so I don't know which commit it was. I'm going to re-pull and try a few different versions to see what is really going on.

The only thing different that I did that probably isn't recommended is my build only includes pthreads - no other external libraries.

hamishcoleman commented 10 months ago

Strange, simply compiling a git checkout of head should have resulted in more details showing in the version number shown (eg "3.1.1-235-gcd7b3b")

pthreads is not "not recommended" - just less of the autobuilds use it. I also dont believe adding threads should touch the code that does the entry purging either..

sduensin commented 10 months ago

I just did this:

mkdir n2nwork
cd n2nwork
git clone https://github.com/ntop/n2n.git
cd n2n
./autogen.sh
./configure --enable-pthread
make

The resulting binaries are still behaving as they were. Neither of these edge instances are currently running:

image

I'm using Linux Mint 21.2 and am confused. What else can I tell you?

hamishcoleman commented 10 months ago

Well, I think I can reproduce your results - so that gives us something to start with.

In my network, I dont tend to have edges restart often, so while I have not seen this behavior, maybe it has been happening previously.

I'll use this reproducible setup as a starting point for more investigations

sduensin commented 10 months ago

Also, so you know why I'm starting and stopping edges like a fool... I'm building a Hamachi-like application.

hamishcoleman commented 10 months ago

You know, it never occured to me to question why you were starting and stopping edges .. I think I just said "oh, its a large network"

On my test case, I am seeing the internal data structures are marked as not purgeable. Can you run ./scripts/n2n-ctl edges --raw and see if you have any entries other than ones that are "purgeable: 0"

hamishcoleman commented 10 months ago

I've been able to resolve my simple test case. Can you test the branch in #1147 against your config?

What I found looks to be an uninitialised structure member. A recent patch to start using Bool types inverted the true/false meaning and made the lack of initialisation significantly more visible in normal use.

sduensin commented 10 months ago

Yes! It's purging! It seems to take 80-90 seconds before it happens. I looked in the source and can't quite determine where this is coming from. Would it be a "Bad Idea (tm)" to reduce it?

Oh, and THANK YOU! :smile:

hamishcoleman commented 10 months ago

Good news. Keep running that one for a couple of days and report any issues you see. I'll merge it in a couple of days if nothing shows up.

As I understand it, the purging is a memory vs cpu cost tradeoff. Its basically just tuning the cache behaviour.

You should be able to tweak include/n2n_define.h with the PURGE_REGISTRATION_FREQUENCY and REGISTRATION_TIMEOUT definitions

Logan007 commented 6 months ago

As far as I remember, delayed purging indeed shall ensure overall lower CPU load by also tolerantly bridging any gaps caused by packet loss (we are talking UDP) or other network hick-ups.

Considered solved, closing.