Auto IP function duplicates IP address

shaxxx commented 11 months ago

I have my supernode nice and running on Windows, with about 100 communities in community list. Each community has subnet defined in community list, ie.

example1 10.0.10.0/24 #EXAMPLE 1
example2 10.0.10.0/24 #EXAMPLE 2

Each community has one server available 24/7, on predefined IP address set by edge client (ie. 10.0.10.1) Other clients are dynamic, and don't have IP specified - they get it from supernode in range specified in community.list Each edge sends it's machine name with -I option. It's all was working perfectly, until recent problem when one of edge clients, let's call it XXX, in one specific community would get IP from supernode (ie. 10.0.10.15) but connection would not work and ipconfig /all would show problematic IP address marked as DUPLICATE. Looking at output from supernode managment port, there is no clients with that IP. If I keep disconnecting and connecting, edge client would eventually get some other IP and everything would work fine. Currently I worked around it by randomizing information sent with -I option.

Now, this is where things get really interesting. On this same subnet/community I have another client, lets call it YYY that usually gets IP allocated really close to the one from problematic client (IE. 10.0.10.16 or 10.0.10.17). And I believe this is the origin of the problem. Whenever I check managment port it shows YYY connected, but NOT with actual IP address used in community subnet. For example, not it shows 10.0.10.17, while actual address on TAP adapter of YYY is 10.0.10.16. I can't ping 10.0.10.17 from any edge client, but everything is fine with 10.0.10.16 Keep in mind, YYY is connected every morning, has no issues at all, can work with server on 10.0.10.1 just fine all day. And I can see it on managment port connected all the time but with invalid IP address (usually with offset +/- of one).

Conclusion, there is list of potential problems

auto dhcp fails to check for existing IP address properly
auto dhcp checks for existing IP properly but existing IP list of edges is not valid for some reason
managment is providing false data
I'm an idiot ?

Logan007 commented 11 months ago

Hey, thank you for describing the issue you discovered. Do you only use one supernode or do you use a federated supernode setup?

We do not use real DHCP but some other simple algorithm. As you pointed to the name string, that's exactly what we hash into an IP address, and go up/downwards in case it is already taken. This does not work in a distributed way yet.

shaxxx commented 11 months ago

No, I'm using single supernode, but with multiple communities, and like I said every community defines it's subnet in community.list and this is the first one with the problem (more than 100 communities).

Yeah, I did went trough PRs and code to see it's Pearson hash, but didn't really had any idea how to properly debug it since this runs in production and I can't simply take it offline for the sake of debugging. Also, I've tried verbose debug output on the client, but couldn't find anything that sticks out of normal.

If you need more info let me know.

Logan007 commented 11 months ago

I currently am a bit far away from the code, so I am guessing here until I get another chance to take a deeper look. But, what comes to mind is MAC address. As far as I remember, MAC address is assigned randomly if not specified by -m. Maybe some other computer remembers same IP address having previous MAC and communicates a duplicate or so? (Thinking of some ARP issue, but again, not sure here).

So, one thing I would try for testing is a fixed MAC address using -m at that edge.

Logan007 commented 11 months ago

Note to myself: also check the code for gratuitous ARP sent by edges with auto-IP on.

shaxxx commented 11 months ago

I will try with fixed mac address, but keep in mind I already tried removing all TAP adapters multiple times, as well as reinstalling TAP driver which changed MAC address of used adapter multiple times. Also I was so desperate and restarted (shutodown, then start) supernode once, and it helped, but only until the next morning. My supernode conf is

-p=7654
-c community.list
-F cloud
-M

shaxxx commented 11 months ago

I don't think it's ARP problem. Edge client is connected for almost entire day, problem is it get's one address from supernode (10.0.20.16) while in the same time management port reports it's using another one (10.0.20.17). There is no 10.0.20.16 in management port info, while it says 10.0.20.17 is connected. If I kill the edge management port will stop listing 10.0.20.17 as online (time will not be updated to current anymore). If I reconnect, it's the same thing, gets one address, reports the other in the management port. Here's the non verbose log from client

24/Jul/2023 06:12:36 [edge_utils.c:3236] adding supernode = n2n.XXXX.XXX:7654 24/Jul/2023 06:12:36 [edge.c:1029] WARNING: switching to AES as key was provided 24/Jul/2023 06:12:36 [edge.c:1069] starting n2n edge FIXME Tue Jun 6 20:43:42 CEDT 2023 24/Jul/2023 06:12:36 [edge.c:1075] using compression: none. 24/Jul/2023 06:12:36 [edge.c:1076] using AES cipher. 24/Jul/2023 06:12:36 [edge_utils.c:392] number of supernodes in the list: 1 24/Jul/2023 06:12:36 [edge_utils.c:394] supernode 0 => n2n.XXXX.XXX:7654 24/Jul/2023 06:12:36 [edge_utils.c:483] successfully created resolver thread 24/Jul/2023 06:12:36 [edge.c:1106] automatically assign IP address by supernode 24/Jul/2023 06:12:36 [edge.c:1178] send REGISTER_SUPER to supernode [n2n.XXXX.XXX:7654] asking for IP address 24/Jul/2023 06:12:36 [edge.c:1189] received REGISTER_SUPER_ACK from supernode for IP address asignment Open device [name={48A2FBDB-162F-4DA7-AA22-2F78A96E910A}][ip=10.0.20.16][ifName=OpenVPN][MTU=1290][mac=00:FF:48:A2:FB:DB] 24/Jul/2023 06:12:36 [edge.c:1215] created local tap device IP: 10.0.20.16, Mask: 255.255.255.0, MAC: 00:FF:48:A2:FB:DB 24/Jul/2023 06:12:36 [edge.c:1308] edge started 24/Jul/2023 06:12:36 [edge_utils.c:1169] successfully joined multicast group 224.0.0.68:1968 24/Jul/2023 06:12:37 [edge_utils.c:2569] [OK] edge <<< ================ >>> supernode

Logan007 commented 11 months ago

Okay, thanks for the report then. I will have a look when I get to it, more likely weeks than days. Did I understand correctly that it normally would be assigned the 15, and 16 already is the counted-up address?

shaxxx commented 11 months ago

I think it's actually count down address. Had time to bring up local supernode with same community.list and two edge nodes from virtual machine with same information used in -I option. I was able to reproduce that pearson hash for both client is the same 10.0.20.17 (before counting down or up). But did not had time to debug any further. I can send you client info strings if you want to reproduce it localy to test.

shaxxx commented 11 months ago

Good news, I was able to get 100% reproducibility on local machine. Steps to reproduce:

run supernode with supernode.conf and community.list attached in zip
run 1st edge (let's call it ZUPKA-KLIJENT) with zupka-klijent.conf attached in zip, edge will get IP 10.0.20.17 from supernode
run 2nd edge (let's call it SERVER2022) with server2022.conf attached in zip. Pearson hash on supernode will calculate 17 as first proposal, but ip_addr_available function will detect existing edge with same IP, so after going one downwards 2nd edge will get IP 10.0.20.16. All good so far.
kill 1st edge (ZUPKA-KLIJENT) with double CTRL+C
run 1st edge (ZUPKA-KLIJENT) again. Supernode will detect existing 10.0.20.17 and 10.0.20.16 edge clients, and after going downwards two times from original Pearson hash of 17, 1st edge (ZUPKA-KLIJENT) will get 10.0.20.15
list connected edges from supernode management port, you will see two edge clients, SERVER2022 with IP address 10.0.20.16 (correct), and ZUPKA-KLIJENT with IP address 10.0.20.17. (INCORRECT)
kill 2nd edge (SERVER2022) with double CTRL+C
run 2nd edge (SERVER2022) again. Supernode will detect existing 10.20.0.17 and 10.0.20.16 edge clients, but will not detect 10.0.20.15. So after after going downwards two times from original Pearson hash of 17, 2nd edge (SERVER2022) will get also 10.0.20.15 IP address and windows ipconfig will mark it correctly as duplicate.
leave both edges and supernode running for few minutes and list connected edges from management port. You'll notice that ZUPKA-KLIJENT is still online, but reports invalid IP address 10.0.20.17 instead of actual working 10.0.20.15

Conclusion: supernode does not takes in account changes in IP address in the list of connected edges which leads to wrong information on managment port and duplicated IP's on edge clients.

You will also find verbose logs from supernode and both edge clients (first and second start) attached in the zip file.

debug.zip

duplicate

Logan007 commented 11 months ago

Thank you very much for the detailed description! It will be an extremely good starting point for debugging. It could be related to the edge not properly de-registering (does it work better when you wait with restarting until the edge data is purged at supernode, around 90 to120 seconds after you quit the edge?) but also definitely includes an additional underlying bug.

I will take care of it as soon as I can, not before in three weeks though.

shaxxx commented 11 months ago

Currently when existing edge is matched by mac address there is additional check if socket has change, and if it has, then updates existing edge with new socket info.

Following this logic, I think it's safe in this case to update it's device address also. I added one line after socket update

memcpy(&(scan->dev_addr), &(reg->dev_addr), sizeof(n2n_ip_subnet_t));

I tested again, no duplicate IP's, and management port displays valid info. But you'll know better if my logic is reasonable enough for other use cases.

Logan007 commented 11 months ago

Looks extremely good to me, the update of the IP address will help auto-logic to always have an up-to-date list of actually occupied IP addresses. I currently do not remember how we transmit and handle not-automatically assigned IP addresses I and hope it does not conflict, but yes, your approach looks extremely reasonable. Only thing, not sure if it needs some checks before copying.

We obviously never thought of this case (IP address of node changed) in the update path. Good catch!

Logan007 commented 11 months ago

Do you want to test for a while and then provide a pull request?

shaxxx commented 11 months ago

I just tried to pull the latest changes to make PR and it seems this was fixed by you 2 years ago. Thought I was going crazy, I forked the repo on April 27th, and there is no IP address copy part in my fork But in current, repo, there it is, just like it's supposed to be. Not sure how this is possible, Git blame says it's2 years old.

Logan007 commented 11 months ago

So the bug has already been fixed? Does that revised version work for you?

shaxxx commented 11 months ago

Nah, I was looking at the wrong place, IP address info was updated in the part of code run when new edge is connected. I've simply copied that part to the place when socket change was detected. I'll push it to production tonight, and give it a week or so to see if there's any problems, and then come back to you. Thank you for your support!

Logan007 commented 11 months ago

Thank you for actively contributing to development!

hamishcoleman commented 11 months ago

Thanks from me too.

@Logan007 - should we backport this bugfix to the stable branch?

Logan007 commented 11 months ago

Sounds reasonable, as far as I can see it does not break compatibility.

shaxxx commented 10 months ago

Fixed with 1142

ntop / n2n

Auto IP function duplicates IP address #1139