sudomesh / sudowrt-firmware

Scripts to build the sudo mesh OpenWRT firmware.

Handle roaming clients #57

Closed Juul closed 6 years ago

Juul commented 9 years ago

By roaming clients I mean clients (e.g. smartphones) that move between different peoplesopen.net access points.

Since we're now using a layer 3 routing protocol we don't get free roaming ;'( but on the other hand we don't have the MAC address tracking problem that we had with batman-adv (clients could be tracked by their MAC across the entire network) so \o/

I'm fairly sure (though I haven't yet checked) that a client does not emit any DHCP packets when it moves from one access point to another with the same SSID. This means that we'll have clients with an IP from one node's /26 that are associated to a node with a different /26. If these clients transmit packets then the packets will reach their destination (barring firewall rules) but responses will be routed to the wrong node.

The first thing we can do to mitigate this situation is to set a low DHCP lease time. Maybe as low as 1 minute. That way roaming clients will only stop working for at most 1 minute when they roam.

The real solution will be to implement a daemon that detects when clients roam to a new node and tells babeld to announce a /32 for that client. Since the /32 is more specific than the /26 announced by the lease-giving node, this should work as expected. When the client deauthenticates from its node (which it does by sending deauthentication packets) the daemon should tell babeld to retract the announced /32. Since things can go wrong and we don't want to have a build-up of stale /32 routes announced by babeld, I suggest that we add a TTL to these /32 routes such that babeld will automatically retract them after the same amount of time as the DHCP lease time. This is acceptable since we know that the client will already have gotten a new IP in the /26 of its currently associated node by the time the DHCP lease has expired.
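
A minimal sketch of why the /32 wins over the /26, and of how an expiry time keeps stale host routes from piling up. This is only a model of longest-prefix matching, not babeld's actual data structures; the `Route` class, the `lookup` function, the node names and the 100.64.1.x addresses are all assumptions made for illustration.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    prefix: ipaddress.IPv4Network
    announced_by: str                   # illustrative node name
    expires_at: Optional[float] = None  # None means the route never expires


def lookup(routes, dest, now):
    """Pick the announcing node for dest by longest-prefix match.

    Routes whose expiry time has passed are ignored, which models the
    proposed TTL on the per-client /32 routes.
    """
    addr = ipaddress.ip_address(dest)
    candidates = [
        r for r in routes
        if addr in r.prefix and (r.expires_at is None or now < r.expires_at)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.prefix.prefixlen).announced_by


routes = [
    # The lease-giving node announces its whole /26.
    Route(ipaddress.ip_network("100.64.1.0/26"), "a01"),
    # The node the client roamed to announces a /32 until the lease expires.
    Route(ipaddress.ip_network("100.64.1.22/32"), "a02", expires_at=42.0),
]
```

While the /32 is live, traffic for the roamed client goes to `a02`; once the expiry time passes, lookups fall back to the /26 announced by `a01`.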

One problem we'll have to tackle: The 802.11 authentication and deauthentication happens on layer 2 so the packets won't include the IP address (as far as I know) of the client but we need the IP in order to announce the /32 route with babeld. I'm not sure how best to deal with this. We might want to look at how existing proprietary solutions solve it. It's always possible to ping 0.0.0.0 and see who responds and what their MAC is, but that's not optimal since it means all associated clients have to respond to a ping request every time someone new authenticates.

To implement this we'll need:

Once we've implemented this solution we should think about having longer DHCP leases since any VOIP call or similar will break on lease expiration if the clients roamed since they got the lease. There is a solution that completely fixes this issue: Modify the DHCP server on the nodes to relay DHCP renew requests to the DHCP server on the node responsible for the relevant /26.

How important is this issue? Certainly not having roaming seems like a big deal, but wifi clients tend to be reluctant to switch to a new AP, even when the client is moved away from the original AP to a point where another AP has a better signal. This is likely done to prevent "flapping" between APs. In some cases it sucks but for a mesh without roaming support this is great since people who stay in the same physical area (e.g. a room) probably won't ever roam.

Since this "stickiness" is sometimes problematic for other reasons these standards were created to address the issue:

Juul commented 9 years ago

Hm. To find the IP of a newly associated client when you only have the MAC address maybe it's possible to open a raw socket and construct an ICMP ping packet with layer 2 destination address set to the known MAC address and IP destination address set to the network subnet address.

(a few searches later)

Turns out someone already wrote a tool that does exactly this. It's Thomas Habets' arping. Beware that the arping utility included in debian-based distros is a completely different program. To get the functionality described above call arping like so:

arping -T <subnet_broadcast_address> <mac_address>

This is problematic since we don't even know the subnet of the client (we know the /10 subnet but not the actual subnet of the IP bound to its wifi interface). We may have to fall back on broadcast ping but it sounds like some operating systems (windows?) possibly do not respond correctly to pings to 0.0.0.0 (only to the correct subnet broadcast address (e.g. 192.168.1.255)). We should test this, but it is increasingly seeming like this won't work :(

There is also a completely different way of discovering the IP of the roaming node: Have all nodes communicate client 802.11 reassociation (roaming) events to all other nodes and at the same time ask "what is the IP of this MAC". When the node that received the reauth request gets an answer to "what is the IP of the roaming node" it can start announcing a babeld route. This seems to be similar to the strategy employed by proprietary layer 3 roaming solutions (except they don't mesh so it's more involved for them). Unfortunately this brings back the MAC tracking issue since all roams are being multicast with the client MAC to whoever cares to listen. We can maybe mitigate this with salted hashes of the MACs but they're only 48 bits and even less if you can guess the vendor so brute-forcing the hash unfortunately may be feasible. We should look into ways of mitigating this. It might make sense to add the re-association message relaying functionality to babeld itself (hopefully in some modular fashion), since we then don't have yet another multicast system that needs reflecting/forwarding.
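
To make the brute-force concern concrete, here is a rough sketch. It assumes SHA-256 as the hash and a salt that every listener knows (both are assumptions, the thread does not pick either), and the function names are hypothetical. For speed the demonstration only searches the last two bytes; with just the 3-byte vendor prefix known, the same loop would cover 2^24 candidates.

```python
import hashlib
from itertools import product


def salted_mac_hash(mac: bytes, salt: bytes) -> str:
    """Hash a 6-byte MAC with a salt (SHA-256 is an assumption)."""
    return hashlib.sha256(salt + mac).hexdigest()


def recover_mac(target: str, salt: bytes, known_prefix: bytes):
    """Try every possible suffix after known_prefix until the hash matches."""
    missing = 6 - len(known_prefix)
    for tail in product(range(256), repeat=missing):
        candidate = known_prefix + bytes(tail)
        if salted_mac_hash(candidate, salt) == target:
            return candidate
    return None


FULL_SPACE = 2 ** 48        # every possible MAC address
OUI_KNOWN_SPACE = 2 ** 24   # candidates left once the vendor prefix is guessed
```

Since the salt has to be shared for nodes to compare hashes at all, it does not slow an attacker down; only the size of the remaining search space does.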

sigh mesh is hard

max-b commented 9 years ago

So one issue with adding arbitrary /32 routes to home nodes is that babeld doesn't currently support runtime configuration (except for the mods that you've made and could probably make) AND that there's no real expiration parameter. We'd have to add extra functionality to our daemon which would keep track of when DHCP leases are going to expire (or listen to DHCP NACK events) and would remove the routes from the home node's babel configuration.

Ideally, the daemon would listen to 802.11 authentication events, find the IP with that fancy arping tool, and then send a FORCERENEW frame, which would ask the client to get a new lease that would be on the proper subnet. Unfortunately, it looks like Windows 7/8 does not support this: https://social.technet.microsoft.com/Forums/windowsserver/en-US/1bb50932-29bc-4446-a1a4-081037207f36/dhcp-forcerenew-message?forum=winserverNIS and looking through dhclient's source I don't think they support it either, so it's probably a non-starter.

Another possibility is that we use the arping to determine whether the newly associated client is on the proper subnet and if it isn't, we send a 802.11 deauth packet. Who knows how any particular client will respond to the deauth - it's possible that might produce a shitty user experience/interfere with roaming. I was digging through network-manager and wpa_supplicant and I couldn't get an entirely clear read on how that might be handled.

Another thought - instead of arping, couldn't we listen for packets on the open interface which have a src address that isn't on the correct subnet? We'd still need to decide how to tell the client to get a new lease, but that seems simpler than the arping idea....
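
A small sketch of that last idea: given (MAC, source IP) pairs observed on the open interface, flag the clients whose source address is inside the mesh range but outside this node's own subnet. How the packets get captured is left out entirely; the function name, the sample /26 and the 100.64.0.0/10 mesh range used in the example are assumptions.

```python
import ipaddress


def find_roamed_clients(observed, local_subnet, mesh_range):
    """Return {mac: ip} for sources that belong to some other node's subnet.

    observed is an iterable of (src_mac, src_ip) pairs seen on the open
    interface. Sources outside the mesh range are ignored rather than
    treated as roamers.
    """
    local = ipaddress.ip_network(local_subnet)
    mesh = ipaddress.ip_network(mesh_range)
    roamed = {}
    for mac, ip in observed:
        addr = ipaddress.ip_address(ip)
        if addr in mesh and addr not in local:
            roamed[mac] = ip
    return roamed


observed = [
    ("aa:aa:aa:aa:aa:01", "100.64.2.10"),  # local client
    ("aa:aa:aa:aa:aa:22", "100.64.1.22"),  # lease from another node's /26
    ("aa:aa:aa:aa:aa:99", "192.168.1.5"),  # not a mesh address at all
]
roamed = find_roamed_clients(observed, "100.64.2.0/26", "100.64.0.0/10")
```

This also yields the MAC-to-IP mapping that the arping approach was trying to obtain, though only once the client actually sends something.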

max-b commented 8 years ago

jerkey has mentioned that this is an issue for him:

15:49 < jerkey> when i walk with android from upstairs tmp node to the one in juul's room, phone still thinks it's connected but cant reach the internet
15:50 < jerkey> err_host_unreachable until i turn off wifi and turn back on
22:45 < jerkey> max-b: if you don't go with babeld/batman-adv, i think you should make the nodes automatically push a new lease, rather than have the user simply experience a dead connection and have to restart networking.
23:41 < jerkey> it's really annoying, i had to disable peoplesopen.net on my phone because otherwise textsecure didn't work
23:42 < jerkey> even if it's annoying hack you should still do it

max-b commented 8 years ago

Ok so we might have a decent solution to this from Mitar's Cloyne config: https://github.com/cloyne/network/tree/master/unifi

In particular this section:

#!/bin/sh

. /lib/functions.sh
. /lib/functions/network.sh

# Disable legacy 2.4GHz low bitrates
if [ ifup = "$ACTION" ]; then
    case "$DEVICE" in
        wlan*)
            logger setting bitrate for device "$DEVICE" on interface "$INTERFACE"
            iw "$DEVICE" set bitrates legacy-2.4 6 9 12 18 24 36 48 54
        ;;
        br-*)
            # Bridged interface, check if any wifi interface is member
            for i in $(ls /sys/class/net/$DEVICE/brif); do
                case "$i" in
                    wlan*)
                        logger setting bitrate for device "$i" on interface "$INTERFACE"
                        iw "$i" set bitrates legacy-2.4 6 9 12 18 24 36 48 54
                    ;;
                esac
            done
        ;;
    esac
fi

max-b commented 8 years ago

Ok so I think that https://github.com/sudomesh/makenode/commit/d173878d1b3674ff5a76631ea68fe6a7b923a956 will improve roaming at least for home nodes.

We need to figure out where to put the stanzas for extender nodes (not to mention potentially try to re-write our meshrouting script to be more modular...)

Juul commented 8 years ago

Well this solves a different problem. I believe this will make it more likely that clients associate with a different AP when moving around (and also drops support for 802.11b) but it won't make them get a new lease, so data will still be routed wrong.

The easy and hacky solution for right now would be to make every peoplesopen.net access point have a slightly different SSID ("peoplesopen.net hearth1", "peoplesopen.net sudo", etc.) meaning that users would have to manually associate with each one. This makes sense: If roaming isn't actually supported then we shouldn't encourage clients to auto-roam between nodes. Maybe I should move roaming support to the top of my mesh todo list. It is a very interesting problem.

max-b commented 8 years ago

Ah - right of course. The drawback is that if you go to a new location, you have to connect to a new "peoplesopen.net xyz" SSID. That's more of a general org/logistics question.

I'm not sure that you need to make this the top of your mesh todo list. Let's clear up the remaining v0.2 milestones. At the very least, position this issue somewhere within the tickets/issues hierarchy you created eh?

Juul commented 8 years ago

Hm yeah it's already set to 0.4 but maybe I should move it up.

max-b commented 8 years ago

Yeah I think it's fine either way. Looking at what we've ticketed for 0.3 and 0.4, there's plenty to do :D

Juul commented 8 years ago

These are my early thoughts on a layer 3 roaming solution that does not rely on centralization for roaming coordination and does not use tunnels. There are two variations: one using additional babel route announcements and one using layer 3 NAT.

The solutions are identical until step 9. Here are the first 8 steps:

  1. client with IP 1.22/26 and MAC c22 is associated to mesh node with IP 1.1/26 and MAC a01 and has a DHCP lease for 1.22 that expires at time 42. Client has its default gateway set to 1.1
  2. client disassociates from node a01 (first part of roam)
  3. node a01 remembers information about client for some short amount of time. dhcp lease for 1.22 remains in effect until time 42.
  4. client associates with node a02 which has IP 2.1/26
  5. node a02 sends multicast message with a TTL of 3 asking "did a client with MAC c22 recently disassociate from any nearby mesh nodes?"
  6. node a01 responds to node a02 saying "yes client c22 just left me. it has the IP 1.22/26 and expects a default gateway with IP 1.1/26 and MAC a01 and it has a DHCP lease that expires at time 42"
  7. node a02 enters promiscuous mode and sets up the ebtables rule: "For any traffic from a02 (myself) to c22 change source MAC to a01"
  8. node a02 sets up an arptables rule: "answer ARP requests from c22 for 1.22 with answer a02"

Here is the variant that uses an additional babel route:

  9. node a02 starts announcing a new babel route saying "send traffic bound for 1.22/32 to me" which expires at time 42
  10. client c22 sends a packet to 8.8.8.8 and the packet has destination MAC a01 (since a01 is the MAC of the client's configured default gateway so it will have an entry for it in its ARP table)
  11. node a02 receives the packet because it is in promiscuous mode and forwards the packet according to its routing table
  12. someone, somewhere sends a packet destined for 1.22
  13. the packet is routed to node a02 with IP 2.1/26 because its announced 1.22/32 route is more specific than the 1.1/26 route announced by node a01
  14. node a02 forwards the packet to client c22
  15. client roams to another node
  16. node a02 detects the disassociation and tears down all c22-specific ebtables and arptables rules and retracts the 1.22/32 babel route.

This solution is probably superior to the solution below since it has no side-effects that I can see.

Here is the layer 3 NAT variant:

  9. node a02 asks its own DHCP server for a lease expiring at time 42. It receives a lease for 2.22.
  10. node a02 sets up iptables rules: "layer 3 NAT traffic from 1.22 to appear to come from 2.22".
  11. client c22 sends a packet to 8.8.8.8 and the packet has destination MAC a01 (since a01 is the MAC of the client's configured default gateway so it will have an entry for it in its ARP table)
  12. node a02 receives the packet because it is in promiscuous mode, it changes the source IP from 1.22 to 2.22 and forwards the packet according to its routing table
  13. someone, somewhere sends a packet destined for 2.22
  14. the packet is routed to node a02 with IP 2.1/26
  15. node a02 changes the destination IP to 1.22 and forwards the packet to client c22
  16. client c22 roams to another node
  17. node a02 detects disassociation and tears down all ebtables, arptables and iptables rules and releases the DHCP lease.
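
The address rewriting in the NAT variant can be modeled roughly as follows. This is a toy model of the translation table only, not iptables configuration; the `Packet` and `RoamNat` names are hypothetical, and the full 100.64.x.y addresses stand in for the thread's shorthand 1.22 and 2.22.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


class RoamNat:
    """Per-client one-to-one NAT kept by the node the client roamed to."""

    def __init__(self):
        self._to_local = {}     # client's original IP -> IP leased locally
        self._to_original = {}  # IP leased locally -> client's original IP

    def add_client(self, original_ip, local_ip):
        self._to_local[original_ip] = local_ip
        self._to_original[local_ip] = original_ip

    def remove_client(self, original_ip):
        # Called on disassociation, alongside releasing the DHCP lease.
        local_ip = self._to_local.pop(original_ip, None)
        if local_ip is not None:
            del self._to_original[local_ip]

    def outbound(self, pkt):
        """Rewrite the source of traffic coming from a roamed client."""
        return replace(pkt, src=self._to_local.get(pkt.src, pkt.src))

    def inbound(self, pkt):
        """Rewrite the destination of replies back to the client's own IP."""
        return replace(pkt, dst=self._to_original.get(pkt.dst, pkt.dst))


nat = RoamNat()
nat.add_client("100.64.1.22", "100.64.2.22")
sent = nat.outbound(Packet(src="100.64.1.22", dst="8.8.8.8"))
reply = nat.inbound(Packet(src="8.8.8.8", dst="100.64.2.22"))
```

The client never learns about the second address, which is exactly why applications that advertise their local IP to peers would hand out an address nobody can reach.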

The only thing I can see breaking from this is when some P2P apps (skype, torrent clients, etc.) check which local IPs are available and then use layer 7 protocols to tell others about these IPs. The client will think it has IP 1.22 but really no-one will be able to reach it at that IP. I believe this will break mesh to mesh web-rtc :(

papazoga commented 8 years ago

Interesting. In step 7, presumably you mean "change source MAC to a01". I'm averse to the layer 2 NAT and the promiscuous mode, so here's an alternative that may work without it. Also, I don't think it breaks the P2P apps because the client address won't be NATted.

A conventional solution used when you have multiple APs with the same SSID is to give them all the same IP address and then fix the MAC-layer addressing issue with "gratuitous ARPs", sent by the new AP upon association. This says in effect "the AP now has MAC address XXX". This apparently works. I'm not sure if Linux does this on its own, or if we have to send the packets ourselves, but that's not a huge issue.

So an alternative would be: configure all AP interfaces to the same fake "gateway address," which is then NATted to the node's normal address. This would also mean that all clients would probably be assigned addresses from the same "DHCP subnet" which could then be divided up to the DHCP servers on each node, or just served by (a) central DHCP server(s).

To get the MAC addresses changed on the client side, issue gratuitous ARPs from the new base station as soon as association takes place. Then the problem for the AP is how to figure out the IP of the new client in order to distribute the new route.

There's a chance that some clients will issue gratuitous ARPs when they associate. This would be conservative good practice, and I've seen them when, e.g. a Mac client associates to a new SSID. I'm not certain they occur when associating to a new base station with the same SSID, so I'll run an experiment. This would be one way.

A second way (fallback?) would be to broadcast ping on the DHCP-allocated subnet (with broadcast MAC). This is how arping works. I don't know if Windows/Mac/Phablet clients respond to these, but this might work ok if we combine it with the previous (ARP-based) method.

This will give the nodes internet connectivity on the assumption that just adding the route to babel magically works, meaning that the stale route to the old AP doesn't interfere destructively. Otherwise there is the problem of erasing the stale route, which will require some combination of (1) some form of inter-node communication (babel already provides a means for this: the new route is distributed, can't the old AP use that information to erase the old one?), (2) dropped frame/disassociation detection, and as juul discussed above (3) TTLs for the /32 babel routes.

There is also a second problem, which we can put off thinking about until later, since it will only improve quality of service: if you want a seamless handover, you will need to establish a (most likely GRE) tunnel to the old gateway. How do you discover the address of the old gateway? Again, some form of inter-node communication is probably necessary, though maybe just using babel's routing data is enough (unlikely).

I think promiscuous mode should be avoided on the AP. Turning it on would mean all frames end up in the kernel netfilter. The nodes (even new ones) have limited CPU capacity.

papazoga commented 8 years ago

Didn't mean to close, and apparently it's irreversible :-(.

Juul commented 8 years ago

Interesting. In step 7, presumably you mean "change source MAC to a01".

Yep. Fixed.

A conventional solution used when you have multiple APs with the same SSID is to give them all the same IP address and then fix the MAC-layer addressing issue with "gratuitous ARPs", sent by the new AP upon association.

Really?! If this works then this negates the need for both promiscuous mode and layer 2 NAT in my proposed solution! I then don't see why we need to have the access points all have the same IP?

Regarding what you're saying about discovering the IP of associating clients, I'm totally up for running some experiments but we're definitely in "here be dragons" territory. There is nothing in any widely implemented standard that I am aware of that allows for this. All of my research on the subject has come up with solutions that only work for a subset of operating systems. If we can find a way to do this then that would be very useful. Actually, if we find a solution for this that does not require communication with other nodes and the gratuitous ARP solution works as well then implementing this roaming system suddenly becomes much simpler than I expected.

papazoga commented 8 years ago

You would want all access points to appear to have the same IP so that clients don't need to figure out their new gateway address (which they can't, without DHCP). This way when they receive the gratuitous ARP they just change the MAC address associated to the "fake gateway" IP to reflect the new data and proceed as before.

Discovering the IP address of the new client will require testing. But assuming the client starts treating the new AP as its gateway after association, the AP will (one way or the other, even if the client doesn't gratuitous ARP, and even if we can't broadcast ping it) receive packets from the client's IP with the client's MAC address and so the association is possible. It's a matter of how hard the implementation will be (e.g. do we need another daemon, what to listen for, etc).

I did some googling, and these people seem to have done almost exactly what I'm suggesting (but I only skimmed): https://www.usenix.org/legacy/events/mobisys06/full_papers/p83-amir.pdf

Juul commented 8 years ago

You would want all access points to appear to have the same IP so that clients don't need to figure out their new gateway address (which they can't, without DHCP). This way when they receive the gratuitous ARP they just change the MAC address associated to the "fake gateway" IP to reflect the new data and proceed as before.

Yes, but it doesn't matter if they think the gateway is 1.1 even though it's really 2.1 as long as they send to the correct MAC. The default gateway is a layer 2 thing. Setting a default gateway just means "if this packet is destined for outside of your subnet and you don't know anything else about how to route this packet, then send it to the MAC that belongs to this IP". There is no special "via" field in the IP header, there is just source and destination.

So if a client believes that its default gateway is 1.1 but it is actually sending to 2.1 (due to gratuitous ARP packets) then the layer 3 packets going between the client and the gateway will be 100% identical to what they would have been if it was actually talking to 2.1 or had actually configured 2.1 as its gateway.

The only thing that becomes weird in this scenario is broadcast packets. Broadcast packets from the client will be received at layer 2 by everyone. If the broadcast packets are using 0.0.0.0 as their source and 255.255.255.255 as their destination then they will work correctly. If they on the other hand are using the client's IP 1.22 as the source and 1.63 as the destination (the broadcast address of the /26) then we have a problem. Personally I think maintaining broadcast functionality during a roam is kind of an unnecessary luxury.
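
A quick worked example of the broadcast mismatch, using Python's `ipaddress` module. The full addresses are assumptions standing in for the shorthand above (100.64.1.22/26 for the client, 100.64.1.1/26 and 100.64.2.1/26 for the old and new nodes), and `broadcast_understood_by` is a hypothetical helper.

```python
import ipaddress

LIMITED_BROADCAST = ipaddress.ip_address("255.255.255.255")


def broadcast_understood_by(node_iface, dst):
    """True if a node with this interface address treats dst as a broadcast."""
    node = ipaddress.ip_interface(node_iface)
    dst = ipaddress.ip_address(str(dst))
    return dst == LIMITED_BROADCAST or dst == node.network.broadcast_address


client = ipaddress.ip_interface("100.64.1.22/26")
# Broadcast address of the client's original /26 (the "1.63" in the text).
client_broadcast = client.network.broadcast_address
```

The old node (100.64.1.1/26) recognizes the client's subnet broadcast address, while the new node (100.64.2.1/26) only recognizes 255.255.255.255 and its own subnet's broadcast address.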

If we really want to have broadcast working then we need to decide whether it makes more sense to have broadcasts routed between the client and its previous subnet or whether it should work for its current subnet. If we were concerned about breaking an existing connection then we'd want to have broadcasts routed to a client's previous subnet. That would require a tunnel (with some filtering to ensure that DHCP doesn't make it through). However, realistically there aren't really any protocols that use broadcast packets in a way where you'd be "breaking a connection" (and if there are then shame on them). Mostly they are used for announcing and discovering things that are specific to a subnet. If we wanted to maintain that kind of functionality while roaming then it would be enough for the client's current node to layer 3 NAT broadcast packets from the client to appear to come from the gateway itself. That would mean that all broadcasts from a roaming client got broadcast twice and all replies would be sent twice as well. Given the nature of most broadcast traffic I would think that such a solution is acceptable. Though again I don't think it's necessary to support broadcast while roaming.

With regards to discovering client IP, it really isn't that big a deal to send a multicast packet with a TTL of e.g. 3 saying "hey any nodes who have a DHCP lease for this MAC?". I'm also very curious how you'd discover a roaming event without talking to other access points (how do you know this wasn't the client, and how would you retract the /32 that was broadcast by the client's previous node when the client roams more than one node away?). Another nice feature of asking nearby nodes is that it gives us the lease expiration time, which is useful for setting an expiration time on the /32 babel route for the client.
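
A rough sketch of what such a query and its answer could look like. The JSON message layout, the field names and the `who-had`/`had` message types are invented for illustration; nothing in the thread defines a wire format. Only the socket option `IP_MULTICAST_TTL` is a real API, used here to limit the query to 3 hops.

```python
import json
import socket


def make_query_socket(ttl=3):
    """UDP socket whose multicast packets are dropped after ttl hops."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


def build_query(client_mac):
    """'Did a client with this MAC recently have a lease from you?'"""
    return json.dumps({"type": "who-had", "mac": client_mac}).encode()


def build_reply(client_mac, client_ip, gateway_ip, gateway_mac, lease_expires):
    """Answer from the previous node, including the lease expiry time."""
    return json.dumps({
        "type": "had",
        "mac": client_mac,
        "ip": client_ip,
        "gateway_ip": gateway_ip,
        "gateway_mac": gateway_mac,
        "lease_expires": lease_expires,
    }).encode()


def route_expiry_from_reply(raw):
    """Extract the client IP and the time at which its /32 should expire."""
    msg = json.loads(raw.decode())
    return msg["ip"], msg["lease_expires"]
```

Sending would be a single `sock.sendto(build_query(mac), (group, port))` to whatever multicast group and port the nodes agree on; both are left open here.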

I am confused about your solution. You say:

So an alternative would be: configure all AP interfaces to the same fake "gateway address," which is then NATted to the node's normal address. This would also mean that all clients would probably be assigned addresses from the same "DHCP subnet" which could then be divided up to the DHCP servers on each node, or just served by (a) central DHCP server(s).

That doesn't make any sense to me. If you're giving all nodes the same IP on the open0 interface (where the clients connect), let's say 100.64.42.42, and you still want the nodes to give out mesh-unique addresses to each node, then how do the subnets work out?

Example 1:

Example 2:

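This may be what you're getting at but I'm not sure:

  • Nodes would no longer each have a /26 they would all have a single /32 address
  • Nodes would announce their /32 using babel
  • Nodes would have e.g. 172.28.1.1/24 on their open0 interface where clients connect and would hand out IPs to clients from this space
  • Clients would be NATed to appear to come from the nodes /32
  • Clients would just need to have their ARP table entry for that IP updated when roaming
  • When a roam is detected a /32 for the client gets announced by babel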

Advantages:

Disadvantages:

I'd be much less excited about a mesh where clients are NATed, and would probably prefer to drop roaming support over having to NAT all clients.

I haven't yet read the paper you linked but will get on it.

papazoga commented 8 years ago

There is no special "via" field in the IP header, there is just source and destination.

Yes, I do have a basic understanding of IP, and wasn't suggesting anything involving a "via" field.

There's two reasonable ways to do this proxy ARP thing I'm suggesting. In one, all the gateways have the same IP address and they all impersonate one "fake" gateway. In the other they all have different IP addresses, and they all have to impersonate each other. There's a trade-off in some of the complexities, but they're similarly complex as a whole.

You appear to be suggesting the second way. The new AP uses gratuitous ARP to impersonate the old AP's address (a form of proxy ARP). That's certainly workable, and perhaps easier to transition to from our present configuration. But there's something that won't be as simple to deal with: the destination AP needs to know the IP of the source AP in order to spoof it (or it can spoof all of them, which is just a lazy kludge).

Also in that approach, the client will lose its DHCP lease after roaming, and rather quickly. So its address will be more temporary. There's a third problem which isn't as obvious: the node will have to figure out the roaming client's IP address. To do that, the best way is probably ARP. But how will it know to ARP the client on its br-open if it's configured on the wrong subnet? This isn't insurmountable, but the other approach (where all clients are on the same subnet to begin with) gives a neater solution, in my opinion.

Any way we go, we'll need some form of inter-node communication (multicast or babel-extension). It's a matter of coming up with one that is robust enough.

To set the record straight about what I'm proposing:

This may be what you're getting at but I'm not sure:

  • Nodes would no longer each have a /26 they would all have a single /32 address

Not quite. They would all operate on a single DHCP-allocated subnet. The routing would all be done with /32 prefixes though.

  • Nodes would announce their /32 using babel

Well, the nodes would announce /32 prefixes for the clients, if that's what you mean.

  • Nodes would have e.g. 172.28.1.1/24 on their open0 interface where clients connect and would hand out IPs to clients from this space

Absolutely not. In fact no NAT is required (I thought I'd have to NAT the node's "fake gateway" address, but actually that's not necessary). Nodes would be assigned an address and then keep it as they roam about.

  • Clients would be NATed to appear to come from the nodes /32

Nope. The client addresses would be mesh addressable.

  • Clients would just need to have their ARP table entry for that IP updated when roaming

No. See above.

  • When a roam is detected a /32 for the client gets announced by babel

Yes.

Juul commented 8 years ago

Ok. I still don't understand what would be assigned to a client.

The clients would get a mesh-routable IP but what would they get as their subnet? If their default gateway is always the same IP then they must always be on the same subnet as the default gateway. So if you give clients a /26 subnet then you must put all clients on the same /26 (if you don't then their default gateway will not be on their subnet) and that means you run out of IPs fast and clients won't be able to talk to each-other since they will think all other clients are layer-2-accessible. If you give them a /32 then they won't be able to reach the default gateway since they will have no subnet. If you give all clients a /10 subnet then they will expect the entire mesh to be layer-2-local meaning that they won't even try to go through the default gateway in order to reach other nodes, they'll just sit there sending ARP requests for nodes that aren't on their LAN.
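
The three cases can be sketched as a simplified model of a client's forwarding decision. Real operating systems have more nuance (for example on-link host routes), so this is only an approximation; the function name and return strings are hypothetical, and 100.64.42.42 is the example shared gateway address mentioned earlier in the thread.

```python
import ipaddress


def next_hop_decision(client_iface, gateway, dest):
    """Simplified view of what a client does with a packet for dest."""
    iface = ipaddress.ip_interface(client_iface)
    gw = ipaddress.ip_address(gateway)
    dst = ipaddress.ip_address(dest)
    if dst in iface.network:
        # Destination looks local: the client ARPs for it directly.
        return "arp-for-destination"
    if gw in iface.network:
        # Destination is remote and the gateway is on-link.
        return "send-via-gateway"
    # Gateway is outside the client's own subnet, so it cannot be ARPed for.
    return "gateway-unreachable"


GATEWAY = "100.64.42.42"   # shared "fake gateway" address (example value)
REMOTE = "100.64.2.30"     # some other mesh client

per_node_26 = next_hop_decision("100.64.1.22/26", GATEWAY, REMOTE)
host_32 = next_hop_decision("100.64.1.22/32", GATEWAY, REMOTE)
whole_mesh_10 = next_hop_decision("100.64.1.22/10", GATEWAY, REMOTE)
```

With a /26 or a /32 the shared gateway falls outside the client's subnet, and with a /10 the client ARPs for every mesh address directly, which only works if something answers those ARP requests on the gateway's behalf.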

Am confuse.

papazoga commented 8 years ago

If you give all clients a /10 subnet then they will expect the entire mesh to be layer-2-local meaning that they won't even try to go through the default gateway in order to reach other nodes, they'll just sit there sending ARP requests for nodes that aren't on their LAN.

Not if you proxy ARP the /10 on br-open.

Juul commented 8 years ago

Not if you proxy ARP the /10 on br-open.

Haha! Weird but I don't see why it wouldn't work. What about ARP replies from clients that are actually on the LAN? I guess you can ensure that clients of the same AP can't see each-other on layer 2 at all, at least for wifi. Not possible for ethernet though but it's probably not a big issue. Biggest issue I can see is that for ethernet-connected nodes some operating systems might detect this as a spoofing attack and be like "oh noes this network is unsafe!" but there are probably workarounds.

papazoga commented 8 years ago

Thought about your DHCP question a bit more [which we discussed in person Tuesday].

You were worried that under the common-gateway scenario (gateway addresses are identical) DHCP leases wouldn't get renewed properly. This wouldn't be the case. The DHCP renewal is done via unicast to the server from which the lease was obtained. So it should be routed back to the old AP as usual, as long as the DHCP server operates with the AP's mesh-routable address, which is no problem.

I also had a quibble about the many-gateway scenario (each gateway administers a separate subnet). In that scenario, when a client roams, the new AP will not be able to properly ARP its client until a route (a /32) to it has been installed, which will require inter-node communication. This seems less robust to me; in the common-gateway scenario, the routing can be installed with no communication and only the removal of the stale route requires communication.

In fact - the roaming protocol can (as a first stab) be announce-only.

Juul commented 8 years ago

You were worried that under the common-gateway scenario (gateway addresses are identical) DHCP leases wouldn't get renewed properly. This wouldn't be the case. The DHCP renewal is done via unicast to the server from which the lease was obtained.

Are you sure that all common major operating systems do this? It seems to me that DHCP is one of those specs that are only loosely followed. If this really works then we could keep the DHCP servers on the home nodes as it is now and just ensure that they give out the initial leases with a from address which is their publicly mesh-routable address, and then all subsequent renews will be routed correctly to the original DHCP server, even after the roam. This would work in both the scenario you've proposed and the one I've proposed.

I also had a quibble about the many-gateway scenario (each gateway administers a separate subnet). In that scenario, when client roams, the new AP will not be able to properly ARP its client until a route (a /32) to it has been installed,

Why would that be the case? I see that the new AP would have to see the wifi re-auth request (roam) and then ask anyone nearby "hey who is this MAC and what was their previous default gateway IP", get a response and then start sending ARP responses pretending to be the old AP. Is that what you mean? Given that several commercial solutions actually establish a full tunnel on each roam I don't expect this to be a problem, but we'd need to test this to ensure that it doesn't introduce annoying delays.

jhpoelen commented 6 years ago

We settled on using different SSID per home node. Please re-open if this issue still needs to be considered.