o11s / open80211s

mpath loop and air metric calculation #74

Closed zhejunli closed 5 years ago

zhejunli commented 6 years ago

Hello,

I am using open80211s on an ath9k chip. I have noticed two things:

  1. The spec mentions test frames, but I could not find anywhere that a test frame is actually sent. This causes a problem: the TX rate is not fresh when no traffic is going over a path, and when this outdated TX rate goes into the airtime link metric (ALM) calculation, the result is wrong.

  2. The SN counter does not seem to prevent mpath loops. A loop happened with 4 nodes on my test bench. This paper explains the loop situation: https://arxiv.org/pdf/1512.08891.pdf

Any ideas about these?

Thanks,

Jeff
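
For reference, the airtime link metric in the 802.11 spec has the form c_a = (O + B_t / r) / (1 - e_f), where O is a PHY-dependent overhead, B_t is the test frame size (8192 bits), r is the data rate and e_f is the frame error rate. A minimal Python sketch of that formula follows; the function name, the microsecond unit and the default overhead of 0 are assumptions made for illustration, and this is not the mac80211 code.

```python
def airtime_link_metric(rate_mbps: float, frame_error_rate: float,
                        overhead_us: float = 0.0,
                        test_frame_bits: int = 8192) -> float:
    """Airtime cost c_a = (O + B_t / r) / (1 - e_f), here in microseconds.

    overhead_us defaults to 0 only for illustration; the real overhead
    depends on the PHY in use.
    """
    if rate_mbps <= 0:
        raise ValueError("rate must be positive")
    if not 0.0 <= frame_error_rate < 1.0:
        raise ValueError("frame error rate must be in [0, 1)")
    # bits divided by Mbit/s gives microseconds
    tx_time_us = overhead_us + test_frame_bits / rate_mbps
    return tx_time_us / (1.0 - frame_error_rate)


# A stale 54 Mbit/s rate makes a link look much cheaper than it is
# if the link can really only sustain 6 Mbit/s.
stale = airtime_link_metric(54.0, 0.1)
actual = airtime_link_metric(6.0, 0.1)
print(stale < actual)
```

The rate and error rate both come from the rate controller, which is why an idle link feeds an outdated value into the metric.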

twpedersen commented 6 years ago

For 1), you'll have to add your own test frame mechanism to keep rate control fresh.

For 2): in g), shouldn't D have incremented its SN prior to the RERR? HWMP demands something similar. However, HWMP on-demand routing is also problematic because the PREQ forms a symmetric path back to the originator, while the optimal RF path may be highly asymmetrical. The only HWMP mode that is reliable is the proactive PREQ (hwmp_rootmode = 2), which works similarly to batman-adv's OGM.

zhejunli commented 6 years ago

@twpedersen : Thanks for the reply. For 1), I have added a mechanism to refresh the TX rate. For 2), we still prefer the on-demand approach over the proactive one, so we are still struggling with how to prevent the loop.

twpedersen commented 6 years ago

@zhejunli can you explain the loop in terms of HWMP?

zhejunli commented 6 years ago

@twpedersen :

I have 4 nodes, n[1..4], and the topology is PC1--->n1--->n2--->n3--->n4--->PC2. While pushing UDP data from PC1 to PC2 and adjusting the signal strength between nodes, the mpath table of n3 sometimes shows next_hop as n4 while n4's table shows next_hop as n3. The data packets just bounce back and forth between n3 and n4.

https://arxiv.org/pdf/1512.08891.pdf describes the same thing.
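
To make the symptom concrete, here is a small Python sketch that walks next-hop entries collected from each node's mpath table and reports a forwarding loop. The table layout (a dict keyed by node and destination) and the function name are hypothetical, chosen for illustration; they do not correspond to any mac80211 structure or tool output.

```python
def find_forwarding_loop(next_hop, source, destination):
    """Follow next hops from source toward destination.

    next_hop maps (node, destination) -> next-hop node.
    Returns the list of nodes that form a loop, or None if the walk
    reaches the destination or hits a node with no entry.
    """
    visited = []
    node = source
    while node != destination:
        if node in visited:
            # We came back to a node we already forwarded through.
            return visited[visited.index(node):]
        visited.append(node)
        node = next_hop.get((node, destination))
        if node is None:
            return None
    return None


# The situation described above: n3 points at n4 and n4 points back at n3.
tables = {
    ("n1", "PC2"): "n2",
    ("n2", "PC2"): "n3",
    ("n3", "PC2"): "n4",
    ("n4", "PC2"): "n3",
}
print(find_forwarding_loop(tables, "n1", "PC2"))
```

Running a check like this against periodic dumps of every node's table would at least timestamp when the loop forms.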

chunyeow commented 6 years ago

If the originator address matches the interface's own address, the PREQ frame is silently discarded and no routing info is updated.

The paper describes the self-entry case as follows: "Node S also receives the forwarded RREQS->X message from node D, and before silently discarding the message (since it is the originator of the RREQ message), updates its routing table to create an entry to node D."

HWMP does not allow the above to happen, so based on the results in table 2, it should be loop free.

When you see the data bouncing back and forth between n3 and n4, are you sure that links are established only between n1 and n2, n2 and n3, and n3 and n4?


Chun-Yeow

bcopeland commented 6 years ago

On Wed, Feb 21, 2018 at 01:09:56AM -0800, Chun-Yeow wrote:

> When you see the data bouncing back and forth between n3 and n4, are you sure that the path link only established between n1 and n2, n2 and n3, n3 and n4?

It would be interesting to see a pcap file of the hwmp and data frames when this happens also.

-- Bob Copeland %% https://bobcopeland.com/

zhejunli commented 6 years ago

@chunyeow Thanks for the reply. I can't guarantee that links are only established between n1-n2, n2-n3 and n3-n4 like a chain. I was actually changing the signal levels between nodes to simulate a random real-world situation. When signal levels are stable, the connection is good and stable. But when I change the "environment" to trigger an mpath change, the loop problem appears, though not always.

To be more specific, I first set up a chain like PC1-->n1-->n2-->n3-->n4-->PC2 and run iperf from PC1 to PC2. Meanwhile, I increase the attenuation between n1-n2 and n2-n3 and reduce the attenuation between n1-n3, hoping the mpath changes to PC1-->n1-->n3-->n4-->PC2. The loop then happens at random.

I have seen this happen in a 3-node system too. From the debug information, it looked like path information with a new SN was trying to update an outdated mpath entry that had a larger SN, and failed. The new information was the freshest, while the stored mpath entry was outdated but had the higher SN, so the new SN could not update the old mpath.

It is hard to reproduce.

Jeff
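
The rejected update can be illustrated with a simplified version of the HWMP acceptance rule as I understand it: new path information replaces a stored entry only if its SN is newer, or if the SN is equal and the metric is better. The Python sketch below assumes 32-bit wraparound comparison and uses hypothetical function names; the mac80211 code has further conditions (fixed paths, active flags, expiry) that are left out here.

```python
SN_MODULUS = 1 << 32


def sn_newer(a: int, b: int) -> bool:
    """True if sequence number a is newer than b under 32-bit wraparound."""
    diff = (a - b) % SN_MODULUS
    return 0 < diff < (1 << 31)


def should_update_path(stored_sn: int, stored_metric: int,
                       new_sn: int, new_metric: int) -> bool:
    """Simplified acceptance rule: newer SN wins, equal SN needs a better metric."""
    if sn_newer(new_sn, stored_sn):
        return True
    return new_sn == stored_sn and new_metric < stored_metric


# The case described above: the stored entry is stale but carries the
# larger SN, so fresher information with a smaller SN is rejected even
# though its metric is far better. The numbers are made up.
print(should_update_path(stored_sn=105, stored_metric=9000,
                         new_sn=100, new_metric=300))
```

Under this rule the stale entry survives until it expires or until a frame arrives with an SN past the stored one.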

bcopeland commented 6 years ago

On Wed, Feb 21, 2018 at 10:26:46AM -0800, zhejunli wrote:

> To be more specific, I set up a chain link like PC1-->n1-->n2-->n3-->n4-->PC2 first and run iperf from PC1 to PC2. Meanwhile, I increase the attenuation between n1-n2 and n2-n3 but reduce the attenuation between n1-n3, hoping the mpath change to PC1-->n1-->n3-->n4-->PC2. Randomly the loop happens.
>
> I have seen this happen in a 3 node system too. From the debug information it looked like a new path SN was trying to update the outdated mpath data structure with bigger SN and failed. Here the new path SN has the latest fresh mpath information while the old mpath information has old outdated mpath information but has a higher SN. So that the new SN couldn't update the old mpath.
>
> It is hard to reproduce.

This was with actual hardware, right?

This seems like something we could do in wmediumd - I can give it a try when I get some time.

zhejunli commented 6 years ago

@bcopeland Yes, this is with an ath9k chip. The code base is from OpenWrt CC, kernel v3.10.36.

I don't know how OpenWrt synchronizes with the latest open80211s code base. Maybe the code in OpenWrt is too old?

Thanks.

bcopeland commented 6 years ago

On Fri, Feb 23, 2018 at 04:35:58PM +0000, zhejunli wrote:

> @bcopeland Yes this is with ath9k chip. The code base is from OpenWrt CC, kernel v3.10.36.
>
> I don't know how OpenWrt synchronize with latest open80211s code base. Maybe the code in OpenWrt is too old?

I think (but don't quote me, haven't looked recently) they use backports built from wireless-testing for mac80211 and wireless drivers. So those parts should be fairly up-to-date.

-- Bob Copeland %% https://bobcopeland.com/

chunyeow commented 6 years ago

As Bob pointed out, testing with wmediumd may be the way forward since it is very hard to reproduce in your environment.

By the way, can you check whether the following patch is available: https://www.mail-archive.com/devel@lists.open80211s.org/msg03106.html

zhejunli commented 6 years ago

@chunyeow I confirm that patch is in place already.

zhejunli commented 6 years ago

The mpath loop has not happened again, so I am leaving it alone for now.

Another question, about the air metric: is it possible to get a fresh metric for a "potential" mpath that is not currently in use?

For example, 3 mesh nodes n1, n2 and n3 are all in range of each other. There is iperf traffic from n1 directly to n3, with n2 a bystander. In this case, the TX rate and PER of n1--->n3 are always fresh. However, because there is no traffic from n1 to n2, the air metric of n1--->n2 is never updated. I added a test frame mechanism so that n1 periodically sends test frames to n2 and n3 in order to keep the air metric of n1--->n2 fresh. But I found that rate control cannot produce a correct or reasonable TX rate for n1--->n2 from this small amount of test frame traffic alone.

During the iperf test, I increased the attenuation between n1 and n3, hoping to switch the mpath from n1--->n3 to n1--->n2--->n3. But this led to a wrong mpath switch decision, because the air metrics of n1--->n2 and n2--->n3 were not fresh: those links carried no traffic, or too little test frame traffic.

My question is: does the 802.11s protocol do "better mpath selection" dynamically? Per my understanding it does. An mpath becomes inactive periodically and a new mpath is formed, which is meant to keep the mpath fresh and optimal. But if the TX rate and PER of some potential mpaths were not updated properly before that point, the newly formed mpath will be wrong.

Thanks,

Jeff
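
The failure mode can be shown with a toy path selection in Python, assuming the path metric is the sum of the per-link airtime metrics and the lowest total wins. All numbers and names below are invented for illustration; they are not measurements from my setup.

```python
def path_metric(path, link_metric):
    """Sum the per-link metrics along a path given as a list of nodes."""
    return sum(link_metric[(a, b)] for a, b in zip(path, path[1:]))


def best_path(paths, link_metric):
    """Pick the candidate path with the lowest total metric."""
    return min(paths, key=lambda p: path_metric(p, link_metric))


candidates = [["n1", "n3"], ["n1", "n2", "n3"]]

# What the links could really deliver after attenuating n1-n3.
true_metric = {("n1", "n3"): 2000, ("n1", "n2"): 300, ("n2", "n3"): 300}

# What the nodes believe: the idle links still carry pessimistic,
# untrained estimates, so the relay path looks worse than it is.
stale_metric = {("n1", "n3"): 2000, ("n1", "n2"): 1500, ("n2", "n3"): 1500}

print(best_path(candidates, true_metric))   # relay path via n2
print(best_path(candidates, stale_metric))  # direct path is kept
```

With accurate link metrics the relay path wins; with the stale estimates the degraded direct path is kept.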

bcopeland commented 6 years ago

On Mon, Mar 26, 2018 at 10:39:32AM -0700, zhejunli wrote:

> My questions is: Does 802.11s protocol do "better mpath selection" dynamically? Per my understanding it does. A mpath gets inactive periodically and a new mpath is formed. This mechanism makes sure the mpath is fresh and optimal. But if some potential mpathes' TX rate and PER were not updated properly before this time point, this new formed mpath will be wrong.

It does, but airtime metric won't update significantly on the basis of just a few management frames that HWMP uses.

AFAIK this (estimating PER) is left to the implementation. You'll have to send a fair amount of data through the other nodes periodically to update the statistics tracked by the rate controller. Or come up with another estimator that doesn't rely on frame loss.

zhejunli commented 6 years ago

@bcopeland Thanks for the reply. I think the rate control algorithm does not care about the HWMP path selection packets; it only cares about data packets.

So I added a mechanism to send some test DATA frames as background traffic. This is supposed to train the rate control algorithm to keep correct rate information for a specific peer, but it does not seem to be enough. I still get a wrong rate for a "POTENTIAL" peer. It may be due to the rate control behavior.

zhejunli commented 6 years ago

It is the Minstrel rate control that underestimates the TX rate to a POTENTIAL mpath peer. The test frames are not enough for Minstrel to converge on the proper, real rate. It looks like the test frames generate too little traffic, and only heavier traffic lets Minstrel make the right decision.

bcopeland commented 6 years ago

On Tue, Apr 03, 2018 at 07:55:01AM -0700, zhejunli wrote:

> It is the Minstrel rate control that underestimates the TX rate to a POTENTIAL mpath peer. The test frames can not keep Minstrel to adjust to a proper, real rate. Looks like the test frames are too small traffic and only higher traffic can let Minstrel make a right decision.

Indeed. Also, IIRC, minstrel assumes every unsampled rate/link has 0% probability to begin with. So it would be most accurate to say most of the time, we don't know how good the potential link is. Perhaps integrating a confidence into the sampling selection algorithm could help this use case.
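
To illustrate why a trickle of test frames converges slowly, here is a generic exponentially weighted moving average in Python that starts from a 0% success estimate. The weight of 0.75 on the old value and the function name are assumptions for the sketch; Minstrel's actual update logic differs in its details.

```python
def ewma_updates(initial, samples, old_weight=0.75):
    """Return the running EWMA estimate after each sample.

    old_weight is the share of the previous estimate kept on every
    update; 0.75 is an assumed value for illustration.
    """
    estimate = initial
    history = []
    for sample in samples:
        estimate = old_weight * estimate + (1.0 - old_weight) * sample
        history.append(estimate)
    return history


# A perfect link (every probe succeeds) that starts out assumed dead.
# Each element is the estimate after one more update interval.
print(ewma_updates(0.0, [1.0, 1.0, 1.0, 1.0]))
```

Even on a perfect link the estimate is still well below 100% after four updates, and a rate that is rarely sampled gets few updates in the first place.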

zhejunli commented 6 years ago

Yes, that would help, but it involves changing Minstrel, which I don't want to touch for the time being.

Per my understanding, 802.11s claims that it can DYNAMICALLY choose a better mpath by periodically deactivating an existing mpath and generating a PREQ to search for a lower-metric mpath. But if the rate and PER status of a "POTENTIAL" mpath peer are outdated, the new path search will produce a false "optimal" mpath.

Is this a defect of the 802.11s protocol?

zhejunli commented 5 years ago

I have made some tweaks and things look good so far. Two major issues (I believe) remain:

  1. authsae: key exchange may fail because of an ath9k key cache bug. This is a topic for another post. I don't know whether using hostapd/supplicant for secure mesh would be better than authsae.
  2. Because of their broadcast nature, HWMP packets are easily lost in multi-hop scenarios, so a path is hard to establish.