virtualabs / btlejack-firmware

Btlejack firmware for BBC Micro:Bit
MIT License
46 stars 28 forks source link

Maybe bug fix: TASKS_START might be invoked head of time #9

Open nviennot opened 5 years ago

nviennot commented 5 years ago

I don't have full confidence in this patch. I have not tested all functionalities, but it seems to work pretty well.

Problem: I was trying to capture all packets including zero length packets which are normally not sent. So I modified the firmware, and I noticed that I was getting a lot of corrupt packets. An easy way to manifest the corruption bug is to invoke the call to pLink->notifyNordicTapBlePacket() in main.cpp:838 twice. Normally, one should see the same packets twice coming when sniffing new connections with btlejack -c any -v. Oddly, we see many packets that are corrupted.

In turns out, the radio DMA controller overwrites the rx_buffer as we are copying it in notifyNordicTapBlePacket() . This should not happen as we do not trigger a NRF_RADIO->TASKS_START = 1 until after the UART has copied the bytes.

So it turned out, after while (NRF_RADIO->EVENTS_READY == 0); in radio_set_sniff(), we set NRF_RADIO->TASKS_START = 1, without reseting the NRF_RADIO->EVENTS_READY flag to 0. Eventually, the END event triggers and we land in the interrupt routine RADIO_IRQHandler(). But because we didn't put the NRF_RADIO->EVENTS_READY flag to 0 earlier, we test it to 1 and so we set TASKS_START = 1, which is problematic because now the receiver is allowed to overwrite the rx_buffer while we are using it.

Normally, this should only corrupt the first packet. However, I saw endless corrupted packets. So I am not confident that this is the right explanation. And so I'm not confident that this is the right fix.

The IRQ handler is set to only trigger on END events, not READY events. So we can only get into the IRQ handler by having been ready in the first place. I am quite confused. Maybe someone can shed some light on this.

virtualabs commented 5 years ago

When I started the development of this firmware, it was configured to send every packets received, and it worked pretty well. Maybe this issue resides in the UART management code, combined with your modification in btlejack (using select() instead of burning the CPU) ? Can this issue be caused by a UART buffer overwritten while RADIO_IRQHandler() is called ?

nviennot commented 5 years ago

it was configured to send every packets received, and it worked pretty well. Maybe this issue resides in the UART management code, combined with your modification in btlejack (using select() instead of burning the CPU) ? the select() mod is not the issue. The packets that are coming in have a valid checksum, but with corrupted content.

Can this issue be caused by a UART buffer overwritten while RADIO_IRQHandler() is called ? No, copying the rxbuffer to a temporary buffer tmprx, and then calling notifyNordicTapBlePacket() with the temporary buffer makes the bug go away. Similarly, setting NRF_RADIO->PACKETPTR to NULL during the calls to notifyNordicTapBlePacket() makes the bug go away.

The following commit exhibits the bug well: https://github.com/nviennot/btlejack-firmware/commit/f9368d170108d40a92db86a119ae6c9bd8bf77e9 It sets the rx_buffer to all 0x11. Then, it calls notifyNordicTapBlePacket() twice.

When I run with btlejack with the modified firmware to show the bug, from the master branch (no select code) here is what I get:

$ btlejack -c any
BtleJack version 1.3

[i] Detected sniffers:
 > Sniffer #0: version 1.3
LL Data: 05 22 6c d6 42 f6 73 7d 3c 0d a1 99 76 68 3d 26 39 61 f6 fb 12 02 11 00 24 00 00 00 f4 01 ff ff ff ff 1f 07
[i] Got CONNECT_REQ packet from 7d:73:f6:42:d6:6c to 68:76:99:a1:0d:3c
 |-- Access Address: 0x6139263d
 |-- CRC Init value: 0x12fbf6
 |-- Hop interval: 36
 |-- Hop increment: 7
 |-- Channel Map: 1fffffffff
 |-- Timeout: 5000 ms

LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 05 00
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 05 00
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 05 00
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 09 00
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 09 00
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11

We should only see the LL Data: 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 lines, not the 05 00 nor 09 00. These are not UART corruption problems as the checksums are correct (I checked).

virtualabs commented 5 years ago

Yes, definitely looks like rx_buffer is overwritten. I will have a deeper look at this issue, thanks !

virtualabs commented 4 years ago

A quick update on this one. I had some similar issues when I was trying to implement a new feature a few days ago, and malformed packets were sent as CONNECT_REQ packets while they were in fact not CONNECT_REQ packets and this caused the python client to fail on an assertion.

I spent some time investigating this issue and you were right: on some occasions TASKS_START is enabled BEFORE sending buffer to UART, and if a packet is received right after the radio has been enabled and put in RX mode and before a call to notifySomething() is made then the RX buffer is overwritten and bad data is sent to the host computer.

I fixed a similar bug that appeared when dealing with CONNECT_REQ packets, but I will definitely rewrite the radio module in order to avoid this and investigate the main code to find similar issues as well. It may takes time to fix all of this, but I guess it will improve the stability of the tool.

virtualabs commented 4 years ago

I think I found what happens. The data you got (05 00 and 09 00) while testing with a forced RX buffer (all bytes set to 0x11) are empty PDUs. These PDUs are sent by the master to the slave at the beginning of each connection event.

I think the timer in charge of hopping from one channel to another collides with the radio IRQ handler. I mean, the timer causes the radio to listen on another channel and it enables the TASKS_START at the end of the sync_hop_channel() function. If a packet is then received, it will overwrite the RX buffer and a call to notifyNordicTapBlePacket() will obviously send this new packet rather than the one that triggered the call to the radio IRQ handler. And in this specific case, this packet would be an empty PDU because it is the first packet sent by the master on the channel.

A possible fix could be to try to avoid sending packets to UART when it is too late considering the connection event, or to postpone them to the next connection event. It will be delayed, but then we should have enough time to send two packets instead of one for instance. Disabling the timer while we are sending a packet to UART may lead to packets loss and connection de-synchronization, so postponing is maybe the best option. It may require some time to find the correct way to implement this and to test it thoroughly.

But I'm pretty confident that is the reason why you capture some "corrupted" packets, and it may explain some instability I experienced while sniffing connections (connection lost sometimes with weird packets captured).

nviennot commented 4 years ago

Ah! It might be a good idea to have a FIFO queue of messages. the radio would push some messages to the queue, and the uart pops messages out. This way, it's hard to drop packets on the floor.