krakenrf / heimdall_daq_fw

Coherent data acquisition signal processing chain for multichannel SDRs
GNU General Public License v3.0

Race condition in rtl_daq.c causing data corruption in high-load situations #3

Closed · nuena closed this issue 2 years ago

nuena commented 2 years ago

Short description

If rtl_daq.c is not able to push data downstream fast enough, reads and writes on the same memory section can occur simultaneously, leading to undefined behaviour and incoherent data between the receiver channels.

Steps to reproduce

Unknown, as I observed this on my own fork of the software, running on an RPi 4 and with added sample export (which has an impact on performance). However, this should always happen when performance is not high enough for rtl_daq.c to process data in real time. While I cannot say whether anybody else will encounter this issue, I still suggest implementing at least my workaround, to avoid silent undefined behaviour.

Explanation

rtl_daq.c maintains the circular buffer rtl_rec, which contains NUM_BUFF (8) slots to which rtlsdrCallback writes and from which the infinite while-loop in main reads. The index of the slot being written to is calculated in line 300:

int wr_buff_ind = rtl_rec->buff_ind % NUM_BUFF; // Calculate current buffer index in the circular buffer

and the index of the slot being read from is calculated in line 671:

rd_buff_ind = read_buff_ind % NUM_BUFF;

where rtl_rec->buff_ind and read_buff_ind are two counters that track how many packets have been read from the device and how many have been sent downstream, respectively.

There are checks (ll. 600-605) which aim to ensure that data is not read from the slot currently being written to, by testing whether rtl_rec->buff_ind <= read_buff_ind. However, if rtl_rec->buff_ind is larger than read_buff_ind by an integer multiple of NUM_BUFF, this check will not fail, even though the modulo operation maps both counters to the same slot in the circular buffer. In this case, rtlsdrCallback may write data while main is reading it.
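To make the aliasing concrete, here is a small standalone example (only buff_ind, read_buff_ind and NUM_BUFF mirror the names above; everything else is illustrative and not taken from rtl_daq.c):

#include <stdio.h>

#define NUM_BUFF 8

int main(void)
{
    /* The writer has filled 24 buffers, the reader has consumed 16:
     * the writer is ahead by exactly one full lap of the circular buffer. */
    unsigned int buff_ind      = 24;  /* packets read from the device */
    unsigned int read_buff_ind = 16;  /* packets sent downstream      */

    unsigned int wr_buff_ind = buff_ind % NUM_BUFF;       /* slot being written */
    unsigned int rd_buff_ind = read_buff_ind % NUM_BUFF;  /* slot being read    */

    /* The counter comparison suggests everything is fine... */
    printf("writer ahead of reader: %s\n", buff_ind > read_buff_ind ? "yes" : "no");
    /* ...yet both counters alias to the same slot. */
    printf("write slot: %u, read slot: %u\n", wr_buff_ind, rd_buff_ind);
    return 0;
}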

Proposed solution

I believe there is no "clean" solution to this issue that does not cause data loss. Increasing NUM_BUFF (potentially dynamically, to use most of the available resources?) will decrease the likelihood of this occurring, but at some point data must be dropped if some part of the toolchain cannot keep up.

Instead, I propose ensuring that no simultaneous read and write occurs and issuing a warning in this case: add a second check to the for-loop in lines 600-605:

if(rtl_rec->buff_ind % NUM_BUFF == read_buff_ind % NUM_BUFF) {
    // Writer and reader would touch the same slot: skip this round and warn
    log_warn("Likely race condition. Skipping data acquisition. RTL Buff index: %d, read_buff_ind: %d", rtl_rec->buff_ind, read_buff_ind);
    data_ready = 0;
    break;
}

This will wait until the buffer slots are unequal. It might also be preferable to "fast-forward" read_buff_ind (i.e. set it to rtl_rec->buff_ind?) so that the jump indicates the data loss to downstream stages. However, I do not know whether this might cause other issues.
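For reference, a sketch of that fast-forward variant, assuming it would replace the skip above (the counter names follow the text above; the surrounding loop structure is assumed):

if(rtl_rec->buff_ind % NUM_BUFF == read_buff_ind % NUM_BUFF) {
    log_warn("Likely race condition. Fast-forwarding read_buff_ind: %d -> %d", read_buff_ind, rtl_rec->buff_ind);
    // Jump the reader to the writer's position; the skipped buffers are lost,
    // which downstream stages can detect as a gap.
    read_buff_ind = rtl_rec->buff_ind;
    data_ready = 0;
    break;
}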

krakenrf commented 2 years ago

Just wanted to say thanks for your notes and potential fixes. Our DSP dev will be looking into this soon. We have also seen the same problems you describe under high load. So far this has mostly been handled by setting the DAQ chain to run as a high-priority task in Linux, but that is certainly not a great solution either.
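(For illustration only: one generic way to request a real-time priority from within a C program on Linux is sched_setscheduler; this is not necessarily how the DAQ chain is configured.)

#include <sched.h>
#include <stdio.h>

/* Ask the kernel for SCHED_FIFO with the given priority (1..99).
 * Requires root or CAP_SYS_NICE; returns 0 on success, -1 on failure. */
static int set_realtime_priority(int priority)
{
    struct sched_param param = { .sched_priority = priority };
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}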

petotamas commented 2 years ago

Thank you for the report and the suggestion. I have included your checks and added one more warning message.

In an earlier version of the code we used mutexes to exclude the possibility of a race condition, but it turned out that they unnecessarily slowed down the code and sometimes caused sample losses.

If the circular buffer is corrupted, the delay synchronizer module will likely catch this error, as it continuously checks the sample synchrony of the channels.

We are continuously trying to verify correct operation on the Raspberry Pi. I would always recommend running the unit tests on architectures that have not been verified, to reveal potential sample-loss problems. As you highlighted, increasing the number of buffers may simply solve the problem.