myriadrf / LimeSuite

Driver and GUI for LMS7002M-based SDR platforms
https://myriadrf.org/projects/lime-suite/
Apache License 2.0

RX bursts from channel 0 are not aligned properly for MIMO configuration #141

Closed ccsh23 closed 6 years ago

ccsh23 commented 7 years ago

Dear all,

Long story short: I have found the following bug in LimeSDR, which blocks further progress on my project: when using MIMO burst-based transmission, the time alignment of ca. 20% of the bursts received by channel 0 is incorrect, while the time alignment of all bursts received via channel 1 is correct. For SISO, the time alignment of RX bursts captured via channel 0 (the only one available) is also correct. The problem is definitely on the RX side: I used another trusted receiver (USRP B200 mini) to carefully check the time separation and length of the bursts transmitted via the TX ports of the LimeSDR and did not observe anything strange (no clipping, no drifting of subsequent burst positions, etc.). The occurrence of the issue does not seem to depend on sampling rate, frequency, etc. It is also not a software error, nor a matter of the host machine or a particular LimeSDR board: I have observed the issue with different LimeSDR boards and different host computers, and it did not happen even once when I was using a different SDR, e.g. USRP devices. It should also be noted that during my tests the device was not reporting any TX or RX call errors (such as time errors). I was using a custom application based on the SoapySDR API.

Longer bug report and steps to reproduce the issue: as one of the MyriadRF forum members (Zero) reported a similar issue, I decided to check the TX & RX capabilities of the LimeSDR when it comes to burst-based MIMO transmission.

I was using a sampling rate of 16.666667 MHz (master clock rate of 133.333333 MHz and divider = 8), a center frequency of 1.455 GHz, an NCO frequency offset of -5 MHz for TX channel 0 (so the actual center frequency for channel 0 was 1.45 GHz) and an NCO frequency offset of +5 MHz for TX channel 1 (so the actual center frequency for channel 1 was 1.46 GHz). My bursts for both TX channels were 2 ms long (with 100 μs of silence before and after the actual burst in order to absorb transients, so the total length of the TX buffer was 2.2 ms) and repeated every 25 ms. As the modulating wave I used BPSK signals with a bit rate of 2 Mbit/s for TX channel 0 (which gives a main-lobe width of 4 MHz in RF) and a bit rate of 1 Mbit/s for TX channel 1 (which gives a main-lobe width of 2 MHz in RF). I used the standard omni-directional, tri-band antennas sold along with the LimeSDR for both TX channels. The following images of the transmitted signals in both the time and frequency domain show that the transmission looks fine (these were captured with the USRP B200 mini): spectrum_rx_usrp power_rx_usrp
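For reference, the derived rates and frequencies follow directly from the settings above; here is a quick arithmetic check of those numbers in plain Python (all values are taken from this post, nothing device-specific):

```python
# Arithmetic check of the configuration described above.
master_clock_hz = 133_333_333      # ~133.333333 MHz master clock
divider = 8
sample_rate_hz = master_clock_hz / divider  # ~16.666667 MHz

lo_hz = 1.455e9       # shared LO center frequency
nco_ch0_hz = -5e6     # NCO offset applied to TX channel 0
nco_ch1_hz = +5e6     # NCO offset applied to TX channel 1

rf_ch0_hz = lo_hz + nco_ch0_hz    # effective RF center of channel 0
rf_ch1_hz = lo_hz + nco_ch1_hz    # effective RF center of channel 1

# Burst timing: 2 ms of data plus a 100 us guard at each end.
t_data_s = 2e-3
t_guard_s = 100e-6
buffer_len_samples = round((t_data_s + 2 * t_guard_s) * sample_rate_hz)
```

At the rate above, one 2.2 ms TX buffer is roughly 36.7 thousand samples per channel.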

I applied the same settings and antennas for both RX channels 0 and 1. I was also using the same timestamps for TX and RX operations (or, to be more precise, the timestamps of bursts scheduled in the TX thread were always 2·burst_period = 2·25 = 50 ms later than the corresponding RX timestamps, in order to give the device some time to work with).
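The timestamp scheduling just described can be sketched as follows (my own hypothetical helper, not the actual app code; in the real app these times would be passed to the SoapySDR stream calls):

```python
# Sketch of the TX/RX burst timestamp scheduling described above.
BURST_PERIOD_NS = 25_000_000   # bursts repeat every 25 ms
TX_LEAD_BURSTS = 2             # TX timestamps trail RX by 2 periods = 50 ms

def schedule(n_bursts, t0_ns=0):
    """Return (rx_time_ns, tx_time_ns) pairs for n_bursts bursts."""
    pairs = []
    for k in range(n_bursts):
        rx_ns = t0_ns + k * BURST_PERIOD_NS
        tx_ns = rx_ns + TX_LEAD_BURSTS * BURST_PERIOD_NS
        pairs.append((rx_ns, tx_ns))
    return pairs
```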

Then I wanted to confirm that I am actually receiving my bursts correctly. While doing full BPSK demodulation and detection was not the purpose of this test, I decided to use a simpler approach and just check the power of the received bursts. So I calculated the power of the received samples and applied moving-average filtering to it, in order to reduce fluctuations. After that, I expected to see a significant increase in the power of the received signal during the 2 ms "active" part of the RX buffer and much lower power during the 100 μs "inactive" parts at the beginning and at the end of the RX buffer. Indeed, for all bursts received with RX channel 1 and the majority (ca. 80%) of bursts received with channel 0 it looked fine, as in the following picture: burst_correct

To be more precise, for these bursts I observed a constant offset of 94 samples (ca. 5.64 μs) between the beginning of the actual signal in the RX buffers and its position in the TX buffers, but I suppose such a small, constant offset is normal, as I have observed it earlier for other SDR devices as well. Moreover, it is always the same across different program runs, so it can easily be compensated. Here is part of my program output showing this (an offset of a few samples either way is normal when finding the beginning of a signal with such a simple power-detection-based method):

data_begin=96[Sa] (5.760000[us])
data_begin=94[Sa] (5.640000[us])
data_begin=95[Sa] (5.700000[us])
data_begin=94[Sa] (5.640000[us])
data_begin=95[Sa] (5.700000[us])
data_begin=94[Sa] (5.640000[us])
data_begin=96[Sa] (5.760000[us])
data_begin=94[Sa] (5.640000[us])
data_begin=96[Sa] (5.760000[us])
data_begin=94[Sa] (5.640000[us])
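The power-based burst-start detection used above can be sketched as a few lines of pure Python (a minimal version of the idea; the window length and relative threshold are my illustrative choices, not the exact values from the app):

```python
def find_data_begin(iq, window=64, rel_threshold=0.5):
    """Return the index where the smoothed power first crosses a threshold.

    iq: sequence of complex samples
    window: moving-average length in samples
    rel_threshold: threshold relative to the peak smoothed power
    """
    power = [abs(s) ** 2 for s in iq]
    # Moving average computed with a running sum (O(n) smoothing).
    smoothed = []
    acc = 0.0
    for i, p in enumerate(power):
        acc += p
        if i >= window:
            acc -= power[i - window]
        smoothed.append(acc / min(i + 1, window))
    thr = rel_threshold * max(smoothed)
    for i, v in enumerate(smoothed):
        if v >= thr:
            return i
    return None
```

With a burst starting at sample 100, the detector reports a beginning near sample 100 (biased forward by about half a window, which is why a constant small offset like the 94 samples above is expected and easy to compensate).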

But what bothers me a lot is the floating time alignment of the other ca. 20% of bursts received with RX channel 0. They seem to be captured too early, and the time of capture may differ by many thousands of samples (i.e. several hundred microseconds):

data_begin=9785[Sa] (587.100001[us])
data_begin=16414[Sa] (984.840002[us])
data_begin=16415[Sa] (984.900002[us])
data_begin=7234[Sa] (434.040001[us])
data_begin=16414[Sa] (984.840002[us])
data_begin=6215[Sa] (372.900001[us])
data_begin=3155[Sa] (189.300000[us])
data_begin=2135[Sa] (128.100000[us])
data_begin=16415[Sa] (984.900002[us])
data_begin=13864[Sa] (831.840002[us])

I have also confirmed these results by plotting the signal power for such cases, which showed that these bursts were indeed captured too early:

burst_short1 burst_short2 burst_short3

Sometimes it even happened that the RX buffer for channel 0 did not overlap the TX burst region at all:

burst_empty

To sum up, the issue is really serious, as ca. 20% of the transmission on channel 0 is affected, so I really hope for a fix; for now I am unable to move on with my project. An interesting fact: I tried changing the frequencies, bit rates, setup order etc. of channels 0 and 1, but it was always only RX channel 0 that was affected.

I hope that this problem will be addressed soon, and I offer my help with testing potential solutions.

ccsh23 commented 7 years ago

Looks like the issue is fixed in the newest version of LimeSuite :)

ccsh23 commented 7 years ago

Unfortunately, it looks like I called it too soon. The problem no longer occurred for the particular configuration I had used, but after changing it, I can still see the same issue. I used the same configuration as described in the first post, but increased T_data to 50 ms and T_burst to 250 ms. With this configuration it works okay for a little longer than one second, and then every subsequent burst received via channel 0 is captured too early.

Here is log from my app:

ch.0: data_begin=96[Sa] (5.760000[us]), data_end=833427[Sa] (50005.620125[us]), data_length=833331[Sa] (49.999860[ms]), N_power_drops=0
ch.1: data_begin=98[Sa] (5.880000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833328[Sa] (49.999680[ms]), N_power_drops=0
ch.0: data_begin=95[Sa] (5.700000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833331[Sa] (49.999860[ms]), N_power_drops=0
ch.1: data_begin=97[Sa] (5.820000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833329[Sa] (49.999740[ms]), N_power_drops=0
ch.0: data_begin=95[Sa] (5.700000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833331[Sa] (49.999860[ms]), N_power_drops=0
ch.1: data_begin=97[Sa] (5.820000[us]), data_end=833425[Sa] (50005.500125[us]), data_length=833328[Sa] (49.999680[ms]), N_power_drops=0
ch.0: data_begin=95[Sa] (5.700000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833331[Sa] (49.999860[ms]), N_power_drops=0
ch.1: data_begin=97[Sa] (5.820000[us]), data_end=833425[Sa] (50005.500125[us]), data_length=833328[Sa] (49.999680[ms]), N_power_drops=0
ch.0: data_begin=95[Sa] (5.700000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833331[Sa] (49.999860[ms]), N_power_drops=0
ch.1: data_begin=97[Sa] (5.820000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833329[Sa] (49.999740[ms]), N_power_drops=0
ch.0: data_begin=6894[Sa] (413.640001[us]), data_end=834998[Sa] (50099.880125[us]), data_length=828104[Sa] (49.686240[ms]), N_power_drops=0
ch.1: data_begin=98[Sa] (5.880000[us]), data_end=833425[Sa] (50005.500125[us]), data_length=833327[Sa] (49.999620[ms]), N_power_drops=0
ch.0: data_begin=146975[Sa] (8818.500022[us]), data_end=834998[Sa] (50099.880125[us]), data_length=688023[Sa] (41.281380[ms]), N_power_drops=0
ch.1: data_begin=97[Sa] (5.820000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833329[Sa] (49.999740[ms]), N_power_drops=0
ch.0: data_begin=163296[Sa] (9797.760024[us]), data_end=834998[Sa] (50099.880125[us]), data_length=671702[Sa] (40.302120[ms]), N_power_drops=0
ch.1: data_begin=98[Sa] (5.880000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833328[Sa] (49.999680[ms]), N_power_drops=0
ch.0: data_begin=163295[Sa] (9797.700024[us]), data_end=834998[Sa] (50099.880125[us]), data_length=671703[Sa] (40.302180[ms]), N_power_drops=0
ch.1: data_begin=97[Sa] (5.820000[us]), data_end=833425[Sa] (50005.500125[us]), data_length=833328[Sa] (49.999680[ms]), N_power_drops=0
ch.0: data_begin=163295[Sa] (9797.700024[us]), data_end=834998[Sa] (50099.880125[us]), data_length=671703[Sa] (40.302180[ms]), N_power_drops=0
ch.1: data_begin=97[Sa] (5.820000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833329[Sa] (49.999740[ms]), N_power_drops=0
ch.0: data_begin=163295[Sa] (9797.700024[us]), data_end=834998[Sa] (50099.880125[us]), data_length=671703[Sa] (40.302180[ms]), N_power_drops=0
ch.1: data_begin=98[Sa] (5.880000[us]), data_end=833425[Sa] (50005.500125[us]), data_length=833327[Sa] (49.999620[ms]), N_power_drops=0
ch.0: data_begin=163295[Sa] (9797.700024[us]), data_end=834998[Sa] (50099.880125[us]), data_length=671703[Sa] (40.302180[ms]), N_power_drops=0
ch.1: data_begin=97[Sa] (5.820000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833329[Sa] (49.999740[ms]), N_power_drops=0
ch.0: data_begin=163296[Sa] (9797.760024[us]), data_end=834998[Sa] (50099.880125[us]), data_length=671702[Sa] (40.302120[ms]), N_power_drops=0
ch.1: data_begin=98[Sa] (5.880000[us]), data_end=833427[Sa] (50005.620125[us]), data_length=833329[Sa] (49.999740[ms]), N_power_drops=0
ch.0: data_begin=163295[Sa] (9797.700024[us]), data_end=834998[Sa] (50099.880125[us]), data_length=671703[Sa] (40.302180[ms]), N_power_drops=0
ch.1: data_begin=97[Sa] (5.820000[us]), data_end=833425[Sa] (50005.500125[us]), data_length=833328[Sa] (49.999680[ms]), N_power_drops=0

And here are the pictures confirming that the issue is still there: burst_ch0_50ms_250ms burst_ch1_50ms_250ms

I really hope for a fix, as for the mentioned configuration basically all of the transmission received on channel 0 (excluding several bursts at the beginning) has incorrect timing.

guruofquality commented 7 years ago

@ccsh23 Can you run this branch? I added some prints in the readStream call for when the time/return size mismatches (rx_mimo_prints branch). https://github.com/myriadrf/LimeSuite/commit/52cf938e9347aa5717279bf55efbd71cd489cdc8 Just curious if any of the prints are firing off in your tests.

I'm not sure what else it could be, but there is a place where we could plainly have a race issue, because the rx samples come from two different fifos and we don't make sure the exact same amount is read out of each fifo.
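The race described here can be illustrated with a toy model (my own pure-Python illustration of the failure mode, not LimeSuite code): if the reader takes whatever each per-channel FIFO happens to hold at the moment of the call, the two channels can return different amounts and permanently drift apart.

```python
from collections import deque

def naive_read(fifos, want):
    """Read up to `want` samples from each channel FIFO independently.

    If one FIFO momentarily holds fewer samples than the other, the two
    channels return different amounts, and every later read of the
    shorter channel is skewed relative to the other channel.
    """
    return [[fifo.popleft() for _ in range(min(want, len(fifo)))]
            for fifo in fifos]

# Channel 0's FIFO is 100 samples behind channel 1's at read time.
ch0 = deque(range(900))     # samples 0..899 available
ch1 = deque(range(1000))    # samples 0..999 available
out0, out1 = naive_read([ch0, ch1], 1000)
# out0 holds 900 samples while out1 holds 1000: from now on the two
# channels are offset by 100 samples, matching the misaligned bursts.
```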

ccsh23 commented 7 years ago

Hi Josh,

thanks for the response! I guess you got it right, as I can see these prints:

2017-10-16 16:46:43.977738 INFO -     It's okay if you don't see any further information. Press Ctrl+C to finish.
[INFO] L
mismatch times42503331, 42581430
mismatch times43339996, 43418095
mismatch times44176661, 44254760
mismatch times45013326, 45091425
mismatch times45849991, 45928090

2017-10-16 16:46:46.977939 INFO -     ch.0: P_max=0.073110, P_min=0.000031, P_avg=0.063625, P_thr=0.036570, P_max/P_min=33.777090[dB], data_begin=78194[Sa] (4691.640012[us]), data_end=834998[Sa] (50099.880125[us]), data_length=756804[Sa] (45.408240[ms]), N_power_drops=0
mismatch times47167350, 47346870
mismatch times48004015, 48183535
mismatch times48840680, 49020200
mismatch times49677345, 49856865
mismatch times50514010, 50693530

2017-10-16 16:46:47.250939 INFO -     ch.0: P_max=0.073145, P_min=0.000030, P_avg=0.055000, P_thr=0.036587, P_max/P_min=33.914026[dB], data_begin=179615[Sa] (10776.900027[us]), data_end=834998[Sa] (50099.880125[us]), data_length=655383[Sa] (39.322980[ms]), N_power_drops=0
mismatch times51688500, 51867510
mismatch times52525165, 52704175
mismatch times53361830, 53540840
mismatch times54198495, 54377505

2017-10-16 16:46:47.512713 INFO -     ch.0: P_max=0.073063, P_min=0.000029, P_avg=0.055027, P_thr=0.036546, P_max/P_min=34.067865[dB], data_begin=179105[Sa] (10746.300027[us]), data_end=834998[Sa] (50099.880125[us]), data_length=655893[Sa] (39.353580[ms]), N_power_drops=0
mismatch times56061750, 56241270
mismatch times56898415, 57077935
mismatch times57735080, 57914600
mismatch times58571745, 58751265

2017-10-16 16:46:47.769788 INFO -     ch.0: P_max=0.072925, P_min=0.000030, P_avg=0.054898, P_thr=0.036477, P_max/P_min=33.880794[dB], data_begin=179616[Sa] (10776.960027[us]), data_end=834998[Sa] (50099.880125[us]), data_length=655382[Sa] (39.322920[ms]), N_power_drops=0
mismatch times60337590, 60517110
mismatch times61174255, 61353775
mismatch times62010920, 62190440
mismatch times62847585, 63027105

2017-10-16 16:46:48.024039 INFO -     ch.0: P_max=0.072979, P_min=0.000032, P_avg=0.055026, P_thr=0.036506, P_max/P_min=33.591815[dB], data_begin=179615[Sa] (10776.900027[us]), data_end=834998[Sa] (50099.880125[us]), data_length=655383[Sa] (39.322980[ms]), N_power_drops=0
mismatch times64580790, 64760310
mismatch times65417455, 65596975
mismatch times66254120, 66433640
mismatch times67090785, 67270305

2017-10-16 16:46:48.276821 INFO -     ch.0: P_max=0.073194, P_min=0.000026, P_avg=0.055048, P_thr=0.036610, P_max/P_min=34.412390[dB], data_begin=179616[Sa] (10776.960027[us]), data_end=834998[Sa] (50099.880125[us]), data_length=655382[Sa] (39.322920[ms]), N_power_drops=0
guruofquality commented 7 years ago

@ccsh23 I think that clued me into what's happening. It looks like a side effect of moving to multiple fifos for mimo. And probably also a result of timeouts and scheduling...

So assuming that was the cause, please try this branch now: soapylms_mimo_align_fix. It makes an extra effort to keep the reads and writes matched across channels and will, as in the other branch, print if this condition isn't met for some reason. Curious how that goes for your application. Thanks!
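The idea behind keeping the reads matched can be sketched like this (my own pure-Python sketch reconstructed from this thread's description, not the actual LimeSuite change): cap every read at the smallest per-channel availability, so all channels always advance by the same amount.

```python
from collections import deque

def aligned_read(fifos, want):
    """Read the same number of samples from every channel FIFO.

    The read size is capped at the smallest per-channel availability,
    so the channels can never drift apart; a channel that has more
    samples queued simply keeps the surplus for the next read.
    """
    n = min([want] + [len(f) for f in fifos])
    return [[f.popleft() for _ in range(n)] for f in fifos]

# Same starting state as the naive example: ch0 is 100 samples behind.
ch0 = deque(range(900))
ch1 = deque(range(1000))
out0, out1 = aligned_read([ch0, ch1], 1000)
# Both channels return 900 samples; ch1's remaining 100 samples stay
# queued, so the next read of both channels starts at sample 900.
```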

ccsh23 commented 7 years ago

Okay, so now with the same settings everything is okay for two calls; for all further calls I get no error code, but the number of received samples is 0 and I get a multi-channel stream alignment failure:

2017-10-17 18:04:23.406748 INFO -     It's okay if you don't see any further information. Press Ctrl+C to finish.
[INFO] L

2017-10-17 18:04:25.835594 INFO -     ch.0: P_max=0.041209, P_min=0.000006, P_avg=0.040065, P_thr=0.020607, P_max/P_min=38.643436[dB], data_begin=96[Sa] (5.760000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833330[Sa] (49.999800[ms]), N_power_drops=0
2017-10-17 18:04:25.928152 INFO -     ch.1: P_max=0.053520, P_min=0.000004, P_avg=0.051503, P_thr=0.026762, P_max/P_min=41.462127[dB], data_begin=97[Sa] (5.820000[us]), data_end=833427[Sa] (50005.620125[us]), data_length=833330[Sa] (49.999800[ms]), N_power_drops=0
2017-10-17 18:04:26.118725 INFO -     ch.0: P_max=0.041252, P_min=0.000006, P_avg=0.040050, P_thr=0.020629, P_max/P_min=38.683052[dB], data_begin=95[Sa] (5.700000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833331[Sa] (49.999860[ms]), N_power_drops=0
2017-10-17 18:04:26.211627 INFO -     ch.1: P_max=0.053560, P_min=0.000004, P_avg=0.051526, P_thr=0.026782, P_max/P_min=41.379556[dB], data_begin=96[Sa] (5.760000[us]), data_end=833426[Sa] (50005.560125[us]), data_length=833330[Sa] (49.999800[ms]), N_power_drops=0
[ERROR] Multi-channel stream alignment failed!
2017-10-17 18:04:26.231866 WARNING -  sdr: [RX]  current_time=2791791034, burst_time: 2749999987, read_code: NO_ERROR, no_of_received_samples: 0/836665
[ERROR] Multi-channel stream alignment failed!
2017-10-17 18:04:26.252084 WARNING -  sdr: [RX]  current_time=2812354234, burst_time: 2999999985, read_code: NO_ERROR, no_of_received_samples: 0/836665
[ERROR] Multi-channel stream alignment failed!
2017-10-17 18:04:26.272589 WARNING -  sdr: [RX]  current_time=2832917435, burst_time: 3249999983, read_code: NO_ERROR, no_of_received_samples: 0/836665
[ERROR] Multi-channel stream alignment failed!
2017-10-17 18:04:26.293072 WARNING -  sdr: [RX]  current_time=2853480635, burst_time: 3499999981, read_code: NO_ERROR, no_of_received_samples: 0/836665

To be honest, I don't know how to interpret the zero number of received samples (this can also occur in SISO applications; it is not tied to this particular issue). My guess is that this is in fact caused by a late command, so the device did not manage to provide the requested samples, even though at the moment it received the request the hardware timestamp was still smaller than the request timestamp (and that is why TIME ERROR is not reported)? If that is the case, then I think this branch is better in the sense that the issue is at least reported back in readStream (via the 0 number of received samples). But since the first two calls are correct and all their samples are received successfully, I am not sure whether that is the case...

guruofquality commented 7 years ago

@ccsh23 are you able to share something that duplicates this? Like a code snippet or source file, plus the particular timeout, rate and buffer sizes? If I can duplicate the problem I can probably get to the bottom of this pretty quickly.

ccsh23 commented 7 years ago

Sure, I just sent the code; please check your email.

guruofquality commented 7 years ago

@ccsh23 I think I have this resolved. There was a similar issue in the rx thread that was writing to the loop. I made sure to match the write sizes when the loop was operating on multiple channels. The change is also pushed to the same branch.

ccsh23 commented 7 years ago

@guruofquality, thank you for your effort. I can confirm that I no longer get misalignment errors; in fact, I got no errors at all for both configurations mentioned above. I guess it would be best to push this fix to the master branch, or to try one more time to convince IgnasJ that this should actually be fixed at a lower level (I agree that that sounds like a better idea).

Anyway, thank you so much for finding the cause and fixing it; I know it was not so straightforward :)