jks-prv / Beagle_SDR_GPS

KiwiSDR: BeagleBone web-accessible shortwave receiver and software-defined GPS
http://kiwisdr.com

Samples dropped periodically #97

Closed dev-zzo closed 5 years ago

dev-zzo commented 7 years ago

I'm experimenting with various non-audio modes using KiwiSDR, and currently the focus is on receiving radiofax. It works fantastically except for one minor issue: once in a while, samples are dropped at the KiwiSDR (I'm receiving every packet, so there are no drops in transit), which results in data/sync loss.

Example results from radiofax reception: https://yadi.sk/d/2LwxdZiz3HuKTi

Jags from sample loss are clearly visible. Apparently, the loss is not large enough to be noticeable for the ear, but digital modes will be affected heavily.

jks-prv commented 7 years ago

I assume you're intercepting the audio being generated by the browser? Like with a VAC or something?

The code currently does not compensate for the slight difference in audio sample rates produced by the Kiwi and the sound hardware on the computer running the browser. This is less of a problem than it used to be in the old days when PCs used sound cards with crappy crystals. But it's probably still an issue for a time-sensitive application like this.

For example if the Kiwi is producing audio samples at 44099.991 Hz and the computer sound hardware is consuming them at 44100.001 Hz the audio buffer in javascript of the browser will eventually under-run. There is an audio under-run counter in the status panel at lower left when you connect (only shows when it's non-zero). The thing that says "Qlen" is the audio queue length. If it's slowly dropping towards zero or increasing into the double digits you are heading into an over/under-run condition. There are more detailed counters on the status tab of the admin page (where it says "Errors: dropped, underruns, sequence").
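To put rough numbers on the drift in that example, here is a minimal Python sketch. The function name and the 100 ms queue depth are illustrative assumptions, not values taken from the Kiwi code.

```python
def seconds_until_underrun(producer_hz, consumer_hz, queued_samples):
    """Time for a queue to empty when the consumer clock runs faster."""
    drain_per_second = consumer_hz - producer_hz
    if drain_per_second <= 0:
        # Queue holds steady or grows, so the risk is an over-run instead.
        return float("inf")
    return queued_samples / drain_per_second


# Assumed queue depth: about 100 ms of audio at 44.1 kHz.
queued = 4410
seconds = seconds_until_underrun(44099.991, 44100.001, queued)
print("under-run after roughly %.1f days" % (seconds / 86400.0))
```

With a mismatch this small the queue loses only one sample every 100 seconds, so a drift of this size on its own would take a long time to show up; larger clock errors shorten the time proportionally.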

The solution is to do frequency correction ("pitch bending" as some would know it) when the buffer is getting out of range. Audio IQ is sent over the network at a reduced rate (and compressed) so there is an interpolator to do the conversion. That's an excellent place to be doing the frequency correction. It involves adding a digital PLL and we haven't figured that out yet (WebSDR does it this way).
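As a rough illustration of the idea (not the Kiwi's or WebSDR's actual implementation, and much simpler than a real digital PLL), a proportional controller could nudge the interpolator's step size according to the queue length. All names, gains and limits below are hypothetical.

```python
def correction_ratio(queue_len, target_len, gain=1e-4, limit=1e-3):
    """Input samples to consume per output sample (1.0 means no correction).

    A queue longer than the target is drained slightly faster (ratio > 1),
    a shorter one slightly slower (ratio < 1). The adjustment is clamped so
    the pitch change stays small.
    """
    error = queue_len - target_len
    adjust = max(-limit, min(limit, gain * error))
    return 1.0 + adjust


def resample_linear(samples, ratio):
    """Linear interpolation, advancing `ratio` input samples per output sample."""
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
        pos += ratio
    return out


# Example: the queue is 20 packets long but we would like it near 10.
ratio = correction_ratio(queue_len=20, target_len=10)
block = resample_linear([0.0, 1.0, 2.0, 3.0, 4.0], ratio)
```

Linear interpolation is used here only to keep the sketch short; a production interpolator would need proper filtering.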

The other possibility is that these are dropouts caused by realtime response problems in javascript running on the browser. That's particularly bad because there isn't much we can do about that. The real solution there is to become a consumer of the network audio samples (in parallel with the browser) instead of using the browser audio.

As if this weren't bad enough there is yet another problem. What if you have dropouts caused by stalls on the Internet? This is bad because the data may eventually be delivered when the stall clears. But by then the audio has under-run in the browser and any newly-arriving data is now out of sync with realtime. Now it's actually possible to detect this and get the samples "back on track". And this might be the real issue you are seeing. But again this is something we haven't implemented yet. It requires time-stamping the samples or something similar. There are folks who want IQ data (as opposed to the mono audio) timestamped with GPS-level timing, so these two issues probably need a common solution.

dev-zzo commented 7 years ago

Hi John, thanks for your extensive and quick response. Unfortunately, I hacked together a WebSocket client in Python to act in place of the browser and avoid all the nasty audio rate conversion business and get nice raw sample packets instead. This seems to protect from all the issues you mentioned above, at least in theory, as the processing does not have to be realtime any more and is not affected by stalls or other breakage of realtime constraints. I see no packets missing -- sequence numbers are correct, and packets come in the right order. Of course, there is a chance that my code is broken somewhere and drops a packet or two once in a while, but the same behaviour has been confirmed through the expected use case with the browser and Sorcerer to decode stuff.
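For reference, the kind of sequence check described here might look like the sketch below. The packet layout is not described in this thread, so the function and its arguments are hypothetical.

```python
def find_sequence_gaps(seqs, modulus=None):
    """Return (previous, current) pairs where consecutive numbers do not step by 1.

    `modulus` can be given if the counter wraps around (an assumption; the
    real counter width is not stated here).
    """
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        step = cur - prev
        if modulus:
            step %= modulus
        if step != 1:
            gaps.append((prev, cur))
    return gaps


print(find_sequence_gaps([1, 2, 3, 5, 6]))
```

A clean result only shows that nothing was lost or reordered after the packets were numbered; samples dropped before that point would not be visible to this check.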

Is there any way we can debug this further?

dev-zzo commented 7 years ago

I've browsed the code base a bit and while I am no expert on how KiwiSDR actually works, this caught my eye:

https://github.com/jks-prv/Beagle_SDR_GPS/blob/master/rx/rx_sound.cpp#L153

Could it be that this correction results in a silent discard of one packet of data, for example?

jks-prv commented 7 years ago

Very interesting that you've already written some code to take the browser, javascript interpreter and client-side of the Kiwi code out of the equation. That helps narrow a few things down.

I'd really like to know if during a fax with shifted scan-lines any audio under or over-runs are occurring. Just look for the red messages in the status panel at lower left, or non-zero counts on the status tab of the admin page.

The second thing you mentioned, the correction of the ADC clock by GPS timing solutions, may be responsible for another problem that has been reported. Whether it could also cause the problem you are seeing I don't know. I'll include a picture below and describe.

What's been reported is that occasionally there are big frequency steps in the WSPR waterfall. Now this is very odd because any GPS correction should be relatively small. Not the 50 to 100 Hz shown in the pictures (except for the very first correction made if the ADC XO is way off due to a large temperature difference from 20 degC). The other suspicious thing about the picture is the shifting seems to be related to the 2 minute WSPR decode boundaries. This tells me it's probably WSPR specific and not really the ADC/GPS. Like a bug in the WSPR audio sample rate converter.

[Image: mobilbilder 766]

dev-zzo commented 7 years ago

I've also noticed (using multiple receivers from sdr.hu) that this behaviour depends on both the receiver and the frequency. In one case, the breakage looks almost deterministic (a small step occurs within X seconds); in another case, a major clobbering also occurs periodically, with minor losses in between. This seems to rule out problems in my own code, as I'd expect errors either to be random or to manifest in a similar manner across all receivers out there.

jks-prv commented 7 years ago

Okay, I need to do some detailed investigation. When I can I've been working on a timecode decoder extension for some of the LF time signal stations (WWVB, TDF, etc.) These of course would be very sensitive to any sample drops. The signal processing is done in javascript on the browser, so the audio sample stream is used after transmission over the Internet. This is essentially the same data you'd get over a VAC using external software.

dev-zzo commented 7 years ago

Another observation: if tuned to the same frequency with the same receiver, the losses occur at the exact same time.

jks-prv commented 7 years ago

That's good to know. A repeatable problem like that usually means a software bug and not just random Internet delays. Maybe I have some sort of pipeline startup bug in the audio buffering or something.

dev-zzo commented 7 years ago

I tried measuring the actual drop; it seems to be on the order of 6 to 20 samples at a time. That's not much, but it's enough to break the image. I imagine there is no way one would actually hear that 20 samples are missing from the audio stream once in a while. :-)
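A quick way to see why so few samples matter for fax but not for listening is to express the drop as a fraction of one scan line. The 12 kHz sample rate and 120 lines per minute used below are assumptions for illustration; substitute the real values for a given reception.

```python
def line_shift_fraction(dropped_samples, sample_rate_hz, lines_per_minute):
    """Fraction of one scan line by which the image shifts after a drop."""
    samples_per_line = sample_rate_hz * 60.0 / lines_per_minute
    return dropped_samples / samples_per_line


for dropped in (6, 20):
    shift = line_shift_fraction(dropped, 12000, 120)  # assumed rate and LPM
    print("%2d samples -> %.2f%% of a line" % (dropped, shift * 100.0))
```

Unless the decoder re-synchronises, every later line stays displaced by that amount, which is what makes the step visible in the image.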

jks-prv commented 7 years ago

I've been starting to experiment with a fax decoder extension: http://valentfx.com/forums/#/discussion/715/noaa-fax

It runs on the server-side of the Kiwi (Beagle) and uses the audio sample stream that is available after the passband filtering, demodulators and AGC. Just before the samples are compressed and sent over the net to the client. It's a little early to fully judge but I think the sampling at that particular point looks okay. No obvious drops. The usual image skew. That could be because the Kiwi I'm using isn't doing GPS clock correction. Or because the fax code I'm using isn't doing per-line sync properly (I don't fully understand how the code works yet).

After the extension is working better I can move it to the client side and feed it samples from different points to try and narrow down where the drops are occurring. Like in the decompression routine or the interpolator.

dev-zzo commented 7 years ago

Great job on the extension! As for synchronisation etc, you might glean an idea or two from my code here: https://github.com/dev-zzo/kiwiclient/blob/master/kiwifax.py Sorry it looks so hacky. :-)

jks-prv commented 7 years ago

Hacky? Are you kidding? That's a beautiful piece of work! It's obvious now I need to add a web socket API to get the other stream types, like what is available to the server-side of extensions (the IQ stream in particular).

I didn't quite realize you had written fax code to go along with your client. I'm going to study that and try to understand why my code, which is based on another github project, doesn't work so well.

BTW, this v1.85 clean-up change breaks the API slightly ("AUD" -> "SND"). I'll try and send you a pull request (I'm still learning the fine points of git). https://github.com/jks-prv/Beagle_SDR_GPS/commit/d222536eea8e18f8a906077334f53144dae2cc57

dev-zzo commented 7 years ago

Thanks John, the API fix was rather trivial and has been applied. :-)

I am really looking forward to the IQ stream being available to client-side code!

dev-zzo commented 7 years ago

I have tested with yet another public KiwiSDR device; this time the breakage patterns were very different and unexpected. I've uploaded the samples for you at https://yadi.sk/d/2LwxdZiz3HuKTi as before.

jks-prv commented 7 years ago

In kiwifax.py there is a routine:

    def real2complex(x):
        return [ complex(x[i+0]-x[i+2], x[i+1]-x[i+3]) for i in xrange(0, len(x), 4) ]

With the calling code:

    X = real2complex(samples)
    sample_rate = self._sample_rate / 4

I've never seen that before. Does that really convert real samples into IQ with a downsampling by 4? Doesn't it really require an LPF at the sample_rate/4 afterwards to prevent aliasing? Like you would have to do if you mixed (complex multiplied) the real signal by the complex carrier sample_rate/4? Sorry, I'm fairly new to a lot of this DSP stuff.

jks-prv commented 7 years ago

I'm still trying to get my fax code to work better before chasing the shifting problem. But I have seen a few cases now where shifting occurs even when using audio samples straight from the server with code running on the Beagle (no network, no javascript on client). So this may well be a lower-level problem, which is good because that is something I have more control over.

dev-zzo commented 7 years ago

Regarding the real2complex() routine, yes, typically one would need an LPF to prevent aliasing, inserted before downsampling. But that we already have, in the form of audio filters on the server's side, so no biggie. Of course, there is a "correct" way to convert real samples to complex, but nothing beats this quick and dirty routine in terms of speed. :-D The idea is the same as the one behind the Tayloe mixer.
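To make the behaviour of the shortcut concrete, here is a self-contained Python 3 sketch that feeds it a test tone slightly above fS/4 and compares the wanted output tone with its image. The 12 kHz rate, the 500 Hz offset and the `tone_power` helper are assumptions for illustration; the routine is a port of the one quoted from kiwifax.py, with the loop bound adjusted so that a trailing partial group of samples is ignored rather than raising an error.

```python
import cmath
import math


def real2complex(x):
    """Python 3 port of the quoted routine: 4 real samples in, 1 complex out."""
    return [complex(x[i] - x[i + 2], x[i + 1] - x[i + 3])
            for i in range(0, len(x) - 3, 4)]


def tone_power(z, rate_hz, freq_hz):
    """Power of the complex sequence z at a single frequency (one-bin DFT)."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq_hz * k / rate_hz)
              for k, v in enumerate(z))
    return abs(acc / len(z)) ** 2


fs = 12000.0      # assumed input sample rate
offset = 500.0    # test tone sits this far above fs/4
samples = [math.cos(2 * math.pi * (fs / 4 + offset) * n / fs)
           for n in range(2400)]

iq = real2complex(samples)
rate = fs / 4     # output rate after the implicit decimation by 4

# The tone appears at +offset or -offset depending on sign convention;
# the weaker of the two is the image.
p_pos = tone_power(iq, rate, offset)
p_neg = tone_power(iq, rate, -offset)
image_db = 10 * math.log10(max(p_pos, p_neg) / min(p_pos, p_neg))
print("image is %.1f dB below the wanted tone" % image_db)
```

Each output sample is the sum of four input samples multiplied by a quarter-rate complex carrier, so the only filtering is a 4-tap average. That attenuates the image but does not remove it.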

I was wondering whether it could be done in a better way. It should be doable so that the data rate is reduced by 2 rather than by 4, but I don't have working PoC code yet.

dev-zzo commented 7 years ago

Naturally, I'll be happy to help you with your fax code -- please feel free to ask anything.

jks-prv commented 7 years ago
[Image: screen shot 2017-05-30 at 6 58 32 am]

Thanks for the explanation. It turns out I was confused about a number of signal processing concepts. Things I understood better many months ago but had since forgotten the fine details of.

Interestingly I believe I have solved the WSPR image problem because of all this. The WSPR code needs an input signal at a sampling rate of 375 Hz. Long ago I made the audio bandwidth of the Kiwi 12 kHz so a simple power-of-two decimation by 32 could be used to get to 375 Hz. No filtering besides the main passband filter. Worked fine. Later I changed the audio b/w to 9600 Hz to be compatible with the S4285 decoder I was experimenting with. I changed the WSPR decimator to a simple "drop sampling" fractional one: 9600 / 25.6 = 375. Big, big mistake. You can't do fractional decimation that way. I wrote some signal generator code plus decimation and fed it into Baudline to observe the result. What a mess. I'm surprised WSPR worked as well as it did. I realize now you have to do proper resampling by rational (integer) interpolation up followed by similar decimation down with the appropriate filtering. So in this case 9600 * 5 = 48k, / 128 = 375, plus filtering. But that is too much additional computation for the WSPR extension. It was easier to simply go back to an audio b/w of 12 kHz.
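The arithmetic above can be checked with a few lines of Python. The second helper shows what a naive "drop sampling" by 25.6 does to the sample spacing, assuming it simply keeps the nearest earlier input sample (how the original decimator picked samples is not stated, so that part is a guess).

```python
from fractions import Fraction


def resample_factors(rate_in_hz, rate_out_hz):
    """Smallest integer (interpolate, decimate) pair for a rational rate change."""
    ratio = Fraction(rate_out_hz, rate_in_hz)
    return ratio.numerator, ratio.denominator


def drop_sample_indices(count, num=128, den=5):
    """Input indices kept by naive decimation by num/den (128/5 = 25.6)."""
    return [(k * num) // den for k in range(count)]


up, down = resample_factors(9600, 375)
print("9600 Hz -> 375 Hz: interpolate by %d, decimate by %d" % (up, down))
print("12000 Hz -> 375 Hz:", resample_factors(12000, 375))

indices = drop_sample_indices(6)
spacings = [b - a for a, b in zip(indices, indices[1:])]
print("naive spacing between kept samples:", spacings)
```

The uneven spacing means the kept samples are no longer taken at a constant rate, which is one way to see why this approach distorts the signal.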

But doing that puts increased timing / realtime load on the audio stream. Which in turn probably makes the fax lost sample problem worse. So I'm going to look into that issue now.

dev-zzo commented 7 years ago

I've taken time today to experiment with that audio to IQ conversion. The simple implementation above is really hacky and the quality is abysmal if you look at the resulting spectrum; this is especially obvious when looking at the black tone, which has an in-band image spur only 5 dB lower than the expected one. Not that it affects the image quality a lot, but there are visible artifacts (wavy patterns). It does its job as a really quick and dirty way to FM demodulate audio samples close to fS/4, but quality can be improved. I tried replacing it with "brute force" FFT-based code, which yielded higher quality images; I suspect this is mainly due to the increased sample rate rather than to spurs being eliminated. I could not measure by how much the load increased, as it stays below 5% at all times. Unfortunately, the new code is more susceptible to noise, as there is no longer any averaging that would eliminate at least some of it.

The screenshot above looks very nice indeed, you've made good progress! I think the next step could be to implement a resampler to adjust the image's width and remove slant -- and it's as good as some of the "commercial" implementations. :-)

jks-prv commented 7 years ago

I'm now using the same FM demod in the fax code as used in the Kiwi NBFM demod. It is code from the csdr library that is part of OpenWebRX. The demod is incredibly simple. I don't understand how it works. It's just Icur*(Qcur-Qprev) - Qcur*(Icur-Iprev), for normalized I & Q. Nothing else. I could never get the PLL-based FM demod code from CuteSDR to work. The original fax code I started with had a nasty asin() after the demod, but I found it wasn't needed. I could get a perfectly fine linear grayscale ramp for a 1500 to 2300 Hz tone without it. I'm probably missing something..

dev-zzo commented 7 years ago

That's actually a very nice insight, that formula. Noted down. :-)

Let Icur = cos(x+dx), Iprev = cos(x), Qcur = sin(x+dx), Qprev = sin(x) -- by definition of I and Q from the phase x.

Then:

Y = Icur*(Qcur-Qprev) - Qcur*(Icur-Iprev) =
= cos(x+dx)*(sin(x+dx) - sin(x)) - sin(x+dx)*(cos(x+dx) - cos(x)) =
= cos(x+dx)*sin(x+dx) - cos(x+dx)*sin(x) - sin(x+dx)*cos(x+dx) + sin(x+dx)*cos(x) =
[ note that the term cos(x+dx)*sin(x+dx) cancels out ]
= sin(x+dx)*cos(x) - cos(x+dx)*sin(x) =
[ by noticing it fits a certain trig identity ]
= sin(x+dx-x)
= sin(dx)

I guess this explains the asin() call in the original code. I quite like the idea behind this, especially if your samples come in already normalized.
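The derivation can be checked numerically with a short Python sketch. The sample rate and test tone are arbitrary assumptions, and `fm_demod` is a hypothetical helper written for this check rather than the csdr or Kiwi code.

```python
import math


def fm_demod(i_samples, q_samples, use_asin=True):
    """Y = Icur*(Qcur-Qprev) - Qcur*(Icur-Iprev) for normalized I/Q.

    Returns sin(dx) per sample, or dx itself when use_asin is True.
    """
    out = []
    for n in range(1, len(i_samples)):
        i_cur, q_cur = i_samples[n], q_samples[n]
        i_prev, q_prev = i_samples[n - 1], q_samples[n - 1]
        y = i_cur * (q_cur - q_prev) - q_cur * (i_cur - i_prev)
        if use_asin:
            # Clamp against rounding error before inverting the sine.
            y = math.asin(max(-1.0, min(1.0, y)))
        out.append(y)
    return out


fs = 12000.0    # assumed sample rate
tone = 1900.0   # assumed test tone, below fs/4 so that |dx| < pi/2
phase = [2 * math.pi * tone * n / fs for n in range(64)]
i_s = [math.cos(p) for p in phase]
q_s = [math.sin(p) for p in phase]

est_hz = [dx * fs / (2 * math.pi) for dx in fm_demod(i_s, q_s)]
raw = fm_demod(i_s, q_s, use_asin=False)
```

Without asin() the output is sin(dx) rather than dx. That is still monotonic as long as |dx| stays below pi/2, i.e. for tones below fS/4, so the tone-to-grey mapping keeps its order even though it is not exactly linear.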

jks-prv commented 7 years ago

Okay, I'm pretty certain the sample drop problem is just a case of the Beagle data pump code not being able to service audio transfer interrupts from the FPGA consistently enough. I freed up some FPGA memory with some recent optimizations. So I'm going to make the audio buffer larger and try a different transfer strategy.

jks-prv commented 7 years ago

The v1.91 release has 4 times the audio buffer in the FPGA, and some new sequence and histogram code to track the effectiveness of the increase. The small jumps seem to be fixed, although more testing is needed.

I still occasionally see large shifts. These seem to happen when the fax task on the Beagle (or audio output task for regular connections) gets severely delayed for some reason. I'm still trying to figure out why. On the Beagle side there is 650 msec of buffering which should be more than enough. It's a serious bug if tasks are getting delayed that long.

The fax extension is not ready for release yet, so it's not in the extensions menu. But you can get to it by specifying it in the URL, e.g. kiwisdr.local:8073/?ext=fax,7880 (7880 = Hamburg). Not many features besides start/stop/clear (file doesn't work yet). Click in the fax image and that x position will be shifted to the left margin (poor man's phasing). No shear alignment yet. No image scrolling, etc.

dev-zzo commented 7 years ago

The latest patches seem to fix the issue! I am so happy right now. :-) Thanks!

dev-zzo commented 7 years ago

Ah, I was too hasty in closing the issue. It does reproduce, although very rarely: within one hour, only a single image was broken.

jks-prv commented 7 years ago

Was it a big shift? Like 1/4 - 1/2 a scan-line shifted? That's what I'm seeing here. That problem seems not to be an FPGA buffering problem but this other problem I mentioned. Looking at that now..

jks-prv commented 7 years ago

Update: The kiwisdr.local:8073/?ext=fax thing I mentioned doesn't work because the fax code isn't in the v1.91 release like I thought. I'll clean it up a little and get it into the v1.92 release today.

dev-zzo commented 7 years ago

Yes, about 1/3 of a line was lost. I've uploaded a file named 20170604T1340Z_9982500.png to the shared folder above. So far this has been the only artifact on the record.