jks-prv / Beagle_SDR_GPS

KiwiSDR: BeagleBone web-accessible shortwave receiver and software-defined GPS
http://kiwisdr.com

Weird shift on WSPR occurring every once in a while #98

Status: Closed (elafargue closed this issue 7 years ago)

elafargue commented 7 years ago

Not sure what causes this; it might be linked to the frequency getting resynchronized to the GPS signal. Every once in a while, I can see a frequency shift occurring on the WSPR decoder, as shown in the screen captures below. Is this expected?

In those screenshots I am not changing the VFO frequency, just monitoring. The shift occurs by itself.

[Screenshots: four captures of the WSPR waterfall, taken 2017-04-26 and 2017-04-27]

jks-prv commented 7 years ago

Yes, this problem was reported to me in an email. It happens when an "outlier" GPS position/timing solution causes a big jump in the corrected ADC clock frequency. Then the averaging slowly negates the effect of the one bad solution (I use an 8-period modified moving average). There is already an outlier filter of +/- 50ppm, which is the manufacturing tolerance of the ADC XO. It's wide enough to correct any initial XO offset due to temperature. But the filter really needs to be narrowed after that, because no subsequent solution should exceed a few ppm. There is always some jitter in GPS solutions, as you know (I should write an extension that shows a scatter plot of the accumulated GPS solutions).
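
For illustration, the outlier window combined with an 8-period modified moving average could be sketched roughly as follows. This is a minimal sketch, not the actual KiwiSDR code: the names `clk_filter_t` and `clk_filter_update` are hypothetical, and measuring the window relative to the nominal 66.7 MHz is an assumption.

```c
#include <stdbool.h>

#define NOMINAL_HZ   66.7e6   /* nominal ADC clock quoted in the thread */
#define MMA_PERIODS  8        /* modified moving average length quoted in the thread */

/* Hypothetical filter state; not taken from the KiwiSDR sources. */
typedef struct {
    double avg_hz;   /* current corrected ADC clock estimate */
    bool   primed;   /* false until the first accepted solution */
} clk_filter_t;

/* Feed one GPS-derived clock measurement. Returns false if it was rejected
 * because it lies more than window_ppm away from the nominal frequency. */
static bool clk_filter_update(clk_filter_t *f, double measured_hz, double window_ppm)
{
    double offset_ppm = (measured_hz - NOMINAL_HZ) / NOMINAL_HZ * 1e6;
    if (offset_ppm > window_ppm || offset_ppm < -window_ppm)
        return false;

    if (!f->primed) {
        /* The first accepted solution seeds the average. */
        f->avg_hz = measured_hz;
        f->primed = true;
    } else {
        /* Modified moving average: new = old + (x - old) / N */
        f->avg_hz += (measured_hz - f->avg_hz) / MMA_PERIODS;
    }
    return true;
}
```

With a 50 ppm window, a single bad solution that is 45 ppm high still passes the check and shifts the estimate at once by one eighth of its error. Later good solutions then pull the estimate back only gradually, which matches the jump-and-slow-recovery shape seen on the waterfall.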

Example: At the 66.7 MHz XO frequency, 50ppm is 3.3 kHz. At a 7 MHz receive frequency this scales to 350 Hz. So a worst case outlier at +/- 50ppm would cause the WSPR waterfall to jump by 350 Hz and then slowly recover. Completely unacceptable.
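
The arithmetic in this example can be checked with a one-line helper. The function name `ppm_to_hz` is hypothetical and only serves the illustration.

```c
/* Frequency error in Hz caused by a fractional clock error given in ppm.
 * Hypothetical helper for illustration only. */
static double ppm_to_hz(double freq_hz, double ppm)
{
    return freq_hz * ppm / 1e6;
}
```

`ppm_to_hz(66.7e6, 50.0)` gives 3335 Hz (the 3.3 kHz above) and `ppm_to_hz(7e6, 50.0)` gives 350 Hz. By the same formula, even a 1 ppm error is 7 Hz at 7 MHz.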

The fix is a little complicated because of the corner cases. What if there is no GPS reception for a while (possibly because acquisition is off while there are active SDR connections and the tracked sats have all gone out of range) and the temperature in the room changes significantly? You really want to detect that case and widen the filter for one solution to recapture any temperature offset. Another one: what if the solution used for the temperature correction itself happens to be an outlier? You don't want to accept it blindly and immediately narrow the filter, because you'll get no more corrections after that: all the subsequent solutions, although correct, fall outside the filter range. So the right thing to do for the initial temperature correction is to take a sample of solutions and throw out any outliers.

Or you could change the entire algorithm instead, like tracking the peak of a histogram of solution offsets rather than averaging non-outliers. Fun stuff! I am no expert in any of this. So I welcome your comments and ideas.

jks-prv commented 7 years ago

On second thought, I think a simple histogram approach only works to correct a static offset, like the true manufactured XO frequency offset. If there is an offset change due to something dynamic like a temperature change, then the histogram samples have to be time-weighted somehow. If the Kiwi has run for days in a constant-temperature room and has built up a huge histogram peak as a result, and then someone opens a window in the middle of winter, the new, much smaller, temperature-shifted peak is not going to cause any correction unless the previous peak is somehow negated by virtue of being old. Applying an exponential decay to the samples might do it.
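
One way the time-weighting idea might look is sketched below. Everything here is an assumption made for illustration: the 1 ppm bin width, the decay factor of 0.99 per update, and the names `ppm_hist_t`, `hist_add` and `hist_peak_ppm`. A real implementation would probably decay by elapsed time rather than by sample count.

```c
#define HIST_BINS   101    /* 1 ppm bins covering -50..+50 ppm (assumed) */
#define HIST_DECAY  0.99   /* fraction of weight kept per update (assumed) */

/* Hypothetical exponentially decayed histogram of clock offsets. */
typedef struct {
    double weight[HIST_BINS];
} ppm_hist_t;

/* Age all existing bins, then add the new sample with weight 1. */
static void hist_add(ppm_hist_t *h, double offset_ppm)
{
    for (int i = 0; i < HIST_BINS; i++)
        h->weight[i] *= HIST_DECAY;

    /* Samples outside the histogram range still age the bins but add nothing. */
    if (offset_ppm < -50.5 || offset_ppm >= 50.5)
        return;

    int bin = (int)(offset_ppm + 50.5);   /* nearest 1 ppm bin, 0..100 */
    h->weight[bin] += 1.0;
}

/* Return the offset (in whole ppm) of the heaviest bin. */
static int hist_peak_ppm(const ppm_hist_t *h)
{
    int best = 0;
    for (int i = 1; i < HIST_BINS; i++)
        if (h->weight[i] > h->weight[best])
            best = i;
    return best - 50;
}
```

With these numbers, 500 samples at +2 ppm followed by 150 samples at -5 ppm move the peak to -5 ppm, because the old peak has decayed to roughly a fifth of its weight by then. An undecayed histogram would still report +2 ppm.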

I suppose this is all related to the impulse noise response of digital PLLs and PID controllers etc. since that might be another way to address the problem.

elafargue commented 7 years ago

Thanks for the explanation! Not easy to solve indeed, but a nice challenge :) Maybe build a histogram table in 2 dimensions, taking temperature into account, to more easily discard outliers?

jks-prv commented 7 years ago

Yes, I think something like that would work well.

There is an unexpected benefit of doing this correction all in software and FPGA firmware instead of with a traditional DAC + loop filter driving the control pin of a VCXO. I didn't even realize it at the time. It means each user, on each SDR channel, can have their own correction strategy. Or, more likely, the option to have no correction at all while they are connected, to prevent disturbing their phase-sensitive applications. I have one user who makes extremely sensitive ionograms over long periods of time by VAC'ing the browser audio into his external program. He can show me effects on his plots similar to the WSPR waterfall glitches. So I really need to get this fixed.

jks-prv commented 7 years ago

Okay, the v1.84 release going out today fixes this problem.

It turned out to be a programming issue. There is some C code that reassembles a software copy of the 48-bit counter inside the FPGA that counts "ticks" of the 66.7 MHz ADC clock. This code sometimes failed in a very strange way having to do with sign extension when constructing 64-bit C values. The fix is simple, but I still don't fully understand why it happens to begin with. It is demonstrated in tools/ext64.c. Maybe someone can explain it to me. The failures sometimes produced corrected clock values that passed the +/- 50ppm filter window and caused the observed WSPR waterfall behavior.
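
As a general illustration only (not a reconstruction of tools/ext64.c, whose contents are not shown in this thread), here is one well-known way sign extension can corrupt a 64-bit value assembled from 16-bit pieces in C. If two 16-bit words are combined in a signed 32-bit intermediate and bit 31 ends up set, the intermediate is negative, and converting it to a 64-bit type copies the sign bit into the upper 32 bits. The three-word layout and the function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical layout: the 48-bit counter arrives as three 16-bit words,
 * w0 = bits 0..15, w1 = bits 16..31, w2 = bits 32..47. */

/* Problematic: the low two words pass through a signed 32-bit value.
 * The uint32_t to int32_t conversion is implementation-defined, but on
 * common two's complement compilers it yields a negative number when
 * bit 15 of w1 is set. */
static uint64_t assemble_via_signed(uint16_t w0, uint16_t w1, uint16_t w2)
{
    int32_t low = (int32_t)(((uint32_t)w1 << 16) | w0);
    /* Converting a negative int32_t to uint64_t sets bits 32..63. */
    return ((uint64_t)w2 << 32) | (uint64_t)low;
}

/* Safer: widen each word to an unsigned 64-bit type before shifting. */
static uint64_t assemble_unsigned(uint16_t w0, uint16_t w1, uint16_t w2)
{
    return ((uint64_t)w2 << 32) | ((uint64_t)w1 << 16) | (uint64_t)w0;
}
```

For the words `0x0000, 0x8000, 0x0001` the unsigned version returns `0x0000000180000000`, while the signed-intermediate version returns `0xFFFFFFFF80000000` on typical compilers. The failure only shows up when the top bit of the middle word is set, which would make it look intermittent.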

The new code drops the window size to +/- 1ppm after the initial correction that accounts for temperature offset. It then widens the window to 50ppm again if a number of sequential GPS solutions start falling outside it, which could occur, for example, if there were no GPS solutions for a while and the temperature drifted. It seems to work pretty well.
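
The narrow-then-widen behavior described above can be sketched as a small state machine. This is not the v1.84 code. The reject threshold of 5, the re-centering on the nominal frequency when the window widens, and the names `adaptive_window_t`, `aw_init` and `aw_update` are all assumptions. For brevity the sketch simply adopts each accepted solution instead of averaging it.

```c
#include <stdbool.h>

#define WIDE_PPM                 50.0  /* XO manufacturing tolerance from the thread */
#define NARROW_PPM               1.0   /* window after the initial correction */
#define MAX_CONSECUTIVE_REJECTS  5     /* assumed; the thread gives no number */

/* Hypothetical state for the adaptive outlier window. */
typedef struct {
    double center_ppm;   /* offset the window is centered on */
    double window_ppm;   /* current half-width of the window */
    int    rejects;      /* consecutive solutions outside the window */
} adaptive_window_t;

static void aw_init(adaptive_window_t *w)
{
    w->center_ppm = 0.0;       /* start centered on the nominal frequency */
    w->window_ppm = WIDE_PPM;
    w->rejects = 0;
}

/* Returns true if the solution was accepted. */
static bool aw_update(adaptive_window_t *w, double solution_ppm)
{
    double d = solution_ppm - w->center_ppm;
    if (d < 0.0)
        d = -d;

    if (d <= w->window_ppm) {
        w->center_ppm = solution_ppm;   /* simplification: no averaging here */
        w->window_ppm = NARROW_PPM;
        w->rejects = 0;
        return true;
    }

    /* Too many rejects in a row suggests real drift, so reopen the window. */
    if (++w->rejects >= MAX_CONSECUTIVE_REJECTS) {
        w->center_ppm = 0.0;
        w->window_ppm = WIDE_PPM;
        w->rejects = 0;
    }
    return false;
}
```

A lone outlier is rejected and forgotten as soon as the next good solution arrives. Only a run of consecutive rejects reopens the 50 ppm window, after which the next solution is accepted and the window narrows again.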

The corrected clock value is now stored per-connection. So in the future when there are user preferences you'll be able to specify that you don't want incremental corrections that might disturb your phase-sensitive applications.

elafargue commented 7 years ago

Awesome, thanks for the fix!

jks-prv commented 7 years ago

Please let me know if you see any additional problems or have ideas for improvements.

I really need to study the response to typical temperature changes more. I let the XO get hit by direct sunlight as the sun came through a window during its normal progression across the sky (figuring that would be a typical scenario). The current algorithm seemed to track okay. Hitting it with a hair dryer was another story, but that test was complicated by the fact that the GPS lost lock, probably because its TCXO exceeded 1.5ppm even inside the shield can.