Trunk recorder stops recording after a few hours

jamesaisenberg commented 3 years ago

Thanks so much for all the work that's gone into trunk-recorder!

I've been having an issue where trunk-recorder stops recording after a few hours. The error message I am getting is this, repeated every second or so:

[2021-08-30 17:29:38.010714] (info)   Pll Phase: -2.78317 min Freq: -6.283185e-01 Max Freq: 6.283185e-01
[2021-08-30 17:29:38.010768] (error)   [kcers3] Retuning to Control Channel: 8.531750e+08
[2021-08-30 17:29:38.010779] (info)      - System Source 0 - Min Freq: 8.500500e+08 Max Freq: 8.599500e+08
[2021-08-30 17:29:38.010831] (error)   [kcers3]  Control Channel Message Decode Rate: 0/sec, count:  0

Exiting and starting trunk-recorder fixes the problem. I am using the latest docker image, and my hardware is a BladeRF 2 xA40.

Thanks!

bctrainers commented 3 years ago

Could you post your json config in a code block? We can begin from there after. :-)

jamesaisenberg commented 3 years ago

Sure! Here it is:

{   
    "ver": 2,
    "sources": [
        {   
            "center": 855000000.0,
            "rate": 10000000.0,
            "error": 0,
            "ppm": 0,
            "agc": false,
            "gain": 14,
            "ifGain": 16,
            "bbGain": 32,
            "analogRecorders": 24,
            "digitalRecorders": 12,
            "driver": "osmosdr",
            "device": "bladerf,fpga=/app/fpga.rbf"
        }
    ],
    "systems": [{
        "control_channels": [853175000,853675000,854362500,851412500],
        "type": "smartnet",
        "bandplan": "800_reband",
        "talkgroupsFile": "talkgroups.csv",
        "talkgroupDisplayFormat": "id_tag",
        "shortName": "kcers3",
        "recordUnknown": true,
        "apiKey": XXX,
        "pyScript": "tester",
        "modulation": "fsk4",
        "squelch": -49,
        "analogLevels": 10,
        "digitalLevels": 4
    }],
    "callTimeout": 5,
    "uploadServer": "https://api.openmhz.com",
    "defaultMode": "analog"
}

bctrainers commented 3 years ago

If memory serves right, you should have the ability to console-into/query the BladeRF. If so, mind posting the following command results?

info - remove the serial value.
print clock_sel
version
print hardware

With print hardware I'm more curious to know your temperature reported. You may have frequency drift happening, resulting in the given issue you're seeing.

bctrainers commented 3 years ago

Keep in mind, the BladeRF has a built-in TCXO on the board. There is a chance it's malfunctioning or has a calibration error. I think it's within a 1 ppm error range, any more than that - there's an issue.

jamesaisenberg commented 3 years ago

Here it is. The only reason why I'm skeptical of a thermal issue is because stopping and restarting trunk-recorder immediately fixes the problem, and the device should be pretty much at the same temperature before vs after.

bladeRF> info

  Board:                    Nuand bladeRF 2.0 (bladerf2)
  Serial #:                 XXX
  VCTCXO DAC calibration:   0x1c43
  FPGA size:                49 KLE
  FPGA loaded:              yes
  Flash size:               32 Mbit
  USB bus:                  2
  USB address:              3
  USB speed:                SuperSpeed
  Backend:                  libusb
  Instance:                 0

bladeRF> print clock_sel

  Clock input:     Onboard VCTCXO

bladeRF> version

  bladeRF-cli version:        1.8.0-git-c8320f71-ppafocal
  libbladeRF version:         2.2.1-git-c8320f71-ppafocal

  Firmware version:           2.3.1
  FPGA version:               0.11.0 (configured by USB host)

bladeRF> print hardware

  Hardware status:
    RFIC status:
      Tuning Mode:  Host
      Temperature:  30.7 degrees C
      CTRL_OUT:     0xf8 (0x035=0x00, 0x036=0xff)
    Power source:   USB Bus
    Power monitor:  4.824 V, 0.57 A, 2.76 W
    RF routing:
      RX1: RFIC 0x0 (A_BAL  ) <= SW 0x0 (OPEN   )
      RX2: RFIC 0x0 (A_BAL  ) <= SW 0x0 (OPEN   )
      TX1: RFIC 0x0 (TXA    ) => SW 0x0 (OPEN   )
      TX2: RFIC 0x0 (TXA    ) => SW 0x0 (OPEN   )

bctrainers commented 3 years ago

Hmmm, looking at the configuration details, things look to be correct. You may consider updating the firmware/data on the blade; https://github.com/Nuand/bladeRF/releases

Is that temperature with the BladeRF at an idle state, or did you kill the recorder process then run the command?

I'm inclined to say it's a clock issue due to thermal drift due to the PLL Phase segment in the TrunkRecorder console log.

Also, I am not sure how the docker image pulls the TrunkRecorder stuff - so not sure which git revision that you're on. You're running on the correct USB mode for the Blade, so that's good (ample bandwidth).

robotastic commented 3 years ago

So, I maybe remembering things wrong, but there could be some lurking issue with the Smartnet trunking channel decoder. I think there is something in the decode graph that can just get stuck. There is a horrible hack that works... and I use. You can try adding "controlRetuneLimit": 10, to the high-level settings in the config.json. If it fails to successfully retune the control channel after 10 tries, it will exit.

If you use a script like this: https://github.com/robotastic/trunk-recorder/blob/master/examples/auto-restart.sh You can then just auto-restart the instance.

This is of course sub-optimal. I would love to hunt down the actual cause. I think it may just be that the SmartNet decoder is more sensitive to slight tuning errors. It might be worth trying to add +/-100Mhz to the error setting. It is far fetched, but maybe try a different bandwidth? It could also be that some of the filter parameters do not work cleanly at 10MHz, might be worth trying 12MHz or 16MHz.

theficus commented 2 years ago

I’m having the exact same issue, but with a new setup using two RTL-SDR dongles. Like with OP, it doesn’t seem to be a heat-related drift issue. My RTL-SDRs have TCXOs, and restarting trunk-recorder fixes it right away.

I’m a newbie to all this, and I was originally trying to set this up on a Raspberry Pi 3 B. What I would find is it would work for about 5 minutes then just kind of stop receiving control messages without any rhyme or reason.

I thought it was some misconfiguration on my end, so I spent a whole bunch of time trying different things. I ran rtl_test against the two dongles and even after running for about an hour both were around -1/-2 (in fact, changing ppm in the config to anything other than 0 causes the whole thing to stop getting control messages). I basically have LOS with the tower, and even with a cheap antenna mounted indoors in a very noisy area for RF about 20db is the sweet spot for required gain. The RTL-SDR dongles aren’t even particularly hot when things get stuck.

For the sake of comparison, I tried the exact same setup with exact the same build of trunk-recoder on my Mac and it worked great… for about an hour and then the same thing happened again. Ugh.

Like OP, I am trying to trunk track a SmartNet system — in fact I’m also trunk tracking KCERS (just a different site). When things work, they work great. It’s just weird that things just kinda get stuck after a little while.

I am hoping to use my Raspberry Pi to drive this setup, so the restart workaround isn’t really tenable if it gets stuck every few minutes on that platform. :) If there’s any more information I can collect or any debugging I can do that might help get to a fix for this, I’m happy to help!

theficus commented 2 years ago

Update: after spending a bunch of time tweaking my settings and antenna position, my Pi 3 recorder now runs for a few hours without getting into the stuck state so it's basically parity with my MacOS experience. Still a bummer that it needs the auto-restart back in place, but overall reliability is acceptable.

If there's anything I can do to help get this fixed once and for all, please just let me know what I can do. I am happy to do gdb-based debugging if you can tell me what I should be looking for. :)

robotastic commented 2 years ago

Yea - I wish there was a better solution. My bet is that it has something to do with this GNU Radio block somehow getting lost.... it is definitely one of the blocks in this file!

https://github.com/robotastic/trunk-recorder/blob/master/trunk-recorder/systems/smartnet_trunking.cc#L173

It looks like you might be able to use Symbol Sync instead... but I have no idea how to do that.

https://github.com/gnuradio/gnuradio/issues/5556

theficus commented 2 years ago

Just an update. I'm using the latest builds of trunk-recorder and perhaps most importantly I've switched to a different antenna tuned for 800-900MHz and added a LNA into the chain to help compensate for a suboptimal RF environment.

Since doing all this and tweaking the config a little more, trunk-recorder has managed to run for 4 days straight recording some 10000 calls without a single restart or hiccup. It's continually pulling a 35-45 decode rate and I haven't seen it drop lower than that a single time.

I'm thrilled it's working better now, but as an engineer at heart it's frustrating to not really know what ultimately "fixed" things. Was it something that just happened to be fixed in a newer build of trunk-recoder, gnuradio, or op25? Was having the better antenna and pre-amplifier (and thus ability to drop the SDR gain from 25 down to 7) reducing heat-related frequency drift? (I'm doubtful, as the SDR has a TCXO and because after a restart it would immediately lock back on to the control channel.) Or maybe the better antenna setup compensated for some other internal issue by having a more consistently "good" signal rather than one that would ebb and flow so whatever was triggering this issue was less likely to happen.

I'm unsure, but it seems to be working great for me now. 🤷‍♂️

theficus commented 2 years ago

I'm unsure, but it seems to be working great for me now. 🤷‍♂️

Update: seeing this happen again, but it's maybe like once or twice a day. Far better than the a couple times an hour originally. When it restarts, it immediately locks on to the control channel so I'll lose at most maybe a minute or two worth of calls.

robotastic commented 2 years ago

Totally - this has been vexing me. One guess is that a questionable control signal somehow locks up some of the GNU Radio blocks. There haven't been any code changes recently that would change things for smartnet decoding.

riverar commented 1 year ago

[2023-04-09 16:16:25.007705] (info)   Pll Phase: 3.1173 min Freq:  -0.000001 MHz Max Freq:   0.000001 MHz
[2023-04-09 16:16:25.007825] (error)   [KCERS]   Control Channel Message Decode Rate: 0/sec, count:  0
[2023-04-09 16:16:28.010110] (error)   [KCERS] Retuning to Control Channel: 853.450000 MHz
[2023-04-09 16:16:28.010249] (info)      - System Source 0 - Min Freq: 847.912000 MHz Max Freq: 857.812000 MHz

Am seeing this issue too on a standalone box, on a HackRF w/ a TCXO and LNA. Every time I stop trunkrecorder and restart it (or look manually via gqrx), everything is okay.

Happy to provide any data/tracing needed.

{
    "ver": 2,
    "logFile": false,
    "logLevel": "info",
    "debugRecorder": false,
    "audioStreaming": false,
    "sources": [{
        "center": 852862000.0,
        "rate": 10000000.0,
        "ppm": 0.0,
        "gain": 14,
        "ifGain": 24,
        "bbGain": 20,
        "antenna": "TX/RX",
        "analogRecorders": 10,
        "driver": "osmosdr",
        "device": "hackrf=936e63",
        "defaultMode": "analog"
    }],
    "systems": [{
        "audioArchive": true,
        "control_channels": [853425000, 851812500, 853450000, 856887500],
        "type": "smartnet",
        "squelch": -75.0,
        "modulation": "fsk4",
        "talkgroupsFile": "tg.csv",
        "bandplan": "800_reband",
        "recordUnknown": false,
        "shortName": "KCERS",
        "audioArchive": true
    }],
    "plugins": [
      {
        "name":"simplestream",
        "library":"libsimplestream.so",
        "streams":[{
          "TGID": 0,
          "address": "1.2.3.4",
          "port": 9123,
          "sendTGID": false
          }]
      }
    ]
}

robotastic commented 1 year ago

I would love to find a solution for this - I have just been using the auto restarter if it loses the control channel for too long... it is hacky, but works.

riverar commented 1 year ago

@robotastic Oh, didn't know that existed. Can you point me the right direction? (I know there's a shell script in a loop but that covers the process crashing.)

taclane commented 1 year ago

Oh, didn't know that existed. Can you point me the right direction? (I know there's a shell script in a loop but that covers the process crashing.)

There's a "secret" config setting (not that secret, just hasn't made it into the documentation). At the top level of the config.json set something like: "controlRetuneLimit": 5,

If it can't find a control channel after X attempts (pick a number at least as high as the number of control channels), trunk-recorder will quit, and either a shell script or systemd service can fire it back up.

rosecitytransit commented 1 year ago

I wonder if this and #792 are related? Have you tried the fix in the cc-retune branch?

theficus commented 1 year ago

I see this and I only have one control channel set.

robotastic commented 1 year ago

There's a "secret" config setting (not that secret, just hasn't made it into the documentation). At the top level of the config.json set something like: "controlRetuneLimit": 5,

this is a good reminder to get that documented... This is the script I use for auto-restarting: https://github.com/robotastic/trunk-recorder/blob/master/examples/auto-restart.sh

rosecitytransit commented 1 year ago

this is a good reminder to get that documented

I can do that if you want. I have an entry for the logDir option (actual feature was added in https://github.com/robotastic/trunk-recorder/commit/5177bcee5012d6189192fe669043f343b0a74d49) and some minor changes to submit. Want to submit it along with an option to log the decoding rates to a file (in place of the controlWarnUpdate idea I proposed). Should have it soon.

evanconnelly commented 1 year ago

Hi all! I just setup trunk-recorder for the first time. I seem to be running into the same or a very similar issue. It happens even more frequently for me--roughly every two minutes. With RTL-SDRs in my case.

[2023-07-28 19:49:05.016578] (error) [aacounty] Retuning to Control Channel: 854.412500 MHz [2023-07-28 19:49:05.016752] (info) - System Source 0 - Min Freq: 854.211875 MHz Max Freq: 856.131875 MHz [2023-07-28 19:49:05.016860] (info) Pll Phase: 3.79648 min Freq: -0.000001 MHz Max Freq: 0.000001 MHz [2023-07-28 19:49:05.016995] (error) [aacounty] Control Channel Message Decode Rate: 0/sec, count: 0

I'm brand new so maybe I'm missing something obvious in my config. I could try to set controlRetuneLimit and have tr terminate/restart, but with how frequently this is happening, I'm thinking I should try to troubleshoot the root cause. Any thoughts on where to check?

{
    "ver": 2,
    "sources": [

        {
            "center": 855171875,
            "rate": 2048000,
            "error": 700,
            "gain": 30,
            "digitalRecorders": 4,
            "analogRecorders": 4,
            "driver": "osmosdr",
            "device": "rtl=1"
        }, {
            "center": 856513400,
            "rate": 2048000,
            "error": 700,
            "gain": 30,
            "digitalRecorders": 2,
            "analogRecorders": 2,
            "driver": "osmosdr",
            "device": "rtl=2"
        }
    ],
    "systems": [{
        "control_channels": [854337500,854412500,854812500,855162500],
        "type": "smartnet",
        "talkgroupsFile": "/home/evan/trunk-recorder/aa-p16.csv",
        "modulation": "fsk4",
        "squelch": -60,
        "analogLevels": 8,
        "digitalLevels": 2,
        "shortName": "aacounty",
        "apiKey": "REDACTED"
    }],
    "uploadServer": "https://api.openmhz.com"
}

robotastic / trunk-recorder

Trunk recorder stops recording after a few hours #504