kpreid / shinysdr

Software-defined radio receiver application built on GNU Radio with a web-based UI and plugins. In development, usable but incomplete. Compatible with RTL-SDR.
https://shinysdr.switchb.org/
GNU General Public License v3.0
1.08k stars 115 forks

Overrun and hang after disconnecting #40

Closed 0xc0re closed 8 years ago

0xc0re commented 9 years ago

When a client closes a connection and there are no other clients listening, ShinySDR hangs and outputs the following text:

INFO:shinysdr:Closing connection: '' (1001)
INFO:shinysdr:Closing connection: '' (1001)
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

I eventually have to 'kill -9' the process and restart it.

From what I have read, this is a buffer overflow issue related to GNU Radio. Is there any solution to this? I have tried it on two different computers, both with substantial resources.

Any help would be appreciated. I really like this software. You did a great job on it.

kpreid commented 9 years ago

'O's don't indicate a "buffer overflow". They indicate an overrun, that is to say the hardware is producing samples faster than the software is processing them. Under normal conditions, occasional overruns are harmless (other than to the signal being received), but continuous 'O's indicate trouble. In this case, since you say it hangs, the likely cause is that something is locking up the flowgraph. More troubleshooting will be required.

First, what happens if you try to exit with ^C (or kill -15) instead of kill -9? If that prints something, but doesn't finish exiting, then it's more likely to be on the flowgraph side of things; if there is no response at all, then something done from the Python side locked up.

Another thing to do is try running ShinySDR with the --force-run option; this will prevent it from shutting off the signal processing when there are no clients connected, at the cost of continuous CPU usage. Please try it and see whether or not this makes a difference.

Finally, we can start looking for where the hang occurs by scattering print statements. The place to start would be __start_or_stop in top.py and its callers.
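To be concrete about what I mean by "scattering print statements": something along these lines would do (a sketch only, not existing ShinySDR code, and shown in Python 3 syntax even though ShinySDR runs on Python 2; the `__start_or_stop` placement is the one named above):

```python
import sys
import time

def trace(msg):
    # Timestamped, immediately flushed trace line, so the output is
    # visible even if the process hangs right after printing it.
    sys.stderr.write('%.3f TRACE %s\n' % (time.time(), msg))
    sys.stderr.flush()

# Hypothetical placement inside top.py:
#   trace('entering __start_or_stop')
#   ... existing start/stop logic ...
#   trace('leaving __start_or_stop')
```

The flush matters: if the process locks up with output still sitting in a buffer, the last trace line you see will be misleading.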

0xc0re commented 8 years ago

Thanks for the quick reply. I really appreciate it. I tried the solutions you suggested. The flag '--force-run' had no effect, and ^C and/or 'kill -15' would never kill the process. In order to kill the process, I have to open another terminal or pause the process using ^Z and then 'kill -9' it.

Below is the output of the overrun:

INFO:shinysdr:Closing connection: '' (1001)
INFO:shinysdr:Closing connection: '' (1001)
INFO:shinysdr:Flow graph: Rebuilding connections because: removed audio queue
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO^COOOOOOOOOOOOOOOOOOOOOOO^Z

I am attempting to use this on two machines. Both use the RTL2838 DVB-T dongle, but the up-converters in the two dongles differ: one dongle has a physical switch for switching to HF frequencies, while the other does not.

This is the config file for the dongle with the physical switch for HF: http://pastebin.com/eVGDD36U
This is the config file for the dongle with no switch: http://pastebin.com/3UYqYggf

I guess the next option would be to add some print statements to the top.py file like you mentioned. I will look at that file and add some, unless you have any other suggestions.

Again, thanks for the reply. This software is fantastic.

kpreid commented 8 years ago

I have never used direct_samp mode and there might be driver quirks I don't know about there, but you do have the other one without it that also fails.

I'm surprised that --force-run didn't make a difference, because broadly speaking it should remove the effect of clients leaving. However, there's still the flowgraph reconfiguration part, so another thing I'd like you to poke at is whether you can cause it to hang by adding and removing receivers.

Other than that, go ahead and work through the print statements. Since --force-run didn't make a difference, I'd direct attention to the lock/unlock mechanisms primarily over start/stop.

Oh, and do you have any tools that can provide a recording of the process's thread stacks while it's hung? That might be very useful in identifying the problem. On OS X we have spindump/Sampler but I'm not familiar with the options elsewhere.
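On Linux, one option (a sketch of a general technique, not something ShinySDR does itself) is the stdlib `faulthandler` module; it's Python 3 stdlib, and there is a `faulthandler` backport package on PyPI for the Python 2 that ShinySDR runs on:

```python
import faulthandler
import signal

# Dump every thread's Python stack to stderr when the process receives
# SIGUSR1, so `kill -USR1 <pid>` works even while it appears hung.
faulthandler.register(signal.SIGUSR1)

# Stacks can also be dumped on demand, without a signal:
faulthandler.dump_traceback(all_threads=True)
```

Note this only shows Python frames; if the hang is down inside GNU Radio's C++ scheduler, a native tool such as `gdb -p <pid> -batch -ex 'thread apply all bt'` would show more.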

0xc0re commented 8 years ago

I added print statements to the lock/unlock mechanisms I was able to find in top.py. Below is the output of the latest crash:

INFO:shinysdr:Starting RFC 6455 conversation
INFO:shinysdr:Stream connection to /kc5hhq/radio
_recursive_lock_hook(self) LINE 412
_recursive_lock_hook(self) LINE 412
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO^@OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO^COOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO^Z
[1]+ Stopped python -m shinysdr.main --force-run kc5hhq.cfg

I'll add some more print statements, but this could be where it is halting, unless it is hanging somewhere I do not have a statement. I guess I could add the time to the print statements to see...

def _recursive_lock_hook(self):
    print '_recursive_lock_hook(self) LINE 412'
    for source in self._sources.itervalues():
        source.notify_reconnecting_or_restarting()
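In case it helps, here is roughly what I mean by adding the time (a hypothetical helper in Python 3 print syntax, not actual ShinySDR code; the method name comes from the snippet above):

```python
import time

def timed_call(label, func, *args):
    # Print on both sides of the call: a 'calling' line with no
    # matching 'done' line pinpoints exactly which call never returned.
    print('%.3f %s: calling' % (time.time(), label))
    result = func(*args)
    print('%.3f %s: done' % (time.time(), label))
    return result

# Hypothetical usage around the loop shown above:
#   for source in self._sources.itervalues():
#       timed_call('notify_reconnecting_or_restarting',
#                  source.notify_reconnecting_or_restarting)
```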

k9wkj commented 8 years ago

This still happens to me in a non-repeatable manner. Sometimes it happens if I let the demodulator run overnight, or it may not. Sometimes it happens when I close a demodulator, or it may not. It has also happened when closing the client with no demodulator running.


kpreid commented 8 years ago

You'll need to get much more specific than that, of course; all this tells us is that the hang happens after _recursive_lock_hook is entered, not which 'primitive' call (i.e. one not part of ShinySDR) blocked. Most likely it will be the actual GNU Radio lock() operation (this still needs confirming!), but that would just mean that one of the blocks involved in the flowgraph is causing trouble, without telling us which one.
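One way to confirm it (a sketch under assumptions, not code that exists in ShinySDR; `tb` here is a stand-in for the GNU Radio top block) is to wrap the suspected call in a watchdog that dumps all thread stacks if it doesn't return promptly:

```python
import sys
import threading
import traceback

def call_with_watchdog(func, timeout=5.0, label='call'):
    # If func() has not returned after `timeout` seconds, dump every
    # thread's Python stack so the blocking frame becomes visible,
    # then keep waiting for func() to (maybe) finish.
    def dump():
        sys.stderr.write('%s still blocked after %.1fs:\n' % (label, timeout))
        for tid, frame in sys._current_frames().items():
            sys.stderr.write('--- thread %s ---\n' % tid)
            traceback.print_stack(frame, file=sys.stderr)
    timer = threading.Timer(timeout, dump)
    timer.daemon = True
    timer.start()
    try:
        return func()
    finally:
        timer.cancel()

# Hypothetical usage at the suspected point:
#   call_with_watchdog(tb.lock, label='gr lock()')
```

If the dump shows the thread sitting inside the lock() call, we've at least confirmed which primitive is blocking, even if not yet why.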

The next step would be to start cutting out pieces of the flowgraph and functionality until we have a minimal program that reproduces the problem. Unfortunately, this will be harder for people who aren't me, and yet I don't experience this problem myself. (I've dealt with this class of problem before, lockups inside of GR, but not one that is triggered like this and remains unfixed.)

kpreid commented 8 years ago

Closing due to inactivity.