mikebrady / shairport-sync

AirPlay and AirPlay 2 audio player
Other
7.33k stars 575 forks source link

tuneblade connection issues. #772

Closed gibman closed 5 years ago

gibman commented 6 years ago

hi.

I'm running shairport-sync 3.2.1 from the moode audio project.

I have 5 pi's running and they all exhibit the same misbehaviour.

I also have a windows PC running tuneblade 1.7.8.0

after a few hours it seems tuneblade is unable to establish a connection with around 80% of the pi's. tuneblade automatically retries the connection over and over again as can be seen in it's UI.

It works again for a few hours if I restart the shairport-sync process.

Maybe it's previous broken tuneblade connection that is still linkering on. tuneblade does crash from time to time and it could leave shairport-sync in a bad state maybe ? this is only me guessing here :)

logs attached from one of the faulting pi's. shairport-sync.log

mikebrady commented 6 years ago

Thanks for the interesting conundrum. It seems to kick off with the

Error writing an RTSP packet -- requested bytes not fully written.

message. I don't know what the cause of the problem is, but it's compounded by the fact that Shairport Sync will not abandon the now-defective session. Let me look into that.

mikebrady commented 6 years ago

Hi there. I've just pushed a development branch update which should close a session in the above circumstances. That should give the opportunity for a new connection to be established. I'd be obliged if you'd give it a try.

gibman commented 6 years ago

thanks!..

Ill try it out asap and then let a few days go by before concluding anything. Are you interested in the logs regardless ?

mikebrady commented 6 years ago

Yes please; if it works, I'd expect it to log the fact that it dropped a session when that problem occurs. If it doesn't work, it wold be interesting to see what's happening.

mikebrady commented 6 years ago

Looking at the log a bit more closely, I'm not sure the fix will work. So, just now, I've pushed another update with some more diagnostics in it. Hopefully it will give some indications as to what's causing Shairport Sync to hang when this error happens.

gibman commented 6 years ago

no problem. I just uploaded the new binary (with diags) on all 5 pi's :=)

lets wait and see what happens...

gibman commented 6 years ago

Ok.. all 5 pi's exhibit same issue already.

here's some info from moode audio: A I R P L A Y S E T T I N G S

Version         = 3.3d21
Friendly name       = Kitchen
ALSA device     = hw:0
Volume mixer        = software
Resume MPD      = No
Output bit depth    = S16
Output sample rate  = 44100
Session interruption    = yes
Session timeout (ms)    = 120
Audio buffer (secs) = 0.0

I hope the log is sufficient.

shairport-sync.log

mikebrady commented 6 years ago

Thanks. I’m looking at the log on an iPhone and the first part of it, up to “Connection 184: new connection from 192.168.1.127:50127” is garbled. Is it me or is it the log file?

gibman commented 6 years ago

aghh.. it looks like I haven't enabled debug level logging.

hang on...

gibman commented 6 years ago

this is my how shairport-sync is started.

root 20639 1.0 0.7 8240 3232 ? S 19:19 0:00 sudo /usr/local/bin/shairport-sync -vv -a Kitchen -S soxr -w -B /var/local/www/commandw/spspre.sh -E /var/local/www/commandw/spspost.sh -- -d hw:0

root 20647 10.8 1.6 81380 7436 ? Sl 19:19 0:00 /usr/local/bin/shairport-sync -vv -a Kitchen -S soxr -w -B /var/local/www/commandw/spspre.sh -E /var/local/www/commandw/spspost.sh -- -d hw:0

so is the '-vv' parameter enough to get the wanted logging enabled ? this is what the logs attached on the previous reply is based upon.

waiting for your feedback.

mikebrady commented 6 years ago

Thanks. Yeah, that’s perfect. The problem was that there seemed to be a great deal of garbage characters in the start of the log. Maybe it’s the phone — I’ll get a chance to use a laptop later on.

gibman commented 6 years ago

yeah its garbled at my end also.

Ive included another log from my livingroom pi. seems to be the same issue regarding the garbled log.

I eyeballed the file manually over ssh/putty: more /var/log/shairport-sync.log

the more command just display blanks instead of garbled stuff.

the file was retrieved using winscp.

livingroom.shairport-sync.log

gibman commented 6 years ago

ok .. I deleted the log on the pi and restarted the process.

now I think we have a complete log to look at.

livingroom.shairport-sync2.log

mikebrady commented 6 years ago

Thanks for your efforts. I can read that last one alright.

mikebrady commented 6 years ago

Okay, I’ll have to dig into this — I think you’ve given me enough information to try it here. I’ll keep you updated on progress. Meantime, I wonder what would happen if you disabled allowing a session to be interrupted?

gibman commented 6 years ago

no problem.

I have disabled session interruption now.

logs attached, same issue.

shairport-sync.log

mikebrady commented 6 years ago

I'm getting it to misbehave alright, but haven't nailed it down yet...

mikebrady commented 6 years ago

I've noticed a few small issues and fixed them on a development version I've just pushed. TBH, I don't think they have fixed the problem, but no harm to try them out -- I haven't gotten a crash since.

I was in touch with the [very helpful] developer of TuneBlade about a new ("POST") message I noticed coming from TB. He replied with this:

Apologies for the POST message. I added this recently for some receivers which required it. Other receivers I tested simply ignored this request, so I released it. Later, I got reports from users complaining about breaking compatibility with some devices. So, I've added a check to make sure I send the request only to compatible receivers. The updated version can be found here:

http://files.breakfreeaudio.com/downloads/TuneBlade_Installer_1_7_9_0.exe

Feel free to share this link with anyone. I was hoping to add more features before making a release, but I think I'm going to release it without any further delay.

TBH, I don't think the extra message is causing the problem, but again, it can't hurt...

gibman commented 6 years ago

cool.. thanks a lot.

I'll put on the new version of TB as well as a fresh compiled version of shairport-sync asap.

You will hear from me ...

gibman commented 6 years ago

Here's the latest log showcasing the same issue. Running on latest TB as well as the latest SP code.

shairport-sync.log

mikebrady commented 6 years ago

Thanks. But it seems to have those garbage characters at the start again.

mikebrady commented 6 years ago

BTW, I've just pushed a small improvement to SPS, which just might be important. It imposes a timeout on SPS writing status information back to the source. This should remove one possible cause of hanging up.

gibman commented 6 years ago

Very nice.

I will try it out asap.

Keep you posted.

gibman commented 6 years ago

Regarding the garbled log.

Seems to appear after some time. Maybe if the log gets above a certain threshold?

mikebrady commented 6 years ago

Yes, I don't really know, TBH. The thing is, whatever happened, it seems to have happened up there in the garbled part.

mikebrady commented 6 years ago

I'm actually a little intrigued by the garbling of the log and I wonder if it could be connected in any way with the problem. A scenario that might account for all SPS devices having the same problem might be if they are all affected by some common flaw. So, is it possible that they all have the same brand of power supply, or are using the same brand of SD card and that it might be problematic? A bit of a long shot, TBH.

gibman commented 6 years ago

They all have different brands of 8 GB sd cards. They also have different types of USB chargers and cables.

A few weeks ago I installed a vanilla image of raspian stretch on one of the pi's and hooked it up to the telly.

I seem to remember it coming up with textual warnings about undervolting. More discussion here: https://www.raspberrypi.org/forums/viewtopic.php?t=210914

Remember, I have had the pi's working on raspian Jessie + an older version of shairport-sync. It could run for days. But once in a while the issue could occur.

The issue is occurring within an hour when running the latest shairport-sync.

regards gibman

gibman commented 6 years ago

fresh logs, issue just happened.

shairport-sync.log

gibman commented 6 years ago

and logs for another device (bar)

edit: looking at the logs in L:374 we have the connection 2 teardown being initiated by tuneblade. but it looks like the thread never completes as we have :

Connection 2: TEARDOWN player_thread cancel... player_thread join... Connection 3: new connection from 192.168.1.127:2150 to self at 192.168.1.118:5000.

Looking elsewhere in the log for successful tear downs, and they look like this: player_thread cancel... player_thread join... player_thread joined. pend

What seems to be missing in the connection 2 TEARDOWN : player_thread joined. pend

could it be as simple as a deadlock, preventing further connections ?

Or am I on a wild goosechase here?

bar.shairport-sync.log

mikebrady commented 6 years ago

Thanks. Your analysis is correct, I believe. The question is where the deadlock might be. I’m trying to replicate it here. From the information you gave earlier, it doesn’t seem as if there is some other cause that might be causing the problem. I’ll keep digging!

mikebrady commented 6 years ago

Something that would be useful is to enable delta timing in the logs. For some time, Shairport Sync has been able to show the time an event is logged and/or the time since the last log entry. The settings are in the diagnostics past of the configuration file (you can see an example in /etc/shairport-sync.conf.sample). If you could set log_show_time_since_last_message = "yes” (don’t forget to uncomment it!) that’s would be very useful.

gibman commented 6 years ago

done :)

Ive also bumped verbosity to level 3

It seems like tuneblade is able to establish the first connection. this happens when I play some audio related on the host running TB. Once I pause the audio on the PC then TB kicks into standby mode. If I try and have TB connect again, simply by starting the same audio then we have the problem.

Now looking at the log I attached, it seems deadlocked at the last line. The log will not grow beyond this point even though TB tries and tries to connect.

I would think that under normal circumstances the log would be filled with these connection tries.

I even tried to restart TB on the windows PC, same stuff. Log is stuck. so it seems deadlocked for sure here.

did you try and install TB and reproduce ?

I guess one could look at a wireshark dump as well, if this helps ?

kitchen.shairport-sync.log

mikebrady commented 6 years ago

Thanks again, that's great -- the timings are useful. Indeed I have been using TB here, but I'll try the scenario you describe. No need for Wireshark just yet, thanks!

mikebrady commented 6 years ago

An interesting thing that I'm trying to think through is that there's a pretty fast sequence here:

         0.000156999|Connection 1: RECORD
         1.309847869|pbeg
         0.000983993|RTSP thread 1: RTSP Response:
         0.000211998|  Type: "CSeq", content: "4"
         0.000141999|  Type: "Server", content: "AirTunes/105.1"
         0.000141999|  Type: "Audio-Latency", content: "11025"
         0.004631964|RTSP thread 1 received an RTSP Packet of type "SET_PARAMETER":
         0.000260998|  Type: "CSeq", content: "5"
         0.000143999|  Type: "Session", content: "1"
         0.000260998|  Type: "Content-Type", content: "text/parameters"
         0.000503996|  Type: "Content-Length", content: "19"
         0.000146999|  Type: "User-Agent", content: "TuneBlade/1.0 (Windows)"
         0.000143999|  Type: "Client-Instance", content: "56B6CB929BB20486"
         0.000140998|  Type: "DACP-ID", content: "13EE26B2E036204E"
         0.000137999|  Type: "Active-Remote", content: "1983426473"
         0.000827994|RTSP thread 1: RTSP Response:
         0.000208998|  Type: "CSeq", content: "5"
         0.000141999|  Type: "Server", content: "AirTunes/105.1"
         0.003446974|RTSP thread 1 received an RTSP Packet of type "TEARDOWN":
         0.000258998|  Type: "CSeq", content: "6"
         0.000151998|  Type: "Session", content: "1"
         0.000372998|  Type: "User-Agent", content: "TuneBlade/1.0 (Windows)"
         0.000167998|  Type: "Client-Instance", content: "56B6CB929BB20486"
         0.000144999|  Type: "DACP-ID", content: "13EE26B2E036204E"
         0.000139999|  Type: "Active-Remote", content: "1983426473"
         0.000158999|Connection 1: TEARDOWN
         0.000162999|player_thread cancel...
         0.000399996|player_thread join...

A RECORD message comes in, basically the start of a session. It takes 1.3 seconds for the external command to complete, and then a flurry of messages -- probably setting the volume -- are followed by a TEARDOWN, all in a few milliseconds. I'll have to think a bit about that...

mikebrady commented 6 years ago

When a Pi hangs up again, could you run top or htop to see if any core/cpu is stuck at 100% please?

gibman commented 6 years ago

I'll do asap.

one thing Ive noticed is that killing the process by: pkill -f shairport-sync

does not infact kill it. it still remains.

doing a sudo reboot also takes forever.

Might be because the process is hanging ?

gibman commented 6 years ago

btw .the TB issue is even more simple to repro. than the one I originally described.

1) start TB 2) wait until it says connection - standby 3) close TB 4) start TB

now it says connecting .. it does not come any further.

gibman commented 6 years ago

nothing to see regarding cpu usage. it only spikes during connection it seems.

mikebrady commented 6 years ago

Thanks for those posts. I have noticed how hard it is to kill it. Once I know its Process ID I can use:

kill -9 <pid>

and it normally works.

mikebrady commented 6 years ago

I can't reproduce the problem on my rig, so let me check your settings, please.

  1. Do you automatically connect to the devices on startup?
  2. And do you automatically reconnect on error?
gibman commented 6 years ago

yes I have..

start at login autoconnect to receivers when discovered force reconnection on connection error

also in the airplay receivers section I have the checkbox 'autoconnect' enabled for each pi.

gibman commented 6 years ago

I tried to disable those above settings and then manually within TB hit the playback/stop button in turn. Effectively this disconnects and reconnects on demand. Ive tried now 20times or so manically turning the connection on off and it does not produce the error.

I even tried and restart TB with these new settings, and the issue does not seem to appear. Also tried to forcefully kill TB process and then restarting TB, still no sign of the issue.

food for thought huh ?

Is it time for wireshark ? I guess this needs to be installed on the TB host, as the traffic is not multicast ? I cant capture it on some random laptop here.

I'll let the pi's + TB set for a while with these options off and then see how it goes. Not having those options would have the sideeffect of me having to babysit the TB control panel and then manually connecting the PI's that have caused some hickup.

mikebrady commented 6 years ago

Thanks again for all your efforts. At my end, I've found another error and pushed a fix to it. Basically, if SPS was flushing a sequence of audio packets and found a spurious "unready" packet in the sequence while doing so (possibly caused by a missing packet), it would get into an infinite loop that it would not cancel out of. You'd just never know if that condition is occurring in your situation, so I'm not too hopeful. But, it's good housekeeping to fix it. Whenever you got a chance to update to it, it would be great.

gibman commented 6 years ago

Tried new version.

same :(

kitchen.shairport-sync.log

mikebrady commented 6 years ago

Thanks. This is my fault for not updating the version number, but, are you certain the log is from the latest version? If it is, it’s interesting!

mikebrady commented 6 years ago

Actually, I can see that it is the latest version, thanks.

gibman commented 6 years ago

Here's hopefully a more interesting log with log level set to 3.

kitchen-level3-shairport-sync.log

gibman commented 6 years ago

This one looks suspicious: Nov 19 15:43:54 kitchen shairport-sync[20934]: 0.000133998|Join audio thread.

It seems stuck here, unable to release the audio thread.

The log above contains a lot of initial successful connect-disconnect sessions. Maybe about 20 good ones before one bad appeared.

/gibman

mikebrady commented 6 years ago

Great, thanks, I'll have a look. I'm getting ready to push something with a bit more diagnostics shortly. No big developments at this end, though.

mikebrady commented 6 years ago

Very interesting log thanks. BTW, the new code yesterday for handling "unready" frames is being used!