Too long restart on rpi2, 3

a-x- commented 6 years ago

Approx. 1 minute restart via systemctl

Happened between versions 1.1.1-23-gd65b8e8 (I'm not sure) and 3.1.7

mikebrady commented 6 years ago

Sounds good, thanks.

davidhq commented 6 years ago

checked version 2.8.6, noticed the same problem on current (now old) network configuration
I upgraded the network, added Unifi UAP-AC-LR access point to cover half of the house
The same problem didn't appear (yet), except what I sent you last time
I also added Edimax EW-7612UAN(V2) - N300 Wifi USB Adptr + Antenna to one of RPis, network was constantly performing at around 90 Mbits/sec

Things seemed rather ok but not really, yesterday I experienced this: https://gist.github.com/davidhq/35b1431337e40aa72ba7df1509fa419b

And today again on two separate RPis, "lab" with Edimax adapter and "ela" without built-in wifi..

https://gist.github.com/davidhq/c64dd56dc156a0dc3a32d628040e042e

Jun 11 20:00:22 lab shairport-sync[461]: Shome mhistake shurely: very large number of frames to drop: 225793 -- setting it to 132300.

https://gist.github.com/davidhq/abf6c7e021686029023d3c627b25bb88

(still hasn't terminated the thread as I'm writing this report, but sending log anyway).

Maybe something can be seen from this?

I'm not helping with this just for my personal setup, I want to test as many edge cases as possible and help make shairport-sync more robust... The reason is that I'm making some helper software around it and a couple of friends are already using it so that's why I am and plan to be quite dedicated to this.

I hope there is not some mistake from my part / setup but I'll figure this out as well.

For now I think I cannot upgrade the network further, except to add this adapter to the other RPis that are not immediately next to the AP... They are on the way from Amazon. I'm also new to configuring this (second wlan).

This for now. If you think it's better to send you further reports to email, please write to david.krmpotic@gmail.com, otherwise here is ok as well.

davidhq commented 6 years ago

PS: regarding observing the statistics inside shairport-sync, I'll try that soon...

mikebrady commented 6 years ago

Thanks for your help, and it's great to push the application hard. The "shome mhistake shurely" messages are often indicative of a challenging network environment. The output from the statistics option would be really useful at this point to give an indication of the proportions of resends and so on.

mikebrady commented 6 years ago

Hi David. Your logs do indicate that there are problems remaining with terminating a play session reliably under some circumstances. Sadly, I can't reproduce the conditions that cause them. So, two things: one, I think enough progress has been made to move the fixes into a Release Candidate and, two, it seems to me that we should be moving to a slightly different system for terminating a player thread. It will take some time -- time I won't have for another couple of weeks. So, if you could show us some output with statistics enabled, that would be great, and meantime, as I get time, I'll start transitioning to a new thread termination system.

davidhq commented 6 years ago

Hello !

Here are the logs with stats:

Kitchen: https://gist.github.com/davidhq/6eb2a516ae70929eb797921a178eca72

music stuttered about 1 minute before I saved the log..

Not sure if this message is indicative of the stuttering: "waiting for a mutex, maximum expected time of 30000 microseconds exceeded "player.c:802". debug_mutex_lock at "player.c:802" expected max wait: 0.030000000, actual wait: 0.030508996 sec."

Lab log: https://gist.github.com/davidhq/6a84833f08778af0a78a90751d861cbb

For lab I now do suspect some strange intermittent issues... In the kitchen it should not happen... I moved the Raspberry Pi in the lab on top of the speaker; before, it was in front, so that the speaker was between the RPi and the access point. Not sure if the magnetic field from the speaker magnet can do something to the signal... I will have to learn more about this... or I simply have a blind spot at this location. There is another RPi in the same room which doesn't seem to drop speed to 5MB/s like this "lab" one does. So this seems to be a separate issue for me to figure out.

Regarding stuttering in general, I also tried setting audio_backend_buffer_desired_length_in_seconds = 0.25 from default 0.15 for kitchen, didn't help... maybe the CPU is fast enough and something happens on the wifi or with processing of the wifi data.

Please let me know what is visible in the logs and thank you

mikebrady commented 6 years ago

Thanks for these. I don't have time to study them in detail yet -- it'll have to be mid-week next week. However, the Kitchen log is showing statistics that show that the sender's clock is way off -- look at all the adjustments of around -800 ppm. This is way outside what would be normal now. I've seen Macs needing adjustment of 130 ppm, but I've only ever heard of this before with some old Dell laptops. The other thing is that somehow lots and lots of UDP packets are being lost -- it seems that almost 10% of UDP packets are being lost. For example, at the end of the log, it seems that 57,008 resend requests were made for a total number of packets of 674,016. That is extremely high. To eliminate the possibility that it's a faulty source, can you try another, e.g. an iPhone or an iTunes machine?
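As a quick check of the proportion mentioned above, the two figures quoted from the end of the Kitchen log can be divided directly. This is only a minimal sketch: the function name is illustrative, and treating one resend request as roughly one lost packet is an assumption, since a request may be made for a packet that later turns up anyway.

```python
def resend_ratio(resend_requests: int, total_packets: int) -> float:
    """Return resend requests as a fraction of total packets."""
    if total_packets <= 0:
        raise ValueError("total_packets must be positive")
    return resend_requests / total_packets


# Figures quoted from the end of the Kitchen log in the comment above.
ratio = resend_ratio(57_008, 674_016)
print(f"{ratio:.1%} of packets were accompanied by a resend request")
```

That works out to roughly 8.5%, which is in line with the "almost 10%" estimate.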

davidhq commented 6 years ago

Hmm interesting, however all clocks are in sync via ntp, I also checked now...

I use two instances of forked-daapd: on a beefy server on LAN and another on one of the Raspberries.

I also tried streaming from a Mac:

Jun 16 16:29:07 outside shairport-sync[14542]:        3.0,    -723.1,     723.1,        1003,      0,      0,      0,      0,   6140,  246,  263
Jun 16 16:29:15 outside shairport-sync[14542]:        1.4,   -1084.8,    1084.8,        2006,      0,      0,      0,      0,   6076,  258,  262
Jun 16 16:29:23 outside shairport-sync[14542]:        1.1,    -861.1,     861.1,        3009,      0,      0,      0,      0,   6109,  257,  262
Jun 16 16:29:31 outside shairport-sync[14542]:        1.0,    -914.9,     914.9,        4012,      0,      0,      0,      0,   6079,  257,  262
Jun 16 16:29:39 outside shairport-sync[14542]:        1.0,    -863.9,     863.9,        5015,      0,      0,      0,      0,   6072,  258,  263
Jun 16 16:29:47 outside shairport-sync[14542]:        1.0,    -917.7,     917.7,        6018,      0,      0,      0,      0,   6074,  258,  263
Jun 16 16:29:55 outside shairport-sync[14542]:        1.0,    -878.0,     878.0,        7021,      0,      0,      0,      0,   6074,  257,  262

net correction in ppm is similar to before (too high), but there are no resend requests.

There are a lot of resend requests in all other cases (streaming from both instances of forked-daapd; both are also on a wire).

I checked for UDP packet loss with iperf3, most of the times there was 0% loss or 1-2% sometimes.

hmm indeed :S

davidhq commented 6 years ago

I think I have something:

I installed v3.1.1 (44fbe8b5) on 3 of the devices:

Jun 16 17:09:00 kitchen shairport-sync[812]:        1.2,    -903.5,     903.5,        2006,      0,      0,      0,      0,   6124,  263,  265
Jun 16 17:09:08 kitchen shairport-sync[812]:        0.9,    -770.4,     781.7,        3009,      0,      0,      0,      0,   6127,  262,  265
Jun 16 17:09:16 kitchen shairport-sync[812]:        0.8,    -790.2,     790.2,        4012,      0,      0,      0,      0,   6117,  261,  265
Jun 16 17:09:24 kitchen shairport-sync[812]:        1.1,    -861.1,     861.1,        5015,      0,      0,      0,      0,   6082,  262,  265
Jun 16 17:09:32 kitchen shairport-sync[812]:        1.1,    -926.2,     926.2,        6018,      0,      0,      0,      0,   6107,  263,  265
Jun 16 17:09:40 kitchen shairport-sync[812]:        1.1,    -844.1,     844.1,        7021,      0,      0,      0,      0,   5966,  263,  265
Jun 16 17:09:48 kitchen shairport-sync[812]:        0.9,    -832.7,     832.7,        8024,      0,      0,      0,      0,   6118,  263,  265
Jun 16 17:09:56 kitchen shairport-sync[812]:        1.0,    -793.1,     793.1,        9027,      0,      0,      0,      0,   6094,  262,  265
Jun 16 17:10:04 kitchen shairport-sync[812]:        1.1,    -810.1,     810.1,       10030,      0,      0,      0,      0,   6182,  263,  265

Jun 16 17:08:55 outside shairport-sync[492]:        1.0,    -798.7,     798.7,       36108,      0,      0,      0,      0,   6243,  259,  265
Jun 16 17:09:03 outside shairport-sync[492]:        1.0,    -858.2,     858.2,       37111,      0,      0,      0,      0,   6118,  261,  265
Jun 16 17:09:11 outside shairport-sync[492]:        1.1,    -875.2,     875.2,       38114,      0,      0,      0,      0,   6128,  261,  265
Jun 16 17:09:19 outside shairport-sync[492]:        1.0,    -827.1,     827.1,       39117,      0,      0,      0,      0,   6262,  262,  265
Jun 16 17:09:27 outside shairport-sync[492]:        1.0,    -773.2,     773.2,       40120,      0,      0,      0,      0,   6226,  262,  265
Jun 16 17:09:35 outside shairport-sync[492]:        1.0,    -878.0,     878.0,       41123,      0,      0,      0,      0,   6209,  263,  265
Jun 16 17:09:43 outside shairport-sync[492]:        1.1,    -861.1,     861.1,       42126,      0,      0,      0,      0,   6173,  263,  265
Jun 16 17:09:51 outside shairport-sync[492]:        1.0,    -846.9,     846.9,       43129,      0,      0,      0,      0,   6257,  263,  265
Jun 16 17:09:59 outside shairport-sync[492]:        1.0,    -807.2,     807.2,       44132,      0,      0,      0,      0,   6278,  262,  265
Jun 16 17:10:07 outside shairport-sync[492]:        1.0,    -841.2,     841.2,       45135,      0,      0,      0,      0,   6183,  263,  265
Jun 16 17:10:15 outside shairport-sync[492]:        1.1,    -880.9,     880.9,       46138,      0,      0,      0,      0,   6087,  263,  265
Jun 16 17:10:23 outside shairport-sync[492]:        1.0,    -778.9,     778.9,       47141,      0,      0,      0,      0,   6198,  262,  265
Jun 16 17:10:31 outside shairport-sync[492]:        1.0,    -844.1,     844.1,       48144,      0,      0,      0,      0,   6155,  263,  265
Jun 16 17:10:39 outside shairport-sync[492]:        1.1,    -883.7,     883.7,       49147,      0,      0,      0,      0,   6177,  263,  265
Jun 16 17:10:47 outside shairport-sync[492]:        1.1,    -838.4,     838.4,       50150,      0,      0,      0,      0,   6104,  262,  265

Jun 16 17:09:32 midroom shairport-sync[361]:        0.1,      17.0,    1756.1,       54162,      0,      0,      0,      0,   6001,  263,  265
Jun 16 17:09:40 midroom shairport-sync[361]:        0.1,      -2.8,    1758.9,       55165,      0,      0,      0,      0,   5966,  263,  265
Jun 16 17:09:48 midroom shairport-sync[361]:        0.1,      48.2,    1764.6,       56168,      0,      0,      0,      0,   6095,  263,  265
Jun 16 17:09:56 midroom shairport-sync[361]:        0.1,      28.3,    1761.8,       57171,      0,      0,      0,      0,   6028,  262,  265
Jun 16 17:10:04 midroom shairport-sync[361]:        0.1,     -14.2,    1736.3,       58174,      0,      0,      0,      0,   5962,  253,  265
Jun 16 17:10:12 midroom shairport-sync[361]:       -0.0,      45.3,    1744.8,       59177,      0,      0,      0,      0,   6059,  263,  265
Jun 16 17:10:20 midroom shairport-sync[361]:        0.1,      17.0,    1790.1,       60180,      0,      0,      0,      0,   6018,  262,  265
Jun 16 17:10:28 midroom shairport-sync[361]:        0.0,      39.7,    1727.8,       61183,      0,      0,      0,      0,   6093,  263,  265
Jun 16 17:10:36 midroom shairport-sync[361]:        0.1,       2.8,    1741.9,       62186,      0,      0,      0,      0,   6093,  263,  265
Jun 16 17:10:44 midroom shairport-sync[361]:        0.0,      39.7,    1761.8,       63189,      0,      0,      0,      0,   6036,  262,  265
Jun 16 17:10:52 midroom shairport-sync[361]:       -0.0,      17.0,    1750.4,       64192,      0,      0,      0,      0,   6132,  262,  265
Jun 16 17:11:00 midroom shairport-sync[361]:       -0.0,      42.5,    1747.6,       65195,      0,      0,      0,      0,   5981,  254,  265

There is no more packet loss (resends). I also added another Raspberry in perfect line of sight to the access point (3m from it). "net correction in ppm" is low on this one... so does this really have to do with clock or is wifi behind a wall the reason?
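To compare devices without eyeballing the columns, the statistics rows can be averaged per host. The sketch below is a hedged illustration, not part of shairport-sync: the helper names are hypothetical, and the column order is an assumption (this thread only confirms that the second column is "net correction in ppm"). It uses four rows copied from the logs above.

```python
import re
from statistics import mean

# Assumed column order; only "net correction in ppm" is named in this thread.
COLUMNS = (
    "sync_error", "net_correction_ppm", "corrections_ppm", "total_packets",
    "missing_packets", "late_packets", "too_late_packets", "resend_requests",
    "min_dac_queue", "min_buffer_occupancy", "max_buffer_occupancy",
)

# Matches syslog-style lines such as "Jun 16 17:09:00 kitchen shairport-sync[812]: ..."
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+shairport-sync\[\d+\]:\s+(?P<fields>.+)$"
)


def parse_stats_line(line):
    """Return (host, {column: value}), or None if the line is not a stats row."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    parts = [p.strip() for p in match.group("fields").split(",")]
    if len(parts) != len(COLUMNS):
        return None
    try:
        values = [float(p) for p in parts]
    except ValueError:
        return None
    return match.group("host"), dict(zip(COLUMNS, values))


def mean_net_correction(log_text):
    """Average the net correction (ppm) per host over all stats rows found."""
    per_host = {}
    for line in log_text.splitlines():
        parsed = parse_stats_line(line)
        if parsed:
            host, row = parsed
            per_host.setdefault(host, []).append(row["net_correction_ppm"])
    return {host: mean(vals) for host, vals in per_host.items()}


SAMPLE = """\
Jun 16 17:09:00 kitchen shairport-sync[812]:        1.2,    -903.5,     903.5,        2006,      0,      0,      0,      0,   6124,  263,  265
Jun 16 17:09:08 kitchen shairport-sync[812]:        0.9,    -770.4,     781.7,        3009,      0,      0,      0,      0,   6127,  262,  265
Jun 16 17:09:32 midroom shairport-sync[361]:        0.1,      17.0,    1756.1,       54162,      0,      0,      0,      0,   6001,  263,  265
Jun 16 17:09:40 midroom shairport-sync[361]:        0.1,      -2.8,    1758.9,       55165,      0,      0,      0,      0,   5966,  263,  265
"""

averages = mean_net_correction(SAMPLE)
for host, ppm in averages.items():
    print(f"{host}: {ppm:+.1f} ppm")
```

Non-statistics lines, such as the "Packet reception interval stats" messages, do not have eleven numeric fields and are skipped.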

Latest development version still has lots of resends:

Jun 16 17:10:35 ela shairport-sync[512]:   Type: "Server", content: "AirTunes/105.1"
Jun 16 17:10:38 ela shairport-sync[512]: Packet reception interval stats: mean, standard deviation and max for the last 2,500 packets in microseconds:     7620.9,     7061.5,    21934.0.
Jun 16 17:10:42 ela shairport-sync[512]:        0.8,    -787.4,     810.1,      227681,    415,  25021,    327,  31542,   5293,  117,  267
Jun 16 17:10:50 ela shairport-sync[512]:        0.9,    -781.7,     793.1,      228684,    415,  25309,    333,  31917,   5412,   88,  267
Jun 16 17:10:55 ela shairport-sync[512]: waiting for a mutex, maximum expected time of 30000 microseconds exceeded "player.c:802".
Jun 16 17:10:55 ela shairport-sync[512]: debug_mutex_lock at "player.c:802" expected max wait: 0.030000000, actual wait: 0.030251255 sec.
Jun 16 17:10:56 ela shairport-sync[512]: Packet reception interval stats: mean, standard deviation and max for the last 2,500 packets in microseconds:     7422.9,     8083.3,    23558.0.
Jun 16 17:10:58 ela shairport-sync[512]:        1.1,    -917.7,     917.7,      229687,    415,  25619,    354,  32305,   5447,   71,  267
Jun 16 17:11:06 ela shairport-sync[512]:        1.0,    -844.1,     844.1,      230690,    415,  25856,    354,  32542,   5500,  111,  267
Jun 16 17:11:14 ela shairport-sync[512]:        0.8,    -807.2,     829.9,      231693,    415,  25856,    354,  32542,   5664,  200,  267

Hopefully I didn't overlook something?

mikebrady commented 6 years ago

Thanks for the very interesting update. NTP only ensures the time-of-day is correct. If the system clock is running slow or fast, then the NTP protocol will make appropriate corrections to the time, but it doesn't actually speed up or slow down the clock itself, so I'm afraid that's not really relevant.

Given that the net correction in ppm for the Kitchen device is almost the same irrespective of the source, maybe the problem lies with the clock on the Kitchen device, that it's way out of whack. If that were so, then – unless you were very unlucky – the other Shairport Sync devices shouldn't register net correction in ppm values as large. If they were in the range ±150 ppm from the Mac, I think that would be okay. My recent experience with iOS on an iPhone 6 is that the correction will be less than 30 ppm on average, and will often be zero for long periods.
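To put these ppm figures in perspective, they can be converted into audio frames per second and seconds of drift per day. This is a rough sketch under two assumptions that are not stated in the thread: that the stream runs at 44,100 frames per second, and that the ppm value can be read as parts per million of the playback rate. The function names are illustrative.

```python
SAMPLE_RATE_HZ = 44_100  # assumption: CD-quality rate, not stated in the thread


def frames_per_second(ppm, sample_rate=SAMPLE_RATE_HZ):
    """Frames added (+) or dropped (-) per second of audio at a given ppm offset."""
    return sample_rate * ppm / 1_000_000


def seconds_per_day(ppm):
    """Clock drift in seconds accumulated over 24 hours at a given ppm offset."""
    return 86_400 * ppm / 1_000_000


for ppm in (-800, 150, 30):
    print(
        f"{ppm:+5d} ppm: {frames_per_second(ppm):+7.2f} frames/s, "
        f"{seconds_per_day(ppm):+7.2f} s/day"
    )
```

On those assumptions, -800 ppm corresponds to about 35 frames per second, or a little over a minute per day, against about 1.3 frames per second at 30 ppm.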

mikebrady commented 6 years ago

Our posts crossed. The midroom figures look more-or-less normal. The kitchen, outside and ela devices all have very high net correction figures and very high resend request levels.

It would be interesting indeed to put the latest version on the midroom and to put the version running on midroom on one of the other devices and see what happens. (It's very easy to get confused doing this -- believe me :))

It's hard to believe that you could have two Raspberry Pis with clocks that are so far out of whack. So, do you have something else running on them that might be loading up the USB or Ethernet subsystems? (It used to be that Pis would drop packets and do other undesirable things if their USB ports were heavily loaded -- it seems the USB ports and the Ethernet port, and maybe WiFi, share the same output bus.)

mikebrady commented 6 years ago

I think I have something:

I installed v3.1.1 (44fbe8b) on 3 of the devices:

Jun 16 17:09:00 kitchen shairport-sync[812]:        1.2,    -903.5,     903.5,        2006,      0,      0,      0,      0,   6124,  263,  265
Jun 16 17:09:08 kitchen shairport-sync[812]:        0.9,    -770.4,     781.7,        3009,      0,      0,      0,      0,   6127,  262,  265
Jun 16 17:09:16 kitchen shairport-sync[812]:        0.8,    -790.2,     790.2,        4012,      0,      0,      0,      0,   6117,  261,  265
Jun 16 17:09:24 kitchen shairport-sync[812]:        1.1,    -861.1,     861.1,        5015,      0,      0,      0,      0,   6082,  262,  265
Jun 16 17:09:32 kitchen shairport-sync[812]:        1.1,    -926.2,     926.2,        6018,      0,      0,      0,      0,   6107,  263,  265

What source were you using in this case?

davidhq commented 6 years ago

Actually midroom is on low values only when playing to default jack... I didn't take that into consideration before, so:

Jun 16 17:45:40 midroom shairport-sync[366]:   period_size = 256 frames (precisely).
Jun 16 17:45:40 midroom shairport-sync[366]:   buffer_time = 743038 us (>).
Jun 16 17:45:40 midroom shairport-sync[366]:   buffer_size = 32768 frames (>).
Jun 16 17:45:40 midroom shairport-sync[366]:   periods_per_buffer = 128 (precisely).
Jun 16 17:45:49 midroom shairport-sync[366]:       -2.9,     420.1,     677.4,        1003,      0,      0,      0,      0,   5862,  258,  264
Jun 16 17:45:57 midroom shairport-sync[366]:       -0.1,     104.8,    1685.3,        2006,      0,      0,      0,      0,   6084,  261,  264
Jun 16 17:46:05 midroom shairport-sync[366]:        0.0,      42.5,    1713.6,        3009,      0,      0,      0,      0,   6025,  262,  264
Jun 16 17:46:13 midroom shairport-sync[366]:        0.1,      22.7,    1722.1,        4012,      0,      0,      0,      0,   6070,  261,  264

when

defaults.ctl.card 0
defaults.pcm.card 0

and

Jun 16 18:00:30 midroom shairport-sync[352]:   periods_per_buffer = 341315 (>).
Jun 16 18:00:39 midroom shairport-sync[352]:        2.3,    -653.6,     653.6,        1003,      0,      0,      0,      0,   6267,  263,  264
Jun 16 18:00:47 midroom shairport-sync[352]:        1.2,    -900.7,     900.7,        2006,      0,      0,      0,      0,   6206,  262,  264
Jun 16 18:00:55 midroom shairport-sync[352]:        1.0,    -790.2,     790.2,        3009,      0,      0,      0,      0,   6175,  261,  264
Jun 16 18:01:03 midroom shairport-sync[352]:        1.1,    -863.9,     863.9,        4012,      0,      0,      0,      0,   6037,  252,  264
Jun 16 18:01:11 midroom shairport-sync[352]:        1.0,    -835.6,     835.6,        5015,      0,      0,      0,      0,   6124,  262,  264
Jun 16 18:01:27 midroom shairport-sync[352]:        1.1,    -827.1,     827.1,        6018,      0,      0,      0,      0,   6271,  262,  264
Jun 16 18:01:35 midroom shairport-sync[352]:        1.0,    -810.1,     810.1,        7021,      0,      0,      0,      0,   6281,  262,  264
Jun 16 18:01:43 midroom shairport-sync[352]:        1.1,    -810.1,     810.1,        8024,      0,      0,      0,      0,   6182,  261,  264

when

defaults.ctl.card 1
defaults.pcm.card 1

What really is different on 3.1.1 is no resends...

I couldn't read your two posts in detail right now (in hurry), I just managed to test what I'm reporting here... does it change anything?

davidhq commented 6 years ago

I always use forked-daapd

$ /usr/sbin/forked-daapd -v
Forked Media Server: Version 26.0
Copyright (C) 2009-2015 Julien BLACHE <jb@jblache.org>
Based on mt-daapd, Copyright (C) 2003-2007 Ron Pedde <ron@pedde.com>
Released under the GNU General Public License version 2 or later

davidhq commented 6 years ago

USB soundcard in all cases is:

https://www.aliexpress.com/item/Ugreen-External-USB-Audio-Sound-Card-Mic-Adapter-Speaker-3-5mm-Jack-Stereo-Audio-Cable-Headset/32802432756.html

Components:
Realtek ALC4040

http://www.wpgholdings.com/yosung/news_detail/zhtw/program/21388

ALC4040 Series
• Tensilica USB Audio Core inside
• Digital-to-Analog Converter with 100dBA SNR
• Analog-to-Digital Converter with 94dBA SNR
• Stereo digital microphone and analog microphone inputs
• Power management and enhanced power saving
• Single digital power supply from 1.6v to 3.6v.
• Small Package : QFN48 6mmx6mm, CSP28  4mmx3.5mm

(I broke one of them open), probably not relevant, but still

mikebrady commented 6 years ago

Thanks for all this. It is possible to turn off resend requests in the development version. There is a diagnostics section in the configuration file and you can turn it off there. It may be that the resend requester is too aggressive.

mikebrady commented 6 years ago

Right, so it looks like the USB DACs might be the cause of the very large level of net correction. That's a (possible) discovery. If you could show some stats of a system that has been generating lots of resend requests but now with them turned off, I wonder what fraction of packets are actually missed in the end... If it turns out to be small, then I could lower the action of the resend requester...
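The fraction in question could be worked out from any statistics row along these lines. This is a hedged sketch: the function name is hypothetical, and which counter sits in which column is an assumption. The example figures are taken from the last "ela" row quoted earlier in the thread, which was captured with resend requests still on, so it only illustrates the calculation.

```python
def loss_summary(total, missing, late, too_late, resend_requests):
    """Express each counter as a fraction of the total packet count."""
    if total <= 0:
        raise ValueError("total must be positive")
    return {
        "missing": missing / total,
        "late": late / total,
        "too_late": too_late / total,
        "resend_requests": resend_requests / total,
    }


# Assumed column meanings; values from the 17:11:14 "ela" row quoted earlier.
summary = loss_summary(
    total=231_693, missing=415, late=25_856, too_late=354, resend_requests=32_542
)
for name, fraction in summary.items():
    print(f"{name:>16}: {fraction:.2%}")
```

If the column assumption holds, only about 0.18% of packets ended up missing in that row, while resend requests ran at about 14% of the packet count.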

davidhq commented 6 years ago

ok, later today

davidhq commented 6 years ago

https://gist.github.com/davidhq/fadbadd64b93dad31b3c8d820a4d95c4

Two sessions, first disable_resend_requests: no, then yes (~40min)

mikebrady commented 6 years ago

Thanks David -- I can only see one session, the one with disable set to "yes"...

davidhq commented 6 years ago

Sorry: https://gist.github.com/davidhq/18dd87d006bcec62f5eb6567e12f53ad

Here are previous couple....

davidhq commented 6 years ago

Except that I think other than this 12 min session this device was testing 3.1.1 :) so I think there is one useful session from this only.. do you need more? I will set it to 'no' now and play on it...

mikebrady commented 6 years ago

Yes please -- it is fascinating. It does look like the resend requesting is too aggressive, but it would be great to get a picture of it with requesting back on, if it’s not too inconvenient.

davidhq commented 6 years ago

Yes, here is another 28min session in addition to 12min above: https://gist.github.com/davidhq/69ec73f7a1d319223bcb66ad89d6c91d

davidhq commented 6 years ago

Just let me know if I should do the same for other devices except 'lab' and for how long

mikebrady commented 6 years ago

Thanks again. I have two hard days' work to get through before I can get back to this, so perhaps we can pause until then. I’ll be thinking about the very interesting evidence in the meantime.

davidhq commented 6 years ago

SURE! Enjoy the next days, until soon... I will now disable resend requests on all devices and let you know if there were any problems on the outside with this on the latest development version.

davidhq commented 6 years ago

Hello Mike, reporting my findings from the last week...

Not sure if it's good news or not so much, but version 3.1.1 works almost without issues. There is no "stuttering", and so far there are no thread stop / spawn issues (I think it only happened once but I'm not sure anymore, so if it happens it's very rare, not like 3.2* where it was a regular occurrence).

There was one other problem, which occurred only twice; it seems unrelated to the two issues above, and forked-daapd could be the source of it. What happened was that songs were rewinding, restarting and skipping in a strange way - I thought my girlfriend was switching songs, and she thought I was :) So either ghosts, Russian hackers, or some other third issue which for now doesn't matter but which I will investigate later.

mikebrady commented 6 years ago

Thanks for the update, David. I have just pushed a development update with a more gentle resend request algorithm. At your leisure, I'd be grateful if you'd try it out. It's interesting that 3.1.1 works well for you, but AFAIK it would have the same (or worse) thread stop logic in it. Thus, I'd be anxious to wring the bugs out of the development branch.

davidhq commented 6 years ago

Hello Mike, I will test this week... so since last report there was not one thread stop problem with 3.1.1. So if the code is different, it's definitely not worse in my case... maybe it's just a lucky coincidence that constellation of code in 3.1.1 doesn't cause issues even if it's messy.

Regarding the other problem with dropping sound: also not a single occurrence since last report 6 days ago either.

I have a question: are resend requests meant to maintain perfect reproduction when packets are lost? And is the right amount determined by trial and error? I personally would rather not have this, since 3.1.1 without resending works great. Could it be that this feature is overdoing it a little? I'm willing to test, as I said, but at least according to my current understanding this is not something I'd use later when I'm not testing... it adds complexity and potential problems with no obvious benefit. I'm sure I don't understand it correctly, but I'm speaking from practical experience.

Until soon, thank you

github-actions[bot] commented 3 years ago

This issue has been inactive for 60 days so will be closed 7 days from now. To prevent this, please remove the "stale" label or post a comment.

mikebrady / shairport-sync

Too long restart on rpi2, 3 #653