karlheyes / icecast-kh

KH branch of icecast
GNU General Public License v2.0

[BUG] False "relay shutdown request" shutting down all relays #72

Closed niko closed 10 years ago

niko commented 10 years ago

This is basically reopening #67, but with updated information after having fixed a broken configuration.

Icecast version is 24-g3cb456b.

This usually happens at high listener counts (7000+): the relay server logs "relay shutdown request" entries although the list of stations to relay, as returned by the master, looks all right:

$ grep 'relay shutdown request' /export/logs/icecast-error.log | awk -F ':' '{print $1":"$2}' | uniq -c
      1 [2013-12-12  07:14
      1 [2013-12-12  07:46
      1 [2013-12-12  07:54
      1 [2013-12-12  08:17
      1 [2013-12-12  09:06
      1 [2013-12-12  09:10
      1 [2013-12-12  09:12
      1 [2013-12-12  09:40
      1 [2013-12-12  09:52
      1 [2013-12-12  10:04
      1 [2013-12-12  10:26
      1 [2013-12-12  10:27
      1 [2013-12-12  10:45
      1 [2013-12-12  11:31
      1 [2013-12-12  11:32
      1 [2013-12-12  11:34
      1 [2013-12-12  11:40
      1 [2013-12-12  11:49
      1 [2013-12-12  12:02
      1 [2013-12-12  12:13
      1 [2013-12-12  12:16
      1 [2013-12-12  12:27
      1 [2013-12-12  12:32
      1 [2013-12-12  12:44
      1 [2013-12-12  12:49
      1 [2013-12-12  13:02
      1 [2013-12-12  13:06
      1 [2013-12-12  13:16
      1 [2013-12-12  13:35
      1 [2013-12-12  13:39
      1 [2013-12-12  13:40
      1 [2013-12-12  14:00
      1 [2013-12-12  14:07
    440 [2013-12-12  14:18
   1362 [2013-12-12  14:24
   1021 [2013-12-12  14:34
   1361 [2013-12-12  14:42
      1 [2013-12-12  14:43
      1 [2013-12-12  15:02
      1 [2013-12-12  15:07
      1 [2013-12-12  15:09

The first of the bunch of "shutdown request" log entries was at 14:18:55.

At the same time we log the listener count on the relay and the station count on the master server (S: station count, L: listener count, DS: station-count download time, DL: listener-count download time):

2013-12-12_14:17:15 S:1362 L:7546 DS:0 DL:1
2013-12-12_14:17:35 S:1362 L:7555 DS:0 DL:0
2013-12-12_14:17:55 S:1362 L:7533 DS:0 DL:0
2013-12-12_14:18:16 S:1362 L:7531 DS:0 DL:0
2013-12-12_14:18:36 S:1362 L:7546 DS:0 DL:0
2013-12-12_14:18:57 S:1362 L:4489 DS:0 DL:1
2013-12-12_14:19:19 S:1362 L:4580 DS:2 DL:0
2013-12-12_14:19:39 S:1362 L:4600 DS:0 DL:0
2013-12-12_14:20:00 S:1362 L:4651 DS:0 DL:1
2013-12-12_14:20:21 S:1362 L:5244 DS:0 DL:1
2013-12-12_14:20:42 S:1362 L:5428 DS:0 DL:1
2013-12-12_14:21:04 S:1362 L:5541 DS:2 DL:0
2013-12-12_14:21:24 S:1362 L:5604 DS:0 DL:0
2013-12-12_14:21:45 S:1362 L:5621 DS:0 DL:1
2013-12-12_14:22:06 S:1362 L:5684 DS:0 DL:1
2013-12-12_14:22:26 S:1362 L:5728 DS:0 DL:0
2013-12-12_14:22:46 S:1362 L:5739 DS:0 DL:0
2013-12-12_14:23:07 S:1362 L:5767 DS:1 DL:0
2013-12-12_14:23:27 S:1362 L:5804 DS:0 DL:0
2013-12-12_14:23:48 S:1362 L:5832 DS:0 DL:1
2013-12-12_14:24:09 S:1362 L:5905 DS:0 DL:1
2013-12-12_14:24:29 S:1362 L:5944 DS:0 DL:0
2013-12-12_14:24:49 S:1362 L:5975 DS:0 DL:0
2013-12-12_14:25:09 S:1362 L:0 DS:0 DL:0
2013-12-12_14:25:30 S:1362 L:0 DS:0 DL:1
2013-12-12_14:25:50 S:1362 L:0 DS:0 DL:0
2013-12-12_14:26:13 S:1362 L:273 DS:0 DL:3
2013-12-12_14:26:34 S:1362 L:1449 DS:0 DL:1
2013-12-12_14:26:54 S:1362 L:1856 DS:0 DL:0
2013-12-12_14:27:14 S:1362 L:2099 DS:0 DL:0
2013-12-12_14:27:37 S:1362 L:2259 DS:1 DL:2
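A drop like the one at 14:18:57 (L falling from ~7500 to ~4500 within one sample) is easy to flag automatically. As a sketch (the 20% threshold is my own choice, not part of the logger):

```shell
#!/bin/bash
# Flag samples where the listener count (L: field) falls by more than 20%
# compared to the previous sample. Input format matches the monitoring log:
#   TIMESTAMP S:<stations> L:<listeners> DS:<secs> DL:<secs>
drops=$(awk '{
    split($3, a, ":"); l = a[2]            # current listener count
    if (prev > 0 && l < prev * 0.8)        # more than 20% below previous
        print "DROP at " $1 ": " prev " -> " l
    prev = l
}' <<'EOF'
2013-12-12_14:18:36 S:1362 L:7546 DS:0 DL:0
2013-12-12_14:18:57 S:1362 L:4489 DS:0 DL:1
2013-12-12_14:19:19 S:1362 L:4580 DS:2 DL:0
EOF
)
echo "$drops"
```

Run against the full log, this prints one line per sudden drop, e.g. `DROP at 2013-12-12_14:18:57: 7546 -> 4489`.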

So although the master server appears to have returned a complete list of relay stations, the relay is dropping them.

Also, once the relay is in this state, stations are dropped again and again until it is restarted.

[graph: listeners]

In this graph you can see the first drop of stations at 14:18, then another two drops until 14:44.

I made three core dumps (one at 14:39 and two at 14:44), the first while the server seemed to be recovering, the last two while all stations were gone. @karlheyes: I'll drop you an email with the download location.

Niko.

niko commented 10 years ago

This is the error log on the master at the same time, just grepping for "listener count on /STATIONNAME":

[2013-12-12  14:18:57] INFO source/source_read listener count on /STATIONNAME now 1
[2013-12-12  14:20:00] INFO source/source_read listener count on /STATIONNAME now 2
[2013-12-12  14:24:58] INFO source/source_read listener count on /STATIONNAME now 1
[2013-12-12  14:26:15] INFO source/source_read listener count on /STATIONNAME now 2
[2013-12-12  14:34:58] INFO source/source_read listener count on /STATIONNAME now 1
[2013-12-12  14:35:59] INFO source/source_read listener count on /STATIONNAME now 2
[2013-12-12  14:42:57] INFO source/source_read listener count on /STATIONNAME now 1
[2013-12-12  14:44:04] INFO source/source_read listener count on /STATIONNAME now 2

It seems obvious that the source has always been available.

niko commented 10 years ago

This is the log the graph is based on:

1386854221,7554,-,X
1386854281,7537,-,X
1386854341,4533,-,X
1386854401,5068,-,X
1386854461,5526,-,X
1386854521,5666,-,X
1386854581,5770,-,X
1386854641,5857,-,X
1386854701,0,-,X
1386854761,88,-,X
1386854821,1953,-,X
1386854881,2427,-,X
1386854942,2703,-,X
1386855001,2930,-,X
1386855061,3225,-,X
1386855121,3371,-,X
1386855181,3453,-,X
1386855241,3627,-,X
1386855301,628,-,X
1386855361,-,-,X
1386855421,2191,-,X
1386855481,2627,-,X
1386855541,2789,-,X
1386855601,3052,-,X
1386855661,3253,-,X
1386855721,3461,-,X
1386855781,0,-,X
1386855841,31,-,X
1386855901,1252,0,X
1386855961,1442,451,X
1386856021,1528,567,X
1386856081,1655,772,X
1386856141,1756,933,X
1386856201,1849,1007,X
1386856261,1898,1073,X
1386856321,1950,1114,X

The drop at 14:36 didn't affect all listeners as the graph suggests; curl just failed to download the listener count. '-' means a curl failure, 'X' means the XSL file is missing in Icecast. Our second server wasn't running at the time, and the third simply doesn't have the listeners XSL file installed. But interestingly, the main server didn't respond in a manner curl likes. This is the, admittedly crude, logging script:

#!/bin/bash

TIMESTAMP=$(date '+%s')

# -s silences the progress meter; -f makes curl exit non-zero on HTTP
# errors so the '-' fallback kicks in instead of logging an error page
L1=$(curl -sf http://stream1/listeners.total.xsl || echo '-')
if [[ $L1 = *XSLT* ]]; then L1='X'; fi

L2=$(curl -sf http://stream2/listeners.total.xsl || echo '-')
if [[ $L2 = *XSLT* ]]; then L2='X'; fi

L3=$(curl -sf http://stream3/listeners.total.xsl || echo '-')
if [[ $L3 = *XSLT* ]]; then L3='X'; fi

echo "$TIMESTAMP,$L1,$L2,$L3"

Niko.

karlheyes commented 10 years ago

you could try a more recent build as well; there has been a relay update since then. if you do, then get the latest, as it has a fix for applying client details on restarting relays.

karl.

niko commented 10 years ago

Will try the new version tomorrow. Niko.

niko commented 10 years ago

After a quiet day, the server with the new Icecast version just lost all its stations. In the INFO logs it looks like this:

[2013-12-15  01:33:35] INFO slave/relay_reset servers to be retried on /maximum
[2013-12-15  01:33:35] INFO slave/relay_read fallback on /maximum attempted
[2013-12-15  01:33:35] INFO source/source_shutdown Source "/maximum" exiting
[2013-12-15  01:33:35] INFO source/source_set_fallback No fallback on /maximum
[2013-12-15  01:33:35] INFO slave/relay_read shutting down relay /maximum
[2013-12-15  01:33:35] INFO source/source_free_source source /maximum to be freed
[2013-12-15  01:33:35] INFO source/_free_source freeing source "/maximum"
[2013-12-15  01:34:12] INFO slave/start_relay_stream Starting relayed source at mountpoint "/maximum"
[2013-12-15  01:34:12] INFO slave/open_relay_connection using username relay for /maximum
[2013-12-15  01:34:12] INFO slave/open_relay_connection connecting to 10.0.0.100:8080 for /maximum
[2013-12-15  01:34:12] INFO format-mp3/format_mp3_get_plugin Created format details for /maximum
[2013-12-15  01:34:12] INFO format-mp3/mpeg_apply_client icy metadata format expected on /maximum, interval 16000
[2013-12-15  01:34:12] INFO source/source_init Source /maximum initialised
[2013-12-15  01:34:12] INFO source/source_apply_mount Applying mount information for "/maximum" from "/*"
[2013-12-15  01:34:12] INFO slave/start_relay_stream source /maximum is ready to start
[2013-12-15  01:34:12] INFO source/source_read listener count on /maximum now 0
[2013-12-15  01:34:12] INFO source/source_set_override no override for /maximum set
[2013-12-15  01:34:12] INFO format-mp3/parse_icy_metadata incoming title for /maximum Celsius911 - Sake (Promo) 2011

Two other relays (both running the "old" Icecast version) did not lose any stations. And again, dropping all stations reoccurred twice. I then restarted Icecast.

[graph: listeners]

@karlheyes: core.29497 is at the usual location.

Niko.

karlheyes commented 10 years ago

I still need the icecast binary and the log leading up to the case; whether log level 3 is enough remains to be seen.

karl.

niko commented 10 years ago

Sorry, I forgot about the binary. The binary is at /icecast-relay and the log at /icecast-error.log.gz.

Niko.

karlheyes commented 10 years ago

The slave error log is not reporting anything out of the ordinary; evidently level 3 is not high enough in that regard. If level 4 is not possible, then we would need to isolate the issue further: maybe look at the access log on the master for the /admin/streams or /admin/streamlist entries to see how much was returned in comparison to what was there before, or maybe run a duplicate icecast with no listeners to see if the effect is duplicated. I assume the core is from gcore at some point after the fact, in which case it's of little help. I still suspect the master is in a mid-update cycle.
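One way to do that access-log comparison (a sketch only; the combined-log field layout is an assumption, and the sample lines below are made up for illustration, not taken from the real master):

```shell
#!/bin/bash
# Print timestamp and response size for each /admin/streams(list) request
# in a master access log, so a sudden shrink in the returned list stands
# out. Assumes NCSA combined format: field 4 = timestamp, field 7 = path,
# field 10 = bytes sent. Inline sample data stands in for the real log.
sizes=$(awk '$7 ~ /^\/admin\/(streams|streamlist)/ { print $4, $10 }' <<'EOF'
10.0.0.2 - relay [12/Dec/2013:14:17:10 +0000] "GET /admin/streams HTTP/1.0" 200 48213 "-" "Icecast"
10.0.0.2 - relay [12/Dec/2013:14:18:10 +0000] "GET /admin/streams HTTP/1.0" 200 48213 "-" "Icecast"
10.0.0.2 - relay [12/Dec/2013:14:18:55 +0000] "GET /admin/streams HTTP/1.0" 200 9120 "-" "Icecast"
EOF
)
echo "$sizes"
```

A response that is suddenly a fraction of its usual size (here 9120 vs 48213 bytes) would indicate the master returned a truncated station list at that moment.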

karl.

karlheyes commented 10 years ago

After more investigation, this looks to be because the master feeds were showing certain mountpoints missing, and the slave was reacting correctly to that. The actual problem stems from stats disappearing from the master in a drop/recreate procedure when streams started or stopped. The entries that went missing were those that are listed when there is no stream but a fallback is present.

The recent patch uses a timestamp that does not apply to normal sources but is set to the current time for these fallback-only mounts. The timestamp is refreshed on subsequent rechecks if the entry is found to be still required, and only after those rechecks have expired are the stats purged. This should prevent the drop/recreate which is getting detected by the slave.
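The refresh/expiry behaviour described above can be sketched roughly like this (pure illustration in shell, not the actual C code; the function names, clock values, and expiry window are all made up):

```shell
#!/bin/bash
# Illustration of the described purge logic: a fallback-only mount gets a
# timestamp when its stats entry is created; each recheck refreshes the
# timestamp if the mount is still required, and the entry is only purged
# once the timestamp has expired, avoiding the drop/recreate cycle.
declare -A stamp          # mount -> last-refresh time (fallback-only mounts)
expiry=60                 # hypothetical expiry window in seconds

touch_mount() { stamp[$1]=$now; }      # entry created, timestamp set
recheck() {                            # $1 = mount, $2 = still required?
    if [ "$2" = yes ]; then
        stamp[$1]=$now                 # still required: refresh timestamp
    elif (( now - ${stamp[$1]:-0} > expiry )); then
        unset "stamp[$1]"              # expired: purge stats entry
        echo "purged $1"
    fi                                 # not expired yet: keep the entry
}

now=1000
touch_mount /maximum
touch_mount /other
now=1030
recheck /maximum yes      # still required: kept, timestamp now 1030
recheck /other no         # only 30s old, within expiry: kept for now
now=1200
recheck /other no         # 200s old: expired, entry purged
```

The point is that a brief stream stop/start no longer removes the stats entry immediately, so the slave never sees the mountpoint vanish from the master's feed.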

If you can confirm this then we'll close this report.

karl.

niko commented 10 years ago

We've been running the latest version for 2 days now. I'm cautiously optimistic. For my part, I'll consider this issue closed if the slaves don't "lose" arbitrary streams (for whatever reason) for another 5 days.

Thanks for your efforts on this nasty edge case. I suspect that a large number of "virtual" mountpoints without sources but only fallbacks isn't the everyday use case. So kudos for catching this one.

All the best, Niko.

karlheyes commented 10 years ago

shall I assume this is no longer showing as a problem now?

karl.

niko commented 10 years ago

Yes, we haven't had any occurrences of this since the last update. Thanks! :)