karlheyes / icecast-kh

KH branch of icecast
GNU General Public License v2.0

[BUG] False "relay shutdown request" shutting down all relays #67

Closed: niko closed this issue 10 years ago

niko commented 10 years ago

We're experiencing an issue where a slave would repeatedly stop relaying all or most of the mountpoints and log "relay shutdown request" for each one, thus losing all or almost all listeners. This would then repeat every few minutes until the slave is restarted.

In the logs this looks like this:

[…]
[2013-11-01  08:34:22] INFO slave/update_relays relay shutdown request on "/stonerock"
[…]
$ grep -h 'relay shutdown request on' /export/logs/icecast-error.log.201311* | awk -F ':' '{print $1":"$2}' | sort | uniq -c
[…]
      1 [2013-11-03  11:23
      1 [2013-11-03  11:35
      1 [2013-11-03  11:40
      1 [2013-11-03  12:28
      1 [2013-11-03  12:30
      1 [2013-11-03  13:17
    706 [2013-11-03  13:26
      1 [2013-11-03  13:28
    784 [2013-11-03  13:29
      1 [2013-11-03  13:31
   1270 [2013-11-03  13:33
    734 [2013-11-03  13:37
   1271 [2013-11-03  13:41
   1165 [2013-11-03  13:47
      1 [2013-11-03  14:09
   1172 [2013-11-03  14:18
      1 [2013-11-03  14:23
    887 [2013-11-03  14:40
    618 [2013-11-03  14:44
      1 [2013-11-03  14:49
      1 [2013-11-03  14:53
      1 [2013-11-03  14:56
    652 [2013-11-03  14:58
      1 [2013-11-03  15:04
    153 [2013-11-03  15:16
    788 [2013-11-03  15:22
   1271 [2013-11-03  15:34
      1 [2013-11-03  16:02
      1 [2013-11-03  16:04
    435 [2013-11-03  16:06
   1271 [2013-11-03  16:14
      1 [2013-11-03  16:15
   1104 [2013-11-03  16:22
   1270 [2013-11-03  16:44
      1 [2013-11-03  16:46
   1191 [2013-11-03  16:54
   1268 [2013-11-03  16:58
   1270 [2013-11-03  17:06
   1165 [2013-11-03  17:32
    690 [2013-11-03  17:46
      1 [2013-11-03  17:47
      1 [2013-11-03  18:03
      1 [2013-11-03  18:06
      1 [2013-11-03  18:10
      1 [2013-11-03  18:11
      1 [2013-11-03  18:20
   1172 [2013-11-03  18:30
   1214 [2013-11-03  18:34
      1 [2013-11-03  18:43
      1 [2013-11-03  18:46
      1 [2013-11-03  18:59
    706 [2013-11-03  19:00
      1 [2013-11-03  19:09
      1 [2013-11-03  19:23
   1271 [2013-11-03  19:34
      1 [2013-11-03  19:43
      1 [2013-11-03  19:48
    961 [2013-11-03  19:50
   1268 [2013-11-03  20:00
   1271 [2013-11-03  20:16
      1 [2013-11-03  20:22
      1 [2013-11-03  20:25
      1 [2013-11-03  20:32
      1 [2013-11-03  20:48
      1 [2013-11-03  20:51
[…]

I restarted the slave at 20:17.

We have another slave relaying the same master, although at the moment there are no listeners on it, as it acts as a hot standby. It doesn't show the same pattern in the log.

I upped the loglevel to DEBUG yesterday. On loglevel INFO, the shutdown request notice is the only thing showing up.
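
For reference, the verbosity is set in the <logging> section of icecast.xml; a minimal sketch, using the error log path that appears earlier in this thread (4 = DEBUG):

<logging>
    <errorlog>/export/logs/icecast-error.log</errorlog>
    <!-- 1 = ERROR, 2 = WARN, 3 = INFO, 4 = DEBUG -->
    <loglevel>4</loglevel>
</logging>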

I don't know if the issue relates to our master setup, so I'll describe it just in case. In our master (called "collector") we have configured an explicit fallback chain for each of the 1,200-odd mountpoints. This way we're collecting stations from several backend servers on one central master, which is then relayed by several slaves.

                           -----------------------collector----------------------          -----slave1----
                    expl.  |  /station1      /station2      /station3           |  impl.   |  /station1  |
-----backend1-----  relay  |       |f.            |f.            |              |  relay   |  /station2  |
|/station1       |-------->|  /station1.backend1  |              |f             |--------->|  /station3  |
|/station2       |-------->|       |         /station2.backend1  |a             |          ---------------
------------------         |       |f             |              |l             |
                    expl.  |       |a             |f.            |l             |          -----slave2----
-----backend2-----  relay  |       |l             |              |b.            |  impl.   |  /station1  |
|/station2       |-------->|       |l        /station2.backend2  |              |  relay   |  /station2  |
|/station3       |-------->|       |b                       /station3.backend2  |--------->|  /station3  |
------------------         |       |a                            |              |          ---------------
                    expl.  |       |c                            |f.            |
-----backend3-----  relay  |       |k                            |              |
|/station3       |-------->|       |                        /station3.backend3  |
|/station1       |-------->|  station1.backend3                                 |
------------------         ------------------------------------------------------

In this diagram only the slaves are publicly accessible. In the collector Icecast, all but the plain mountpoints are hidden and not public. This way we have redundant backends and can even hand a station over from one backend to another without interrupting the listener experience.

Regards, Niko.

karlheyes commented 10 years ago

Can you try to match up a selection of relays showing this against what is happening on the master? A streamlist update (or /admin/streams) will appear every so often in the master access log; that should be the trigger for the startup or shutdown of relays. If a particular relay is missing from the list then the slave would trigger a shutdown of that relay.
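
As a quick cross-check, the same list the slave acts on can be pulled by hand; a sketch, assuming the admin credentials (placeholder password) and the MASTER_SERVER:8080 address quoted later in this thread, and that the response lists one mount per line:

$ # fetch the stream list the slaves act on, then count the mounts returned
$ curl -s -u admin:PASSWORD http://MASTER_SERVER:8080/admin/streams | wc -l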

I'm not far off merging the code that allows the slave to use a stats link to get relay updates. Doing so will mean relays are started or shut down immediately, and metadata will be passed for inactive relays. The fallback position would be to use the existing approach.

karl.

niko commented 10 years ago

There's nothing in the master INFO logs. I upped the loglevel to DEBUG, so next time this happens we'll have DEBUG logs.

Niko.

niko commented 10 years ago

I finally got DEBUG logs of the master from when the slave disconnects. It's not much, I'm afraid: just a single line like this for each station:

[2013-11-13  00:15:20] DBUG stats/_free_source_stats delete source node /STATION_NAME

Just grepping for a single station on the MASTER_SERVER log:

[2013-11-13  00:15:13] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" total_mbytes_sent (5308)
[2013-11-13  00:15:13] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" queue_size (100310)
[2013-11-13  00:15:13] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" connected (174033)
[2013-11-13  00:15:19] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" outgoing_kbitrate (251)
[2013-11-13  00:15:19] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" incoming_bitrate (128064)
[2013-11-13  00:15:19] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" total_bytes_read (2784828942)
[2013-11-13  00:15:19] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" total_bytes_sent (5566873600)
[2013-11-13  00:15:19] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" total_mbytes_sent (5308)
[2013-11-13  00:15:19] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" queue_size (85263)
[2013-11-13  00:15:19] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" connected (174039)
[2013-11-13  00:15:20] DBUG stats/_free_source_stats delete source node /STATION_NAME
[2013-11-13  00:15:20] DBUG source/source_recheck_mounts fallback checking /STATION_NAME (fallback has 2)
[2013-11-13  00:15:20] DBUG stats/stats_handle new source stat /STATION_NAME
[2013-11-13  00:15:20] DBUG stats/process_source_stat new node on /STATION_NAME "listenurl" (http://MASTER_SERVER:8080/STATION_NAME)
[2013-11-13  00:15:20] DBUG stats/process_source_stat new node on /STATION_NAME "listeners" (0)
[2013-11-13  00:15:20] DBUG stats/process_source_stat new node on /STATION_NAME "max_listeners" (0)
[2013-11-13  00:15:22] INFO source/source_read listener count on /STATION_NAME.BACKEND1:8000 now 1
[2013-11-13  00:15:22] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" listeners (1)
[2013-11-13  00:15:25] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" outgoing_kbitrate (204)
[2013-11-13  00:15:25] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" incoming_bitrate (128440)
[2013-11-13  00:15:25] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" total_bytes_read (2784929258)
[2013-11-13  00:15:25] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" total_bytes_sent (5567021056)
[2013-11-13  00:15:25] DBUG stats/modify_node_event update "/STATION_NAME.BACKEND1:8000" total_mbytes_sent (5309)

The configuration on the MASTER_SERVER for each station is as follows:

<mount><mount-name>/STATION_NAME</mount-name>
    <public>1</public><hidden>0</hidden>
    <max-listeners>0</max-listeners>
    <fallback-mount>/STATION_NAME.BACKEND1:8000</fallback-mount><fallback-override>1</fallback-override>
</mount>
<mount><mount-name>/STATION_NAME.BACKEND1:8000</mount-name>
    <max-listeners>0</max-listeners>
    <public>0</public><hidden>1</hidden>
    <fallback-mount>/.silence</fallback-mount><fallback-override>1</fallback-override>
</mount>
<relay><mount>/STATION_NAME</mount>
    <server>BACKEND1</server>
    <port>8000</port>
    <local-mount>/STATION_NAME.BACKEND1:8000</local-mount>
</relay>

This looks like a stupid setup because at the moment there's only one backend configured. Some stations already have a live mountpoint defined in the fallback chain, and we're preparing to add another backend for each station. We're aiming for a setup where we can move a station from one backend to another without service interruption for the listeners.
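
The planned second backend would slot into the chain along these lines; a hypothetical sketch following the pattern above (the BACKEND2 entries are not in the current config):

<mount><mount-name>/STATION_NAME.BACKEND1:8000</mount-name>
    <max-listeners>0</max-listeners>
    <public>0</public><hidden>1</hidden>
    <fallback-mount>/STATION_NAME.BACKEND2:8000</fallback-mount><fallback-override>1</fallback-override>
</mount>
<mount><mount-name>/STATION_NAME.BACKEND2:8000</mount-name>
    <max-listeners>0</max-listeners>
    <public>0</public><hidden>1</hidden>
    <fallback-mount>/.silence</fallback-mount><fallback-override>1</fallback-override>
</mount>
<relay><mount>/STATION_NAME</mount>
    <server>BACKEND2</server>
    <port>8000</port>
    <local-mount>/STATION_NAME.BACKEND2:8000</local-mount>
</relay>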

Niko.

karlheyes commented 10 years ago

Are you saying that besides stats, nothing is happening with the mountpoints in question (no stopping or starting, for instance), but the relay is not seeing the stream in the listing?

karl.

niko commented 10 years ago

Exactly. Perhaps it's due to the excessive configuration of the master server. We're talking about 22,000 lines of Icecast config XML, which are regenerated and reloaded once per hour. I stopped the reloading for now, but we still have relay shutdowns. The strange thing is we don't see the relay shutdowns on the second relay server; but that server does not have any listeners, as it's merely a hot standby. I'll redirect some listeners to it and see if the relay shutdowns occur there as well.

Niko.

karlheyes commented 10 years ago

The size of the XML won't be an issue on the listing side of things; the list works by going through the stats on the master, so if there are 4000 streams then the stream listing should contain about that many entries (assuming no hidden ones etc; see http://host:port/admin/streams).

The actual relay startup/shutdown code is due to be altered; it still has a fair bit of legacy code from 2.3 whose assumptions no longer quite hold in the KH tree. Each relay now has a client on a worker, so it can be driven from a normal worker context instead of sporadically from the slave thread. What build is the master?

It's hard to know what is actually going on at the moment, but it may have to do with the reload on the master. If that is triggering a lot of updates and the slaves catch it mid-way, then a slave may think the streams should be down. I suppose it depends on the frequency of the slave updates and the master reload (which was 1 hour, iirc).

karl.

niko commented 10 years ago

It's build 24-g3cb456b.

It's got nothing to do with the reloads. I turned off the cronjob and we're still experiencing these issues.

Niko.

karlheyes commented 10 years ago

The odd thing at the moment is why, on the slave, you would get the shutdown request message but no messages about the source itself, e.g. "shutting down relay /...", or anything about the stream listing attempt, which should occur just before the shutdown requests.

karl.

karlheyes commented 10 years ago

I've finally got around to merging the relay changes I wanted to do. It affects the area you are referring to, but there is a lot of changed code, so first make sure it works in a limited way with whatever setup you have before rolling it out fully.

karl.

niko commented 10 years ago

Nice! Does it change the relay code on the master or on the slave?

Niko.

karlheyes commented 10 years ago

This update is purely on the receiving end, i.e. the slave or whichever server has XML-defined relays. The master feed aspect is not changed.

karl.

niko commented 10 years ago

Hm. Looks like the very same error happened again with the new build. It threw out all the sources twice within 10 minutes:

$ grep source_shutdown icecast-error.log | awk '{print $1" "$2}' | sort | uniq -c
...
      1 [2013-11-21 17:30:12]
      2 [2013-11-21 17:32:09]
      1 [2013-11-21 17:32:39]
    646 [2013-11-21 17:36:45]
    663 [2013-11-21 17:36:46]
    628 [2013-11-21 17:44:45]
    596 [2013-11-21 17:44:46]

karlheyes commented 10 years ago

Can there be a bit more clarity on what is actually happening? I'm assuming this is a list from a master; if so, then the debug log should report the entries being received by the slave. Is it a case of the streams disappearing? If so, then dropping the relays is the expected action. A failed-to-connect type of response, though, should not remove the relays.

karl.

niko commented 10 years ago

I've no idea. The last relay drop on the main production server was 5 days ago. Then I set the configuration to be reloaded every 10 minutes on the master, expecting that it could provoke more frequent dropouts… but that didn't happen. The new build had dropouts on the very same master almost immediately. I'll investigate further tomorrow.

Niko.

niko commented 10 years ago

OK, I got a hint. I started logging the master's response to /admin/streams every 20 seconds, and sometimes the master just responds with an empty or a very short list. This is an aggregation of the log, showing the number of stations each /admin/streams response contained: https://gist.github.com/niko/7682404
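
For the record, the probe is a simple shell loop; a sketch with placeholder credentials and log file name, assuming the response lists one mount per line:

while true; do
    # log a timestamp plus the number of mounts the master reports
    count=$(curl -s -u admin:PASSWORD http://MASTER_SERVER:8080/admin/streams | wc -l)
    echo "$(date '+%Y-%m-%d %H:%M:%S') $count" >> streams-count.log
    sleep 20
done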

So the error is definitely in the master. But there's nothing in the INFO logs of the master (I switched off DEBUG logging last week after having collected 300 GB of logs).

Niko.

karlheyes commented 10 years ago

Which build is running on the master? I'll check the code path in case there is anything obvious.

karl.

niko commented 10 years ago

It's build 24-g3cb456b.

Niko.

karlheyes commented 10 years ago

Are there any significant changes going on on the master at the time these short lists appear: a reload, a new source starting or stopping? Can you let me know how much is being sent (the access log will show it) for the typical listing (and the problem case)? While a small improvement could be made in that routine, I don't see any obvious flaw, but I wonder if it is related to the amount being reported.
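
Those per-request byte counts can be aggregated straight from the access log; a sketch, assuming the log format quoted in the next comment (the size is the tenth field; the log file name is a placeholder):

$ grep '"GET /admin/streams' icecast-access.log | awk '{print $10}' | sort -n | uniq -c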

karl.

niko commented 10 years ago

I can see the short station listings in the access log. This is one with just a single station:

81.95.11.210 - admin [27/Nov/2013:17:04:00 +0100] "GET /admin/streams HTTP/1.1" 200 78 "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3" 0

Usually the response should be something like 46 kB… it's just 78 bytes here. I'm reloading the config at every full 10 minutes, and this request came in at 17:04, between reloads, so reloading does not seem to be the trigger.

Niko.

karlheyes commented 10 years ago

Unless, of course, there is some issue between the update of the streams from the reload and the collection of the stream details for the response. Changing the frequency of the reload, say to every 13 minutes, should throw up a different pattern if that were the case.
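
Icecast re-reads its configuration on SIGHUP, so that experiment is a one-line crontab change; a sketch with a placeholder pidfile path (note that */13 fires at minutes 0, 13, 26, 39 and 52, which is enough to break the 10-minute alignment):

# reload the master config roughly every 13 minutes instead of every 10
*/13 * * * * kill -HUP "$(cat /var/run/icecast.pid)"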

karl.

niko commented 10 years ago

We found out that the real cause of this issue is ultra-high load on the server, induced by the Icecast relay process. I'll close this issue and open a new one describing the load problem. It took us a while to figure out which was the symptom and which was the cause. Sorry for the fuss.

Niko.