onedr0p / exportarr

AIO Prometheus Exporter for Sabnzbd, Bazarr, Prowlarr, Lidarr, Readarr, Radarr, and Sonarr
MIT License

Instability causes service to die #49

Closed SpangleLabs closed 1 year ago

SpangleLabs commented 2 years ago


What steps did you take and what happened:

I have a couple of exportarr containers set up alongside sonarr and radarr in a docker compose file. Sometimes radarr or sonarr fail a request and return an empty response, and exportarr just logs a fatal error and goes down. Radarr/sonarr recover, but exportarr stays dead. Trying to reload the whole lot with docker-compose down and docker-compose up -d doesn't restart the exportarr containers.

What did you expect to happen:

It should log an error, maybe wait a moment, and then try again on the next scrape. Or Docker should restart the container when it fails, but that then means you need a watcher and all sorts.

Anything else you would like to add:

This is the log line in question; the fatal level means it shuts down the exportarr service, which does not restart.

Additional Information:

It seems a bit silly for it to report a radarr_system_status/sonarr_system_status metric if exportarr dies at the first sign of trouble.

onedr0p commented 2 years ago

Can you paste the error message exportarr gives when it dies?

What versions of sonarr and radarr are you running?

How often are you scraping the metrics with Prometheus?

These would help debug this issue.

SpangleLabs commented 2 years ago

I ended up having to tear down my containers and networks and purge the lot to get it restarted, but I have a snippet of the logs I screenshotted before I nuked things:

time="2022-02-11T21:18:41Z" level=info msg="Sending HTTP request to http://192.168.1.142:8989/api/v3/system/status"
time="2022-02-11T21:18:41Z" level=info msg="Sending HTTP request to http://192.168.1.142:8989/api/v3/health"
time="2022-02-11T21:19:41Z" level=info msg="Sending HTTP request to http://192.168.1.142:8989/api/v3/rootfolder"
time="2022-02-11T21:19:41Z" level=info msg="Sending HTTP request to http://192.168.1.142:8989/api/v3/health"
time="2022-02-11T21:19:41Z" level=info msg="Sending HTTP request to http://192.168.1.142:8989/api/v3/series"
time="2022-02-11T21:19:41Z" level=info msg="Sending HTTP request to http://192.168.1.142:8989/api/v3/queue?page=1"
time="2022-02-11T21:19:41Z" level=info msg="Sending HTTP request to http://192.168.1.142:8989/api/v3/system/status"
time="2022-02-11T21:19:41Z" level=info msg="Sending HTTP request to http://192.168.1.142:8989/api/v3/history"
time="2022-02-11T21:20:28Z" level=info msg="Sending HTTP request to http://192.168.1.142:8989/api/v3/queue?page=2"
time="2022-02-11T21:20:32Z" level=fatal msg="An error has occurred during retrieving statistics Get \"http://192.168.1.142:8989/api/v3/series\": EOF"

I'm not sure what version of exportarr was running before I nuked everything and pulled new images. I did have to change my docker compose command from ["exportarr", "sonarr"] to just ["sonarr"], though. I suspect it was whatever :latest image I pulled about 7 weeks ago, some time in mid to late December. Sorry I can't give better numbers on that bit.

Sonarr is version 3.0.6.1342 and Radarr is 4.0.4.5922 (though both of these may have changed when I nuked my setup and restarted it; probably not significantly).

Prometheus metrics are being scraped once a minute.

Apologies for not having great version numbers! This happened once or twice before this latest incident too.

I can see in my chat logs that I noted it happening on December 20th, with this error:

time="2021-12-20T02:22:10Z" level=fatal msg="An error has occurred during retrieving statistics HTTP statuscode 500"

And I had to tear it all down then also.

If it pops up again, I'll certainly get you more accurate version numbers. For now I'm running whatever :latest image I pulled today.

onedr0p commented 2 years ago

I would suggest you pin the docker image to the most recent version and not use :latest.
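
For example, in your compose file, something like this instead of :latest (using whatever the current release tag is at the time):

  sonarr_exporter:
    image: ghcr.io/onedr0p/exportarr:v1.0.0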

SpangleLabs commented 2 years ago

True, that is usually wiser. I think I decided not to because it wasn't at v1 yet, but it makes sense to pin to v1.0.0 now.

SpangleLabs commented 2 years ago

I've not seen this failure again btw. Thanks for closing the issue

SpangleLabs commented 2 years ago

I'm using version v1.1.0 now, and lately we've had some home network issues and I've had exactly the same situation again.

I get the same log line as before. It's logged as fatal and kills the exportarr application, but the Docker container remains running. Attempting to stop the container does nothing; attempting to kill it does nothing. I can force remove it, but then my docker-compose network complains and cannot be removed, necessitating a full restart of the Docker service.

This happens to both my exportarr instances, for radarr and sonarr, which are both on the linuxserver.io docker images, updated this weekend. My Docker version is 20.10.3 build 48d30b5, my docker-compose version is 1.17.1, and the server is running Ubuntu 18.04.6 LTS, headless.

It's incredibly frustrating when this happens, as it means I have to restart my whole Docker service and everything else running, all because of a single failed network request. So again I find myself wondering why it attempts no retry and just exits the application at the first sign of trouble. And what use is the system status metric if exportarr has this behaviour?

onedr0p commented 2 years ago

It seems like this might be a hard problem to debug since I have not seen anyone else report these issues. I would suggest having Prometheus scrape the metrics once every 5-10 minutes, because depending on the size of your libraries it could take more than 1 minute to finish gathering metrics.
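
For example, something like this in your Prometheus scrape config (the job name and target are placeholders for your setup):

scrape_configs:
  - job_name: "exportarr-sonarr"
    scrape_interval: 5m
    scrape_timeout: 2m
    static_configs:
      - targets: ["192.168.8.142:9707"]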

SpangleLabs commented 2 years ago

Yeah, it's a tricky one! I hadn't spotted any similar tickets, and it only happens when our systems are going through some issues. Unfortunately for us, we seem to have had a lot of network issues: our router's DNS seems to fail once a week, and whenever sonarr is moving files to the external drive, it moves them so fast as to clog up the CPU with iowait. I think the latter is what might be causing issues here, as they're on the same machine and Docker network. I can try increasing the Prometheus scrape interval.

I still don't entirely get the logic of "one dropped request => kill the application" though. Am I misunderstanding the code there? Does the sonarr/radarr status metric indicate anything more than the exporter being online?

SpangleLabs commented 2 years ago

Hmm, I've attempted to reproduce this issue using docker network disconnect, and I haven't seen the same behaviour, and I'm not certain why.

I created a docker-compose file like so:

---
version: "3.2"
services:
  sonarr:
    image: lscr.io/linuxserver/sonarr
    container_name: sonarr
    environment:
      - PUID=1005
      - PGID=1004
      - TZ=Europe/London
    volumes:
      - /var/lib/Sonarr:/config
      - /mnt/md0/samba:/mnt/md0/samba
      - /mnt/ext18tb:/mnt/ext18tb
    ports:
      - 8989:8989
    restart: unless-stopped
  sonarr_exporter:
    image: ghcr.io/onedr0p/exportarr:v1.1.0
    command: ["sonarr"]
    environment:
      PORT: 9707
      URL: "http://192.168.8.142:8989"
      APIKEY: "{redacted}"
    ports:
      - "9707:9707"
    restart: unless-stopped

I bring the whole thing up with docker-compose up -d, and once it's running, I disconnect the sonarr container from the network with docker network disconnect servarr_default sonarr. Waiting a moment and watching the logs from the exportarr container with docker logs servarr_sonarr_exporter_1 -f, I see the telltale fatal log lines, but then it continues?

time="2022-09-15T10:48:41Z" level=info msg="Sending HTTP request to http://192.168.8.142:8989/api/v3/series"
time="2022-09-15T10:48:41Z" level=fatal msg="An error has occurred during retrieving statistics Get \"http://192.168.8.142:8989/api/v3/ser
ies\": dial tcp 192.168.8.142:8989: connect: connection refused"
time="2022-09-15T10:48:45Z" level=info msg="Listening on 0.0.0.0:9707"
time="2022-09-15T10:49:41Z" level=info msg="Sending HTTP request to http://192.168.8.142:8989/api/v3/queue?page=1"
time="2022-09-15T10:49:41Z" level=info msg="Sending HTTP request to http://192.168.8.142:8989/api/v3/history"
time="2022-09-15T10:49:41Z" level=info msg="Sending HTTP request to http://192.168.8.142:8989/api/v3/rootfolder"
time="2022-09-15T10:49:41Z" level=info msg="Sending HTTP request to http://192.168.8.142:8989/api/v3/system/status"
time="2022-09-15T10:49:41Z" level=info msg="Sending HTTP request to http://192.168.8.142:8989/api/v3/health"
time="2022-09-15T10:49:41Z" level=fatal msg="An error has occurred during retrieving statistics Get \"http://192.168.8.142:8989/api/v3/hea
lth\": dial tcp 192.168.8.142:8989: connect: connection refused"
time="2022-09-15T10:49:45Z" level=info msg="Listening on 0.0.0.0:9707"
time="2022-09-15T10:50:41Z" level=info msg="Sending HTTP request to http://192.168.8.142:8989/api/v3/series"
time="2022-09-15T10:50:41Z" level=info msg="Sending HTTP request to http://192.168.8.142:8989/api/v3/queue?page=1"
time="2022-09-15T10:50:41Z" level=info msg="Sending HTTP request to http://192.168.8.142:8989/api/v3/history"
time="2022-09-15T10:50:41Z" level=info msg="Sending HTTP request to http://192.168.8.142:8989/api/v3/rootfolder"

I thought the fatal log lines were the issue, but the fact that it then logs "Listening on 0.0.0.0:9707" suggests it's restarting right away after that. If I then reconnect sonarr to the network, with docker network connect servarr_default sonarr, it goes right back to working fine.

I'm not sure why it seems to send a bunch of HTTP requests before one fails? Maybe I'm misunderstanding that, or maybe it sends them all in parallel and one fails first and kills it before the others have a chance to fail.

I notice that this network error is connection refused rather than EOF, though; I'm guessing that's the cause of the difference in behaviour? I wonder whether it's something like not being able to close the connection correctly, or not timing out, and so the process hangs rather than exiting and restarting.
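
(For context on the hanging theory: in Go, an http.Client with no Timeout set will wait indefinitely on a connection that is accepted but never answered, whereas connection refused fails immediately. Setting a timeout on the client is the usual guard; this is just an illustration, not exportarr's actual client code.)

package main

import (
    "fmt"
    "net/http"
    "time"
)

func main() {
    // With no Timeout set, http.Client can block forever if the server accepts
    // the TCP connection but never sends a response; "connection refused", by
    // contrast, fails immediately.
    httpClient := &http.Client{Timeout: 10 * time.Second}

    resp, err := httpClient.Get("http://192.168.8.142:8989/api/v3/health")
    if err != nil {
        fmt.Println("request failed:", err) // timeout, connection refused, EOF, ...
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}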

rtrox commented 1 year ago

These log lines are written before the HTTP requests are sent:

time="2022-09-15T10:49:41Z" level=info msg="Sending HTTP request to http://192.168.8.142:8989/api/v3/health"

The way Prometheus works, each Collector runs on its own thread and communicates with the HTTP worker thread over a channel. You'll notice that each of these 5 requests is from a different collector (the sonarr, queue, history, and rootfolder collectors, respectively): exportarr is making them in parallel. So the likelihood here is that none of the requests are succeeding. The reason you only get one error back is the log.Fatal call: it triggers the process to exit immediately, so the other requests never get to complete. This happens every time Prometheus tries to scrape exportarr, which in your latest logs is once every 60 seconds.
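
To make that concrete, here's a simplified sketch of the pattern (illustrative only, not exportarr's actual code; the names and metric are made up). A Collect method that calls log.Fatal on a failed request takes down the whole process, so every other collector's in-flight request dies with it:

package collector

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    log "github.com/sirupsen/logrus"
)

// seriesCollector is a simplified stand-in for one of exportarr's collectors.
// Prometheus calls Collect on every registered collector concurrently during
// a scrape, which is why several "Sending HTTP request" lines appear at once.
type seriesCollector struct {
    url        string
    seriesDesc *prometheus.Desc
}

func newSeriesCollector(url string) *seriesCollector {
    return &seriesCollector{
        url:        url,
        seriesDesc: prometheus.NewDesc("sonarr_series_total", "Total number of series", nil, nil),
    }
}

func (c *seriesCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.seriesDesc
}

func (c *seriesCollector) Collect(ch chan<- prometheus.Metric) {
    log.Infof("Sending HTTP request to %s/api/v3/series", c.url)
    resp, err := http.Get(c.url + "/api/v3/series")
    if err != nil {
        // log.Fatalf logs the message and then calls os.Exit(1), so the
        // entire process dies here and the other collectors' in-flight
        // requests never get to report their own errors.
        log.Fatalf("An error has occurred during retrieving statistics %v", err)
    }
    defer resp.Body.Close()

    // ... parse the response and emit the real metrics here ...
    ch <- prometheus.MustNewConstMetric(c.seriesDesc, prometheus.GaugeValue, 0)
}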

The two linked PRs remove all log.Fatal() calls from the collectors, so when there are network issues the collector will return 500s, but the process won't restart. They also add retries when an error or 500 comes back from the *arr instance. Between the two, I think we should be good once this lands and makes it into a release.
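
The retry side is roughly this shape (an illustrative sketch with made-up names, not the actual PR code):

package client

import (
    "fmt"
    "net/http"
    "time"
)

// fetchWithRetry retries a GET when the transport errors out or the *arr
// instance answers with a 5xx, instead of treating the first failure as fatal.
func fetchWithRetry(httpClient *http.Client, url string, attempts int) (*http.Response, error) {
    var lastErr error
    for i := 0; i < attempts; i++ {
        resp, err := httpClient.Get(url)
        if err == nil && resp.StatusCode < 500 {
            return resp, nil // success, or a non-5xx status the caller can handle
        }
        if err != nil {
            lastErr = err
        } else {
            resp.Body.Close()
            lastErr = fmt.Errorf("HTTP status %d from %s", resp.StatusCode, url)
        }
        // simple linear backoff between attempts
        time.Sleep(time.Duration(i+1) * time.Second)
    }
    return nil, lastErr
}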

rtrox commented 1 year ago

Whoops, sorry, didn't mean to close this >.<