ngosang / restic-exporter

Prometheus exporter for the Restic backup system
MIT License
79 stars 17 forks source link

The restic command seems to hang #33

Closed zoechi closed 1 week ago

zoechi commented 2 weeks ago

I back to Hetzner using the sftp.command (#31). So for restic-exporter to work I mount the storage to a directory.

journalctl -xfu prometheus-restic-exporter

shows this with no further output

Sep 06 08:01:01 myserver systemd[1]: Started prometheus-restic-exporter.service.
░░ Subject: A start job for unit prometheus-restic-exporter.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit prometheus-restic-exporter.service has finished successfully.
░░
░░ The job identifier is 65000.
Sep 06 08:01:01 myserver prometheus-restic-exporter-start[273523]: 2024-09-06 08:01:01 INFO     Starting Restic Prometheus Exporter
Sep 06 08:01:01 myserver prometheus-restic-exporter-start[273523]: 2024-09-06 08:01:01 INFO     It could take a while if the repository is remote
ps aux | grep restic

shows

root      273754  2.3  1.1 4296735584 278484 ?   Sl   08:01   0:14 /nix/store/7mzcrjbpsp8nm2i103lfbhkglwcva4mg-restic-0.17.0/bin/restic -r /mnt/restic_storagebox/backups/myserver -p /run/credentials/prometheus-restic-exporter.service/RESTIC_PASSWORD_FILE --no-lock stats --json 74fbf530ad47c94aebd884fa06a99f520290225e2142bc5853731c5733c14563

When I execute the command manually after about 15sec I get

{"total_size":249796868116,"total_file_count":1103048,"snapshots_count":1}

Any idea why the command might be stuck when run from the exporter and why the exporter never starts listening on the configured port?

[Service]
Environment="LISTEN_ADDRESS=127.0.0.1"
Environment="LISTEN_PORT=9753"
User=root
...

Any suggestions about how to debug?

zoechi commented 1 week ago

It took 15 hours from starting the restic stats command until it starts serving on the configured address+port. I looked a bit at the code and it looks like a pure Python subprocess.run issue, but I have no experience with Python. Even when it runs such a long time it seems to complete successfully in the end otherwise an exception would be thrown.

Sep 13 22:07:35 smarthome prometheus-restic-exporter-start[1022]: 2024-09-13 22:07:35 INFO     Starting Restic Prometheus Exporter
Sep 13 22:07:35 smarthome prometheus-restic-exporter-start[1022]: 2024-09-13 22:07:35 INFO     It could take a while if the repository is remote
Sep 14 13:07:27 smarthome prometheus-restic-exporter-start[1022]: 2024-09-14 13:07:27 INFO     Serving at http://0.0.0.0:9753
Sep 14 13:07:27 smarthome prometheus-restic-exporter-start[1022]: 2024-09-14 13:07:27 INFO     Refreshing stats every 3600 seconds

Perhaps some more log output about what it does could help in case it does more that I think (multiple calls). From what I observed it's always the same process and not one call per snapshot or whatever. The hash is always the same when I check the process list.

zoechi commented 1 week ago

I think now that this is a SystemD issue which limits resources for services. I'm going to close but will update when I figure out how to fix this.

zoechi commented 6 days ago

I found https://github.com/restic/restic/issues/4591
I ensured the RESTIC_CACHE_DIR env var is set to the same directory as when restic backups are executed. When I run the command manually, the cache directory's changed timestamp is updated, when it's run from the exporter in the systemd service it is not updated. So it might be the missing cache dir but it's not clear yet why.

zoechi commented 6 days ago

The cause was that the cache directory was read only.
This was caused by NixOS setting ProtectSystem=strict that is set for all Prometheus exporter's systemd services.