unioslo / tsdfx

File transfer utility

Map reload can cause daemon to die if file systems are unavailable #90

Closed petterreinholdtsen closed 8 years ago

petterreinholdtsen commented 8 years ago

When reloading tsdfx while reloading autofs on a machine where most file systems are provided by autofs, the tsdfx process can die because some of the directories mentioned in the map file are missing. This is unfortunate, and was discovered in production.

I am not quite sure what should be done in such a situation, but trying again later or dropping the map entry might be better than dying.
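As a rough illustration of the "drop the map entry" option, the sketch below checks each entry's directories and marks unavailable entries inactive instead of failing the whole load. The struct and function names (`map_entry`, `is_directory`, `check_entries`) are hypothetical and are not taken from the tsdfx sources; only `stat()` and `S_ISDIR` are real POSIX interfaces.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical map entry for illustration; not tsdfx's actual structure. */
struct map_entry {
	const char *src;	/* source directory */
	const char *dst;	/* destination directory */
	bool active;		/* false if the entry is currently skipped */
};

/* Return true if path exists and is a directory. */
bool
is_directory(const char *path)
{
	struct stat st;

	return (stat(path, &st) == 0 && S_ISDIR(st.st_mode));
}

/*
 * Mark each entry active or inactive depending on whether both of its
 * directories are present, and return the number of inactive entries.
 * The caller can log the skipped entries and check them again later
 * instead of exiting.
 */
size_t
check_entries(struct map_entry *entries, size_t n)
{
	size_t skipped = 0;

	for (size_t i = 0; i < n; i++) {
		entries[i].active = is_directory(entries[i].src) &&
		    is_directory(entries[i].dst);
		if (!entries[i].active)
			skipped++;
	}
	return (skipped);
}
```

Note that `stat()` on an unresponsive NFS or autofs mount can block, so a real implementation would probably need to run such checks outside the main loop.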

dag-erling commented 8 years ago

Isn't this a duplicate of #86?

petterreinholdtsen commented 8 years ago

Nope. There is no segfault involved, just tsdfx noticing a mapping referring to non-existent directories and exiting. Perhaps such an error should be less fatal during reload?

dag-erling commented 8 years ago

But where does it exit? The current code path is:

Nowhere does a failure to access the source directory of a mapping cause tsdfx to exit.

petterreinholdtsen commented 8 years ago

Here is an example from the log where tsdfx failed to restart. I was unable to find the logs where tsdfx reload caused it to exit (probably rotated away), but a similar problem happened when we only restarted autofs. Now we stop tsdfx, restart autofs, and start tsdfx after autofs is started.

Mar 14 01:30:02 tsd-fx02 tsdfx[84120]: tsdfx.c:94 tsdfx_exit() tsdfx stopping
Mar 14 01:31:03 tsd-fx02 automount[102853]: do_reconnect: lookup(ldap): failed to find available server
Mar 14 01:31:33 tsd-fx02 tsdfx[102901]: tsdfx.c:76 tsdfx_init() tsdfx starting
Mar 14 01:31:33 tsd-fx02 tsdfx[102901]: map.c:271 tsdfx_map_reload() loading /opt/tsd/etc/tsdfx.map
Mar 14 01:46:33 tsd-fx02 tsdfx[102901]: map.c:113 map_new() /opt/tsd/etc/tsdfx.map:295: invalid destination path
Mar 14 03:00:59 tsd-fx02 kernel: mount: server nfs-server not responding, timed out
Mar 14 03:07:20 tsd-fx02 [109558]: CFEngine(agent) Unable to create link '/cluster' -> '/net/nfs-server/shared/cluster', no source
Mar 14 03:07:20 tsd-fx02 [109558]: CFEngine(agent) Method 'tsd_project_mounts' failed in some repairs
Mar 14 03:07:26 tsd-fx02 [109558]: CFEngine(agent) Method 'maintenance' failed in some repairs
Mar 14 03:32:53 tsd-fx02 [111877]: CFEngine(agent) Unable to create link '/cluster' -> '/net/nfs-server/shared/cluster', no source
Mar 14 03:32:53 tsd-fx02 [111877]: CFEngine(agent) Method 'tsd_project_mounts' failed in some repairs
Mar 14 03:32:59 tsd-fx02 [111877]: CFEngine(agent) Method 'maintenance' failed in some repairs
Mar 14 03:37:37 tsd-fx02 [112042]: CFEngine(agent) Unable to create link '/cluster' -> '/net/nfs-server/shared/cluster', no source
Mar 14 03:37:37 tsd-fx02 [112042]: CFEngine(agent) Method 'tsd_project_mounts' failed in some repairs
Mar 14 03:37:42 tsd-fx02 [112042]: CFEngine(agent) Method 'maintenance' failed in some repairs
Mar 14 04:52:53 tsd-fx02 [117243]: CFEngine(agent) Unable to create link '/cluster' -> '/net/nfs-server/shared/cluster', no source
Mar 14 04:52:53 tsd-fx02 [117243]: CFEngine(agent) Method 'tsd_project_mounts' failed in some repairs
Mar 14 04:52:58 tsd-fx02 [117243]: CFEngine(agent) Method 'maintenance' failed in some repairs
Mar 14 08:57:47 tsd-fx02 tsdfx[135847]: tsdfx.c:76 tsdfx_init() tsdfx starting
Mar 14 08:57:47 tsd-fx02 tsdfx[135847]: map.c:271 tsdfx_map_reload() loading /opt/tsd/etc/tsdfx.map
Mar 14 08:59:45 tsd-fx02 tsdfx[135987]: tsdfx.c:76 tsdfx_init() tsdfx starting
Mar 14 08:59:45 tsd-fx02 tsdfx[135987]: map.c:271 tsdfx_map_reload() loading /opt/tsd/etc/tsdfx.map
Mar 14 08:59:45 tsd-fx02 tsdfx[135987]: map.c:113 map_new() /opt/tsd/etc/tsdfx.map:57: invalid destination path

Happy hacking Petter Reinholdtsen

dag-erling commented 8 years ago

So we're talking about a restart, not reload. Failure to load the map file at startup is (intentionally) fatal, cf. bin/tsdfx/main.c:161.

petterreinholdtsen commented 8 years ago

[Dag-Erling Smørgrav]

> So we're talking about a restart, not reload.

I suspect you misunderstand the issue. The initial problem was triggered without any reload or restart of tsdfx, as it would exit when autofs was restarted while tsdfx was running. But as I said, those logs are gone now, so I am providing the logs for a similar crash that still happens, but this time when restarting tsdfx.

Happy hacking Petter Reinholdtsen

dag-erling commented 8 years ago

> I suspect you misunderstand the issue. The initial problem was triggered without any reload or restart of tsdfx, as it would exit when autofs was restarted while tsdfx was running.

This directly contradicts the title and your initial comment.

> But as I said, those logs are gone now, so I am providing the logs for a similar crash that still happens, but this time when restarting tsdfx.

The logs you provided do not show a crash. They show tsdfx refusing to start due to an invalid configuration. All known bugs which might justify periodically restarting tsdfx have been fixed (#72, #76, #81).

petterreinholdtsen commented 8 years ago

[Dag-Erling Smørgrav]

> The logs you provided do not show a crash. They show tsdfx refusing to start due to an invalid configuration. All known bugs which might justify periodically restarting tsdfx have been fixed (#72, #76, #81).

We are not talking about a crash, but the death of the tsdfx daemon. As far as I could tell, the tsdfx daemon would sometimes exit when autofs was restarted, and reading the log made me believe it was because some directories were temporarily missing. This problem caused tsdfx to stop working several days in a row after we started running 'service autofs restart' from cron, and we worked around the issue by running 'service tsdfx stop; service autofs restart; service tsdfx start'. While this took place, we started monitoring the tsdfx daemon using Zabbix to discover quickly when it died.

I suspect we do not really want tsdfx to die when one of its sources or destinations is temporarily missing, only when it is permanently missing. The question at hand is how we want that situation to be handled, if it should be handled differently from today, and whether there is a better way to handle it than dying.
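One possible way to tell "temporarily missing" from "permanently missing" is to count consecutive failed availability checks and only give up after a threshold. The sketch below is an assumption about how that could look, not a description of tsdfx's behavior; `entry_health`, `entry_update`, and the state names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical states for a map entry; not taken from tsdfx. */
enum entry_state { ENTRY_OK, ENTRY_RETRY, ENTRY_FAILED };

struct entry_health {
	unsigned int misses;		/* consecutive failed checks so far */
	unsigned int max_misses;	/* misses tolerated before giving up */
};

/*
 * Record the result of one availability check.  A successful check
 * resets the counter.  A failed check yields ENTRY_RETRY until the
 * counter reaches max_misses, after which it yields ENTRY_FAILED.
 */
enum entry_state
entry_update(struct entry_health *h, bool available)
{
	if (available) {
		h->misses = 0;
		return (ENTRY_OK);
	}
	if (h->misses < h->max_misses)
		h->misses++;
	return (h->misses >= h->max_misses ? ENTRY_FAILED : ENTRY_RETRY);
}
```

With `max_misses` set to 3 and one check per scan interval, a directory that disappears briefly during an autofs restart would be retried, while one that stays missing for three checks in a row would be reported as failed.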

I am not sure if tsdfx died on its own without any signals or when it was reloading, but I do know it died several nights in a row after we started restarting autofs.

I'm unable to understand what you are arguing for, or why you closed this issue. Can you explain a bit more? It is perfectly fine with me if tsdfx is supposed to die when autofs is restarted, but we should make that a conscious choice and document it if that is the case. It was a surprise to me when we ran into the problem, and I hope there is a better way to handle it.

Happy hacking Petter Reinholdtsen

dag-erling commented 8 years ago

The issue of tsdfx occasionally dying after reloading the map file was fixed in #72.

The “dead scan task” issue which was worked around by periodically restarting tsdfx was fixed in #76 and #81. You should stop periodically restarting tsdfx, as doing so interferes with proper scheduling of scan and copy tasks and will result in CPU and I/O spikes.

Your claim that tsdfx exits on reload if a directory listed in the map file does not exist is false. You admitted to not having any evidence, beyond vague recollection, of it actually occurring. If you do come across tangible evidence, please open a new issue with a relevant title and description.

petterreinholdtsen commented 8 years ago

Right. Then I understand your view on your fellow developers and this issue, and it makes me sad.

Anyway, I've reverted the workaround on the tsdfx server, making it only restart autofs and not tsdfx at night, to trigger the problem again and allow us to collect the log entries I saw earlier.

Happy hacking Petter Reinholdtsen

petterreinholdtsen commented 8 years ago

[Petter Reinholdtsen]

> Anyway, I've reverted the workaround on the tsdfx server, making it only restart autofs and not tsdfx at night, to trigger the problem again and allow us to collect the log entries I saw earlier.

The server died at 01:30 on Saturday. The log is too long and contains too many details about project file names for me to post it here. The last 1.5 hours before the crash are stored in 20160409-tsdfx-crash-log.txt on the tsdfx server.

I've reinstated the workaround (i.e. tsdfx stop; autofs restart; tsdfx start) to avoid the crash until we figure out what is going on.

Happy hacking Petter Reinholdtsen