Open pkalita-lbl opened 7 months ago
One part of the ingest that I find confusing is the 30-minute period of "silence" (no console output), which occurs towards the end of the ingest process. I propose introducing some "sign of life" indicator for that—something that tells the user "I'm still working." When I encountered that period of silence during today's release, I thought ingest had failed (it hadn't).
The CLI command used to run ingest (on Spin) is:
nmdc-server ingest -vv --function-limit=0 --swap-rancher-secrets
That CLI command is implemented in: /nmdc_server/cli.py#L69
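For illustration, here is a minimal sketch of what a "sign of life" indicator could look like. It is not taken from nmdc-server: the `heartbeat` name, the message format, and the default interval are all hypothetical, and it assumes the silent step releases the GIL often enough for a background thread to run (which is typical of blocking database calls).

```python
import contextlib
import sys
import threading
import time


@contextlib.contextmanager
def heartbeat(label: str, interval_seconds: float = 60.0, stream=sys.stderr):
    """Print a periodic "still working" line while the wrapped block runs.

    Hypothetical helper for illustration; not part of nmdc-server.
    """
    stop = threading.Event()
    started = time.monotonic()

    def beat() -> None:
        # Event.wait() returns False on timeout and True once stop is set,
        # so the loop ends promptly when the wrapped block finishes.
        while not stop.wait(interval_seconds):
            elapsed = int(time.monotonic() - started)
            print(f"[{label}] still working ({elapsed}s elapsed)",
                  file=stream, flush=True)

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        yield
    finally:
        stop.set()
        thread.join()
```

The long-running step would then be wrapped as `with heartbeat("ingest"): ...`, so the console keeps showing output even when the step itself logs nothing.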
Two things that I think we could do to make the ingest process easier to monitor are:
try/except, where a thrown exception triggers a helper function that posts a message to Slack using the Slack API (if a Slack API URL is defined in the environment). Assuming failures result in exceptions being thrown, the person monitoring the ingest process could then ignore the console and keep an eye on (or subscribe to a channel on) Slack instead.
I have experience posting messages to Slack via the Slack API (e.g. the messages sent by "Website health checker" in the #updown Slack channel).
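As a rough sketch of the try/except idea, something like the following could work. It assumes the "Slack API URL" is a Slack incoming-webhook URL, and the `SLACK_WEBHOOK_URL` variable name and all function names are hypothetical. Nothing here reflects how nmdc-server is actually structured.

```python
import json
import os
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build the POST request for a Slack incoming webhook (JSON body with "text")."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify_slack(text: str) -> bool:
    """Post text to Slack if SLACK_WEBHOOK_URL (assumed name) is set.

    Returns True if the message was sent, False if it was skipped or failed.
    """
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
    if not webhook_url:
        return False  # No URL configured: silently skip the notification.
    try:
        request = build_slack_request(webhook_url, text)
        with urllib.request.urlopen(request, timeout=10):
            pass
    except (OSError, ValueError):
        # A failed or misconfigured notification must never mask the ingest error.
        return False
    return True


def run_with_slack_alert(ingest_fn) -> None:
    """Run ingest_fn; on any exception, notify Slack and re-raise."""
    try:
        ingest_fn()
    except Exception as exc:
        notify_slack(f"Ingest failed: {exc!r}")
        raise
```

Re-raising after the notification keeps the existing exit code and traceback intact, so the console logs remain usable for anyone who is still watching them.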
Apologies if this is partially overlapping with #328.
Currently we run the data portal ingest (1) automatically via a Rancher CronJob (daily on dev, weekly on prod) and (2) by manually triggering the CronJob as part of the monthly scheduled release after new nmdc-server code is deployed. If an automated ingest fails, we generally have to consult the logs to understand what went wrong. When we run the ingest manually, we generally keep an eye on the logs as it runs to ensure everything goes okay. Both of these tasks are slightly hampered by the noisiness of the logs.
Messages like these are commonly logged at the ERROR or WARNING level, but don't actually indicate serious problems with the ingest:
These messages do act as a sort of informal "progress indicator".
So some goals of this issue: