Clean up logging statements emitted by ingest process

pkalita-lbl commented 7 months ago

Apologies if this is partially overlapping with #328.

Currently we run the data portal ingest (1) automatically via a Rancher CronJob (daily on dev, weekly on prod) and (2) by manually triggering the CronJob as part of the monthly scheduled release after new nmdc-server code is deployed. If an automated ingest fails we generally have to consult the logs to understand what went wrong. When we run the ingest manually we generally keep an eye on the logs as it runs to ensure everything goes okay. Both of these tasks are slightly hampered by the noisiness of the logs.

Messages like these are commonly logged at the ERROR or WARNING level, but don't actually indicate serious problems with the ingest:

Unexpected type nmdc:ReadBasedTaxonomyAnalysisActivity (expected nmdc:ReadbasedAnalysis)

Encountered pipeline with no associated omics_processing: gold:Gp0321411

Unknown data object nmdc:a604c87c632165bb5223eebda60801d0 for nmdc:8e2d8da1d05b292a52a33732a6bc4391

These messages do act as a sort of informal "progress indicator".

So some goals of this issue:

While ingest is running, do not log data consistency issues that do not prevent the ingest from finishing. These can be saved and reported at the end or dropped into a separate file. This end-of-ingest report should also indicate what should be done to resolve them.
While ingest is running, do log periodic progress indicators to help a person monitoring the logs understand what is happening. Use https://github.com/tqdm/tqdm perhaps?
Ensure that if ingest encounters a data issue that cannot be recovered from, it raises an exception to halt the ingest.
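
A minimal sketch of how these three goals could fit together, not the actual nmdc-server code. `IssueCollector`, `UnrecoverableDataError`, `ingest_records`, the logger name, and the record fields are all hypothetical names chosen for illustration. Progress is reported with plain `logging` calls here rather than tqdm.

```python
import logging
from collections import Counter, defaultdict

logger = logging.getLogger("ingest_sketch")  # hypothetical logger name


class IssueCollector:
    """Accumulates recoverable data issues so they are reported once, at the end."""

    def __init__(self, max_examples=3):
        self.counts = Counter()
        self.examples = defaultdict(list)
        self.max_examples = max_examples

    def record(self, kind, detail):
        # Count every occurrence, but keep only a few examples per kind.
        self.counts[kind] += 1
        if len(self.examples[kind]) < self.max_examples:
            self.examples[kind].append(detail)

    def report(self):
        lines = [f"{sum(self.counts.values())} recoverable data issue(s) found"]
        for kind, count in self.counts.most_common():
            lines.append(f"  {kind}: {count}")
            for example in self.examples[kind]:
                lines.append(f"    e.g. {example}")
        return "\n".join(lines)


class UnrecoverableDataError(Exception):
    """Raised when ingest cannot continue (hypothetical exception type)."""


def ingest_records(records, collector, progress_every=1000):
    """Process records; defer recoverable issues, halt on unrecoverable ones."""
    processed = 0
    for i, record in enumerate(records, start=1):
        if record.get("id") is None:
            # Unrecoverable in this sketch: stop the ingest immediately.
            raise UnrecoverableDataError(f"record {i} has no id")
        if record.get("omics_processing") is None:
            # Recoverable: remember it for the end-of-ingest report.
            collector.record("no associated omics_processing", record["id"])
        else:
            processed += 1
        if i % progress_every == 0:
            logger.info("processed %d records", i)
    return processed
```

At the end of the run, `collector.report()` could be logged once or written to a separate file. A real report would also need a per-kind hint on how to resolve each issue, which this sketch leaves out.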

eecavanna commented 5 months ago

One part of the ingest that I find confusing is the 30-minute period of "silence" (no console output), which occurs towards the end of the ingest process. I propose introducing some "sign of life" indicator for that—something that tells the user "I'm still working." When I encountered that period of silence during today's release, I thought ingest had failed (it hadn't).
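
One possible shape for such a "sign of life" indicator, sketched under assumptions: a background thread logs a message at a fixed interval while a long step runs. `heartbeat` and the logger name are hypothetical, and this only helps if the silent step releases the GIL (as blocking database calls usually do).

```python
import contextlib
import logging
import threading
import time

logger = logging.getLogger("ingest_sketch")  # hypothetical logger name


@contextlib.contextmanager
def heartbeat(label, interval_seconds=60.0, emit=logger.info):
    """Emit a 'still working' message every interval until the block exits."""
    stop = threading.Event()
    started = time.monotonic()

    def beat():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_seconds):
            elapsed = time.monotonic() - started
            emit(f"{label}: still working ({elapsed:.0f}s elapsed)")

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        yield
    finally:
        # Stop the heartbeat whether the step succeeded or raised.
        stop.set()
        thread.join()
```

Usage would look like `with heartbeat("final ingest step"): run_long_step()`, where `run_long_step` stands in for whatever currently runs during the silent period.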

eecavanna commented 1 month ago

The CLI command used to run ingest (on Spin) is:

nmdc-server ingest -vv --function-limit=0 --swap-rancher-secrets

That CLI command is implemented in: /nmdc_server/cli.py#L69

Two things I think we could do to make the ingest process easier to monitor are:

Wrap most of the function in a try/except so that, when an exception is thrown, a helper function posts a message to Slack via the Slack API (if a Slack API URL is defined in the environment)
After the currently-final instruction in the function (i.e. the instruction that prints "Done" to the console), call that same helper function to post a "Done" message to Slack

Assuming failures result in exceptions being thrown, the person monitoring the ingest process could now ignore the console and keep an eye on (or subscribe to a channel on) Slack instead.
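
A rough sketch of that idea, assuming a Slack incoming webhook (a URL that accepts a JSON body with a `text` field). The environment variable name `SLACK_WEBHOOK_URL` and the functions `build_slack_request`, `notify_slack`, and `run_with_notifications` are hypothetical, not existing nmdc-server code.

```python
import json
import os
import urllib.request


def build_slack_request(webhook_url, text):
    """Build the POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify_slack(text, env=os.environ):
    """Post text to Slack if a webhook URL is configured; return True if sent."""
    url = env.get("SLACK_WEBHOOK_URL")  # hypothetical variable name
    if not url:
        return False  # no URL defined in the environment: do nothing
    try:
        with urllib.request.urlopen(build_slack_request(url, text), timeout=10):
            pass
    except OSError:
        # URLError is an OSError; a failed notification should not break ingest.
        return False
    return True


def run_with_notifications(ingest, notify=notify_slack):
    """Run ingest, reporting failure or completion through notify."""
    try:
        ingest()
    except Exception as exc:
        notify(f"Ingest failed: {exc!r}")
        raise  # keep the original failure visible to the CronJob
    notify("Ingest done")
```

Re-raising after the notification keeps the job's exit status unchanged, so Slack becomes an additional signal and does not replace the logs.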

I have experience posting messages to Slack via the Slack API (e.g. the messages sent by "Website health checker" in the #updown Slack channel).

microbiomedata / nmdc-server

Clean up logging statements emitted by ingest process #1194