microbiomedata / nmdc-server

Data portal client and server for NMDC.
https://data.microbiomedata.org

Clean up logging statements emitted by ingest process #1194

Open pkalita-lbl opened 7 months ago

pkalita-lbl commented 7 months ago

Apologies if this is partially overlapping with #328.

Currently we run the data portal ingest (1) automatically via a Rancher CronJob (daily on dev, weekly on prod) and (2) by manually triggering the CronJob as part of the monthly scheduled release after new nmdc-server code is deployed. If an automated ingest fails, we generally have to consult the logs to understand what went wrong. When we run the ingest manually, we generally keep an eye on the logs as it runs to make sure everything goes okay. Both of these tasks are hampered by the noisiness of the logs.

Messages like these are commonly logged at the ERROR or WARNING level, but don't actually indicate serious problems with the ingest:

Unexpected type nmdc:ReadBasedTaxonomyAnalysisActivity (expected nmdc:ReadbasedAnalysis)
Encountered pipeline with no associated omics_processing: gold:Gp0321411
Unknown data object nmdc:a604c87c632165bb5223eebda60801d0 for nmdc:8e2d8da1d05b292a52a33732a6bc4391

These messages do act as a sort of informal "progress indicator".
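One possible way to keep these messages as a progress indicator while taking them out of the ERROR/WARNING stream is a `logging.Filter` that relabels known-benign messages as INFO. This is only a sketch: the `BENIGN_PATTERNS` list and the `DemoteBenignMessages` class are hypothetical names, the patterns are copied from the three example messages above, and nothing here reflects how nmdc-server currently configures logging.

```python
import logging
import re

# Hypothetical patterns, taken from the example messages in this issue.
BENIGN_PATTERNS = [
    re.compile(r"^Unexpected type "),
    re.compile(r"^Encountered pipeline with no associated omics_processing"),
    re.compile(r"^Unknown data object "),
]


class DemoteBenignMessages(logging.Filter):
    """Relabel known-benign WARNING/ERROR records as INFO (illustrative only)."""

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            message = record.getMessage()
            if any(pattern.search(message) for pattern in BENIGN_PATTERNS):
                record.levelno = logging.INFO
                record.levelname = logging.getLevelName(logging.INFO)
        # Always keep the record; only its level label changes.
        return True
```

Attaching the filter with `handler.addFilter(DemoteBenignMessages())` would leave the messages visible but make real errors easier to spot, for example when grepping the log for `ERROR`. Fixing the level at each logging call site would be cleaner where that is practical.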

So some goals of this issue:

eecavanna commented 5 months ago

One part of the ingest that I find confusing is the 30-minute period of "silence" (no console output), which occurs towards the end of the ingest process. I propose introducing some "sign of life" indicator for that—something that tells the user "I'm still working." When I encountered that period of silence during today's release, I thought ingest had failed (it hadn't).
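A "sign of life" indicator along these lines could be sketched as a context manager that logs a message at a fixed interval from a background thread while the slow step runs. The names `heartbeat`, `interval_seconds`, and the `ingest.heartbeat` logger are assumptions made for illustration, not existing nmdc-server code.

```python
import contextlib
import logging
import threading
import time

logger = logging.getLogger("ingest.heartbeat")  # hypothetical logger name


@contextlib.contextmanager
def heartbeat(label, interval_seconds=60.0, emit=logger.info):
    """Call emit() every interval_seconds until the with-block exits."""
    stop = threading.Event()
    started = time.monotonic()

    def beat():
        # Event.wait() returns False on timeout and True once stop is set.
        while not stop.wait(interval_seconds):
            elapsed = time.monotonic() - started
            emit(f"{label}: still working ({elapsed:.0f}s elapsed)")

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        yield
    finally:
        stop.set()
        thread.join()
```

The silent portion of the ingest would then be wrapped as `with heartbeat("ingest"): ...`. This only helps if the slow step lets other Python threads run while it waits (blocking I/O such as a database call usually does); whether that holds for the silent step here would need to be checked.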

eecavanna commented 1 month ago

The CLI command used to run ingest (on Spin) is:

nmdc-server ingest -vv --function-limit=0 --swap-rancher-secrets

That CLI command is implemented in: /nmdc_server/cli.py#L69

Two things that I think we could do to make the ingest process easier to monitor are:

  1. Wrap most of the function in a try/except so that, when an exception is raised, it calls a helper function that posts a message to Slack via the Slack API (if a Slack API URL is defined in the environment)
  2. After the currently-final instruction in the function (i.e. the instruction that prints "Done" to the console), call that same helper function to post a "Done" message to Slack

Assuming failures result in exceptions being thrown, the person monitoring the ingest process could now ignore the console and keep an eye on (or subscribe to a channel on) Slack instead.
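The two steps above could look roughly like the following sketch. It assumes a Slack incoming-webhook URL is supplied in an environment variable; the variable name `SLACK_WEBHOOK_URL` and the helper names `post_to_slack` and `run_with_notifications` are hypothetical, and the actual ingest function in `cli.py` is stood in for by a generic `task` callable.

```python
import json
import os
import urllib.request


def post_to_slack(text, webhook_url=None, opener=urllib.request.urlopen):
    """Post text to a Slack incoming webhook; return False if skipped or failed."""
    # SLACK_WEBHOOK_URL is an assumed variable name, not an existing setting.
    url = webhook_url or os.environ.get("SLACK_WEBHOOK_URL")
    if not url:
        return False  # No URL configured, so notifications are disabled.
    request = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with opener(request, timeout=10):
            pass
    except OSError:
        # A failed notification should not take the ingest down with it.
        return False
    return True


def run_with_notifications(task, notify=post_to_slack):
    """Run task(); notify on failure (then re-raise) and on success."""
    try:
        task()
    except Exception as exc:
        notify(f"Ingest failed: {exc!r}")
        raise
    notify("Ingest done")
```

Re-raising after the notification keeps the process exit status and the console traceback unchanged, so the Slack message supplements the logs and does not replace them.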

I have experience posting messages to Slack via the Slack API (e.g. the messages sent by "Website health checker" in the #updown Slack channel).