Improve alerting and sanity checking on ingest

microbiomedata / nmdc-server

Data portal client and server for NMDC.

https://data.microbiomedata.org

Improve alerting and sanity checking on ingest #328

Closed jeffbaumes closed 4 days ago

jeffbaumes commented 3 years ago

The ingest can check for more problems and alert people to potential issues.

Just capturing the idea, no action needed yet.

jeffbaumes commented 2 years ago

Investigate:

- Making this a shell script that outputs to Rancher
- If it is kept as a celery task, output celery logs to Rancher
- Cover more exception cases in a try/except so ingest can keep going more often
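
The try/except idea could look roughly like the sketch below. It assumes the ingest can be split into independent, named steps; `run_steps` and the step names are hypothetical and are not taken from the nmdc-server code.

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("ingest")


def run_steps(steps: Dict[str, Callable[[], None]]) -> List[Tuple[str, Exception]]:
    """Run every step in order; log and collect failures instead of aborting."""
    failures: List[Tuple[str, Exception]] = []
    for name, step in steps.items():
        try:
            step()
        except Exception as exc:
            # Broad on purpose: one bad step should not stop the remaining ones.
            logger.exception("Ingest step %r failed", name)
            failures.append((name, exc))
    return failures
```

The returned list gives the caller one place to decide whether to alert someone or exit with a non-zero status once all steps have been attempted.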

naglepuff commented 1 year ago

I haven't dug too deeply, but this Stack Overflow question and answer might be worth exploring. Essentially, we can create a class that inherits from Celery's Task and handles failure cases in a custom way, such as printing the exception to the log in addition to storing it in the results backend.
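
A minimal sketch of that idea, assuming Celery's standard `on_failure` hook. The class name `LoggedTask` and the logger name are hypothetical, and the import fallback exists only so the snippet runs where Celery is not installed.

```python
import logging

try:
    from celery import Task
except ImportError:
    # Stand-in so this sketch runs without Celery; not part of the real design.
    class Task:  # type: ignore[no-redef]
        def on_failure(self, exc, task_id, args, kwargs, einfo):
            pass

logger = logging.getLogger("ingest.worker")


class LoggedTask(Task):
    """Hypothetical base class: log failures in addition to storing them."""

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        # When Celery calls this hook, einfo carries the traceback details.
        logger.error("Task %s failed: %r\n%s", task_id, exc, einfo)
        super().on_failure(exc, task_id, args, kwargs, einfo)
```

A task would opt in by passing the class as its base, for example `@app.task(base=LoggedTask)`, where `app` is the project's Celery application.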

naglepuff commented 1 year ago

We're also seeing an issue where running an ingest sometimes logs nothing to the worker. We saw it recently when running a full ingest shortly after a partial ingest completed. Whether an ingest is full or partial shouldn't affect the logging, however.

eecavanna commented 5 days ago

I have a concrete plan for updating the ingest code so it posts a message on Slack when ingest is done. I'll create a branch off of this ticket, implement what I have in mind there, and then open a PR.

The plan has four steps. First, create a Slack app called something like "Ingest watcher" (final name TBD). Second, create a webhook URL associated with a Slack channel (e.g. #ingest-notifications or one of our existing channels, also TBD). Third, define an nmdc-server config variable that reads that webhook URL from an environment variable. Fourth, update the ingest CLI code so that, just before it prints "Done" to the console, it sends an HTTP request to that webhook URL, which posts a message on Slack.
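
As a rough illustration of the last two steps, here is a sketch that uses only the Python standard library. The environment variable name `INGEST_SLACK_WEBHOOK_URL` and both function names are assumptions; the actual config variable and code layout are still to be decided.

```python
import json
import os
import urllib.request
from typing import Optional

# Hypothetical environment variable name; the real config variable is TBD.
WEBHOOK_ENV_VAR = "INGEST_SLACK_WEBHOOK_URL"


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build the JSON POST request that a Slack incoming webhook accepts."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify_slack(text: str, timeout: float = 10.0) -> bool:
    """Post text to Slack; return False if no webhook URL is configured."""
    webhook_url: Optional[str] = os.environ.get(WEBHOOK_ENV_VAR)
    if not webhook_url:
        return False  # Notifications are optional, so skip quietly.
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200
```

The ingest CLI could then call something like `notify_slack("Ingest finished")` right before printing "Done". Returning `False` when the variable is unset keeps local and test runs from needing a webhook.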

eecavanna commented 4 days ago

I closed this ticket when PR #1466 was merged. That PR makes the ingester post a message to Slack when ingest finishes running successfully.

Because this ticket covered more than success notifications (it also covered failures and possibly other events), I have created a new ticket about updating the ingester to also post a message to Slack when ingest fails. That new ticket is: https://github.com/microbiomedata/nmdc-server/issues/1467

I will leave this ticket closed.