psu-libraries / researcher-metadata

Penn State University's faculty and research metadata repository
https://metadata.libraries.psu.edu/
MIT License

healthchecks #700

Open whereismyjetpack opened 1 year ago

whereismyjetpack commented 1 year ago

Provide a health check endpoint that checks the health of the application and its dependencies.

I'm going to start with the default okcomputer health checks and add one for Delayed::Job. I want to look at queue depth as well as queue failures.

Later on we might want to check some of our other dependencies as well (orcid, onebutton).
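For the delayed job check, okcomputer's documented pattern is to subclass `OkComputer::Check` and register it in an initializer. A rough sketch of what the queue depth check might look like — the class name, check name, and threshold of 50 are all illustrative, not decided:

```ruby
# config/initializers/okcomputer.rb — sketch only; assumes the okcomputer gem's
# Check interface (subclass defines #check and calls mark_failure/mark_message).
class DelayedJobQueueDepthCheck < OkComputer::Check
  def initialize(threshold)
    @threshold = threshold
  end

  # okcomputer calls #check on each request to the health check endpoint
  def check
    depth = Delayed::Job.where(created_at: 10.minutes.ago..).where(failed_at: nil).count
    if depth > @threshold
      mark_failure
      mark_message "Queue depth #{depth} exceeds threshold #{@threshold}"
    else
      mark_message "Queue depth #{depth} within threshold"
    end
  end
end

# Threshold value here is a placeholder
OkComputer::Registry.register "delayed_job_queue_depth", DelayedJobQueueDepthCheck.new(50)
```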

whereismyjetpack commented 1 year ago

Failed jobs:

```ruby
Delayed::Job.where.not(failed_at: nil).length > 0
```

Queue depth:

```ruby
Delayed::Job.where(created_at: 10.minutes.ago..).where(failed_at: nil).length > threshold
```

I'm noodling around in irb, and this is what I've come up with. Thoughts? @EricDurante @ajkiessl
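One small note on the queries: `.count` would run `SELECT COUNT(*)` in the database, where `.length` loads every matching record first. The two predicates themselves reduce to simple threshold logic; here's a plain-Ruby sketch with jobs represented as hashes (so it can be exercised outside Rails) — in the app, the real `Delayed::Job` queries above would take the place of the array scans:

```ruby
require "time"

# True if any job has recorded a failure (failed_at is set).
def failed_jobs?(jobs)
  jobs.any? { |j| !j[:failed_at].nil? }
end

# True if more than `threshold` unfailed jobs were created in the
# last 10 minutes (the window from the proposed query).
def queue_backed_up?(jobs, threshold:, now: Time.now)
  window_start = now - (10 * 60)
  pending = jobs.count { |j| j[:created_at] >= window_start && j[:failed_at].nil? }
  pending > threshold
end
```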

EricDurante commented 1 year ago

The queue depth check looks like a good place to start - we can adjust as needed.

The failed jobs check probably won't be very useful unless we adjust how some other things are currently working. By default, delayed_job automatically deletes jobs that fail, but keeping failed jobs around is very helpful for debugging, so we've overridden that default. So for now, failed jobs are going to stick around indefinitely unless we go clean them up by hand, and we might not want to clean up any given job until we've gotten to the root of what made it fail and fixed the issue. So in that case, the check might be in a failing state so often that it wouldn't be helpful.

On the other hand, there are certainly other ways of preserving that debugging info besides keeping the failed job record in the database. For the existing Scholarsphere upload job, we're already recording a few bits of info in the scholarsphere_work_deposits table when a job fails so that RMD admins have easy access to pertinent info, and in some situations we're logging some additional info that's helpful for debugging, but the failed record in the delayed_jobs table provides still more useful info.

So maybe instead of depending on the database to store that info for as long as we need it, we could serialize those records to the log upon failure. We could probably make that the default behavior for every custom job by implementing delayed_job's failure callback in ApplicationJob. Then we'd be free to delete the records immediately after their presence has alerted us to a problem via the health check. Or maybe there's some other altogether better approach for detecting/notifying job failures?
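The serialize-on-failure idea could be as small as a one-line method that the delayed_job `failure` hook on ApplicationJob delegates to. A minimal sketch — the method name and the attribute hash passed in are illustrative, not actual RMD code:

```ruby
require "json"
require "logger"

# Serialize a failed job's attributes to the log so the delayed_jobs
# row can be deleted once the health check has surfaced the failure.
# In the app this would be invoked from delayed_job's `failure(job)`
# hook, passing something like job.attributes.
def log_failed_job(job_attributes, logger)
  logger.error("delayed_job failure: #{job_attributes.to_json}")
end
```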

What do you think?

EricDurante commented 1 year ago

I do think it would be great to have some health checks that test our connections to the various 3rd-party APIs.