pulibrary / pdc_describe

Description application for Research Data content

Better error reporting for files that aren't moving as expected #1574

Open bess opened 1 year ago

bess commented 1 year ago

Currently, we are seeing periodic failures in the movement of files between S3 buckets. These errors often seem to be ephemeral (e.g., Redis is unavailable, which causes the job to fail, but it succeeds when we manually retry the job in Sidekiq), but some indicate a bigger problem that needs intervention. Right now there's no obvious workflow step in the UI to tell us whether the background jobs succeeded: once a curator clicks "approve", they walk away. Bess is currently checking the Sidekiq queues manually to find errors, as part of support for the data migration, but this isn't a sustainable system.

Could we have a conversation about how to better catch these problems and report them in the UI so they don't fall through the cracks?

[Image: Screenshot 2023-10-12 at 11 10 08 AM]

hectorcorrea commented 1 year ago

To view the failed jobs queue you can go to: https://pdc-describe-prod.princeton.edu/describe/sidekiq/morgue

If there are jobs in the "dead" queue, the workaround to retry these jobs is as follows:

It is important that you access the Sidekiq interface through the tabs that the Capistrano task opened, because those tabs point directly to each of the individual servers. If you go through the load balancer, the retry does not work.
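
As a starting point for the reporting conversation, here is a minimal sketch of how the dead queue could be summarized programmatically instead of checked by hand. It is not code from this repository. The method name `triage_dead_jobs` and the list of "transient" error classes are assumptions made up for illustration; the only thing it relies on is that each dead-set entry responds to `jid`, `klass`, and `item` (a Hash with `"error_class"` and `"error_message"`), which is the shape of the entries yielded by `Sidekiq::DeadSet`.

```ruby
# Hypothetical list of error classes treated as transient (likely to succeed
# on retry). These names are assumptions, not taken from this issue; adjust
# them to whatever the application actually sees in its dead queue.
TRANSIENT_ERROR_CLASSES = %w[
  Redis::CannotConnectError
  Redis::TimeoutError
].freeze

# Splits dead-set entries into likely-transient failures and failures that
# probably need a person to look at them.
#
# `entries` is any enumerable whose elements respond to `jid`, `klass`, and
# `item` (a Hash containing "error_class" and "error_message").
def triage_dead_jobs(entries, transient: TRANSIENT_ERROR_CLASSES)
  report = { transient: [], needs_attention: [] }
  entries.each do |entry|
    error_class = entry.item["error_class"]
    bucket = transient.include?(error_class) ? :transient : :needs_attention
    report[bucket] << {
      jid: entry.jid,
      job_class: entry.klass,
      error_class: error_class,
      error_message: entry.item["error_message"]
    }
  end
  report
end

# Possible usage from a Rails console on one of the individual servers:
#
#   require "sidekiq/api"
#   report = triage_dead_jobs(Sidekiq::DeadSet.new)
#   report[:needs_attention].each { |row| puts row }
```

The method only reads the entries; it does not retry or delete anything. A report like this could feed a UI notice or a scheduled email, so that failures needing intervention are separated from ones that a plain retry would likely fix.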