Open bess opened 1 year ago
To view the failed jobs queue you can go to: https://pdc-describe-prod.princeton.edu/describe/sidekiq/morgue
If there are jobs in the "dead" queue, the workaround to retry them is as follows:
Connect to the VPN
From your machine, run the Capistrano task that launches the Sidekiq console. This will open two tabs in your browser, one for each server where Sidekiq is running:
cap production sidekiq:console
Go to the first tab that Capistrano opened (e.g. http://localhost:nnnn/describe/sidekiq/morgue) and click the "retry all" button at the bottom of the page. The jobs usually succeed when retried. (Be careful: the "delete all" button is right next to the "retry all" button. Don't click it.)
Go to the second tab that Capistrano opened (e.g. http://localhost:xxxx/describe/sidekiq/morgue) and do the same.
It is important that you access the Sidekiq interface through the tabs that the Capistrano task opened, because those tabs point directly to each of the individual servers. If you go through the load balancer, the retry does not work.
Currently, we are seeing periodic failures when moving files between S3 buckets. These errors often seem to be ephemeral (e.g., Redis is unavailable, which causes the job to fail, but the job succeeds when we manually retry it in Sidekiq), but some indicate a bigger problem that needs intervention. There is no obvious workflow step in the UI that tells us whether the background jobs succeeded: once a curator clicks "approve", they walk away. Right now Bess is checking the Sidekiq queues manually to find errors, as part of support for the data migration, but this isn't a sustainable system.
Could we have a conversation about how to better catch these problems and report them in the UI so they don't fall through the cracks?
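As one possible starting point for that conversation, here is a minimal sketch of how the dead queue could be summarized automatically instead of being checked by hand. It assumes the entries yielded by `Sidekiq::DeadSet` (from `sidekiq/api`) respond to `klass` and `item`, and that `item` carries an `"error_class"` key. The helper names, the `TRANSIENT_ERRORS` list, and the job class name `ApprovedFileMoveJob` are all hypothetical and not taken from the application; the stand-in entries only exist so the sketch runs without Redis.

```ruby
# Hypothetical helper: count dead jobs per (job class, error class) pair.
# `entries` is any enumerable whose elements respond to `klass` and `item`,
# which is the interface assumed here for entries from Sidekiq::DeadSet.
def summarize_dead_jobs(entries)
  counts = Hash.new(0)
  entries.each do |entry|
    error_class = entry.item["error_class"] || "unknown"
    counts[[entry.klass, error_class]] += 1
  end
  counts
end

# Assumption: error classes we treat as usually transient (safe to retry).
# This list is illustrative only and would need to be agreed on by the team.
TRANSIENT_ERRORS = ["Redis::CannotConnectError"].freeze

# Keep only the groups whose error class is not in the transient list,
# i.e. the failures that probably need a person to look at them.
def needs_attention(summary)
  summary.reject do |(_job_class, error_class), _count|
    TRANSIENT_ERRORS.include?(error_class)
  end
end

# Stand-in for dead-set entries so the sketch runs without Sidekiq or Redis.
FakeEntry = Struct.new(:klass, :item)

sample = [
  FakeEntry.new("ApprovedFileMoveJob", { "error_class" => "Redis::CannotConnectError" }),
  FakeEntry.new("ApprovedFileMoveJob", { "error_class" => "Redis::CannotConnectError" }),
  FakeEntry.new("ApprovedFileMoveJob", { "error_class" => "Aws::S3::Errors::NoSuchKey" })
]

summary = summarize_dead_jobs(sample)
needs_attention(summary).each do |(job_class, error_class), count|
  puts "#{job_class}: #{count} dead job(s) with #{error_class}"
end
```

In a Rails console on one of the servers, the same helper could presumably be called as `summarize_dead_jobs(Sidekiq::DeadSet.new)` after `require "sidekiq/api"`. A scheduled task could then surface the `needs_attention` result in the UI or send it as a notification, though where that report should live is exactly the open question here.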