tpendragon closed this 6 months ago
This caused Figgy to crash for 20 minutes on 1/11/22
Thoughts from Slack:
Accidentally closed with a bad ref from an unrelated PR.
This is a new feature. We're hoping that users won't be broken by this after #4999 is closed.
This happened again when something requested a PDF 44 times in a couple of minutes.
This issue is causing intermittent downtime on Figgy once again. I uploaded some logs from a Datadog search for "get concern scanned_resources pdf"; they're here: https://drive.google.com/drive/u/2/folders/1CdaMVunK_PZYH2YHMmJHprKFyG_u65xg
You can also see this by looking at APM traces. E.g. right now if I go to figgy's APM trace for ScannedResourceController#pdf and click on that, I see the graphs for latency and errors are correlated.
Question to answer:
How do we handle the case where the Sidekiq queue is locked up by long-running ingest jobs? They're not high priority; they've just been started and take a long time.
A special process w/ critical queue only? Spawn a new process?
This technique seems reasonable to me: https://github.com/sidekiq/sidekiq/wiki/Advanced-Options#reserved-queues
I think this means we'll have to add another Sidekiq service, on either one Figgy worker box or all of them, to run the extra process. Here are the relevant Ansible files for the current process; I guess we'd just add another one directly in the figgy role.
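To make the reserved-queue idea concrete, here is a toy model in plain Ruby. It is not Sidekiq's API: `WorkerProcess`, the queue names `critical` and `default`, and the job labels are all made up for illustration. It only shows why a second process that listens to nothing but the critical queue stays free while the general process is tied up with an ingest job.

```ruby
# Hypothetical model (not Sidekiq's API): each process listens to an ordered
# list of queues and runs one job at a time.
WorkerProcess = Struct.new(:name, :queues, :current_job) do
  def busy?
    !current_job.nil?
  end

  # If idle, start the first waiting job from the queues this process listens to.
  def pick(pending)
    return nil if busy?

    queues.each do |queue_name|
      job = pending[queue_name].shift
      next unless job

      self.current_job = job
      return job
    end
    nil
  end
end

# Walk through the scenario from this issue and report what each process runs.
def simulate_reserved_queue
  pending  = Hash.new { |hash, queue_name| hash[queue_name] = [] }
  general  = WorkerProcess.new("general", %w[critical default])
  reserved = WorkerProcess.new("reserved", %w[critical])

  # A long-running ingest job arrives first and occupies the general process.
  pending["default"] << "ingest-1"
  general.pick(pending)
  reserved.pick(pending) # nothing on the critical queue yet, so it stays idle

  # A PDF job arrives while the ingest job is still running.
  pending["critical"] << "pdf-1"
  general.pick(pending)  # still busy, picks nothing
  reserved.pick(pending) # free, picks up the PDF job

  { general: general.current_job, reserved: reserved.current_job }
end
```

In a real deployment the equivalent would presumably be a second Sidekiq process started with only the critical queue (something like `sidekiq -q critical`, as the wiki page describes), which is what the extra service in the Ansible role would run.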
A little research:
I think we're generating a PDF, attaching it to the resource, and redirecting to the download controller for that file attached to the resource. I think this means read only PDFs aren't working, we're just redirecting them to a 404 page.
I'm wondering if we can do the following to reduce polling and extra db stuff:
Do we need an endpoint? I thought with actioncable you can put whatever code you want to run in the channel object I guess? Either way, it might help to make a process diagram.
Oh you're saying use a PDF status endpoint as a place for the browser to land? It's basically just a page that says "the pdf is generating" and has a status bar?
I guess I had envisioned that populating the top of the show page, but given the way things are embedded everywhere maybe a dedicated page makes sense.
https://github.com/pulibrary/figgy/compare/main...pdf-actioncable-spike has a spike for the notification system. There's a small flaw in that if it somehow manages to process the PDF before they make it to the progress bar page, then they're just stuck there, but it's a decent proof of concept.
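One possible way around that flaw, sketched under assumptions: send the current state as soon as the browser subscribes, so a PDF that finished early still produces a "done" message. `initial_pdf_status` is a hypothetical helper, not something in the spike; the payload shape follows the `{ pct_complete:, url: }` message used in the proposal.

```ruby
# Hypothetical helper: builds the message to send right after a browser subscribes.
# pdf_path is nil while the PDF is still being generated.
def initial_pdf_status(pdf_path:, pct_complete: 0)
  if pdf_path
    # The PDF finished before the page subscribed: report completion immediately.
    { pct_complete: 100, url: pdf_path }
  else
    # Still generating: report whatever progress is known so far.
    { pct_complete: pct_complete }
  end
end
```

In an ActionCable channel this could probably be called from the `subscribed` callback and sent with `transmit`, so the progress page never waits for a broadcast that already happened.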
Proposal steps:
base/pdf.html.erb template that renders a loading page.
ScannedResourcesController#pdf render base/pdf.html.erb if the PDF isn't generated already. (POC: https://github.com/pulibrary/figgy/commit/b5145e62be542e288178574a21e065b3274da256#diff-8e0ad8bd2a2f574caf3ce37dd016ada26ec0185ac3bfccf2c308d9aa0a344c1eR52-R58) (If we put this behind a feature flipper, we might be able to merge this if we wanted.)
base/pdf.html.erb is present. (POC: https://github.com/pulibrary/figgy/commit/b5145e62be542e288178574a21e065b3274da256#diff-6b4397d53010efb20a9c5c7985af1e0987f53d155049fa9784829d7b5cabdf1f)
{ pct_complete: 100, url: "/downloads/<scanned_resource_id>/file/<pdf_file>" }, redirect to the URL.
pct_complete is 100. Maybe put this in its own messaging class (POC: https://github.com/pulibrary/figgy/compare/pdf-actioncable-spike#diff-145e2f4bc32330c795f7290067dc6e9b1dcbd178b74b24d876b3262074d64679R26)
Rails.cache.fetch("pdf_download_job_id_#{resource.id}", expires_in: 30.minutes) { PDFJob.perform_later(bla).id }
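To show what the `Rails.cache.fetch` line buys us, here is a plain-Ruby sketch with an in-memory stand-in for the cache. `ExpiringCache` and `enqueue_pdf_once` are hypothetical names, and the block stands in for `PDFJob.perform_later(...).id`. The point is that repeated requests within the expiry window get the same job id back instead of enqueueing another job.

```ruby
# Minimal in-memory stand-in for Rails.cache.fetch with expiry. Assumption: the
# real app would use Rails.cache backed by a store shared between web servers.
class ExpiringCache
  Entry = Struct.new(:value, :expires_at)

  # clock is injectable so expiry can be exercised without sleeping.
  def initialize(clock: -> { Time.now })
    @clock = clock
    @entries = {}
  end

  # Return the cached value if present and unexpired; otherwise run the block,
  # cache its result for expires_in seconds, and return it.
  def fetch(key, expires_in:)
    now = @clock.call
    entry = @entries[key]
    return entry.value if entry && entry.expires_at > now

    value = yield
    @entries[key] = Entry.new(value, now + expires_in)
    value
  end
end

# Hypothetical enqueue helper: the block stands in for PDFJob.perform_later(...).id
# and only runs when no unexpired job id is cached for this resource.
def enqueue_pdf_once(cache, resource_id, &enqueue)
  cache.fetch("pdf_download_job_id_#{resource_id}", expires_in: 30 * 60, &enqueue)
end
```

Note that a fetch like this is a cache lookup, not a lock, so two requests arriving at the same instant could still both enqueue. That is one reason a guard inside the job itself may still be worth having.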
Datadog metrics for PDF requests on our web servers: https://app.datadoghq.com/logs?query=-%22health.json%22%20pdf%20service%3Afiggy%20source%3Anginx%20%22catalog%22%20&analyticsOptions=%5B%22line%22%2C%22dog_classic%22%2Cnull%2Cnull%5D&cols=host%2Cservice&index=%2A&messageDisplay=inline&refresh_mode=sliding&step=1200000&storage=hot&stream_sort=time%2Casc&viz=timeseries&from_ts=1709844222672&to_ts=1711140222672&live=true
If a user eventually tries again because it's slow (or maybe even tries several times), it can lock up an entire box (or multiple boxes).
We'll need to come up with some way of backgrounding the generation of PDFs and notifying the user when it's ready.
We discussed the options below and decided to go with option 5: Generate in the background and provide a progress bar (see https://groups.google.com/g/samvera-tech/c/yThEBMzA4_o/m/QTZUz-JYCAAJ)
Prevent generating the PDF twice: the job should have a guard that checks whether the PDF is already being generated and, if it is, exits.
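A rough sketch of what such a guard could look like, in plain Ruby. `PdfGenerationGuard` is a hypothetical class, and it only tracks claims inside one process; since Figgy runs jobs on several worker boxes, a real guard would need a store shared between them (the database or a cache, for example).

```ruby
require "set"

# Hypothetical in-process registry of resources whose PDF is being generated.
# Assumption: a real deployment would keep this state somewhere shared.
class PdfGenerationGuard
  def initialize
    @in_progress = Set.new
    @mutex = Mutex.new
  end

  # Runs the block unless generation for resource_id is already under way.
  # Returns :skipped when another job holds the claim, otherwise the block's result.
  def run(resource_id)
    # Set#add? returns nil when the id is already present, so the claim is atomic
    # with respect to other threads using this guard.
    claimed = @mutex.synchronize { @in_progress.add?(resource_id) }
    return :skipped unless claimed

    begin
      yield
    ensure
      # Release the claim even if generation raises, so later jobs can retry.
      @mutex.synchronize { @in_progress.delete(resource_id) }
    end
  end
end
```

A job would wrap its PDF generation in `guard.run(resource.id) { ... }` and simply return when it gets `:skipped`.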
Steps: