pulibrary / figgy

Valkyrie-based digital repository backend.

Generating a PDF the first time locks up a Passenger thread, eventually freezing up a Figgy machine. #4876

Closed tpendragon closed 6 months ago

tpendragon commented 2 years ago

If a user eventually tries again because it's slow (or maybe even tries several times), it can lock up an entire box (or multiple boxes.)

We'll need to come up with some way of generating PDFs in the background and notifying users when the PDF is ready.

We discussed the options below and decided to go with option 5: Generate in the background and provide a progress bar (see https://groups.google.com/g/samvera-tech/c/yThEBMzA4_o/m/QTZUz-JYCAAJ)

Prevent generating the same PDF twice: the job should have a guard that checks whether the PDF is already being generated and, if it is, exits.
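As a rough illustration of that guard, here is a minimal plain-Ruby sketch. `PdfGenerationRegistry` and `GeneratePdfJob` are hypothetical names, not Figgy's actual classes, and the in-memory `Set` only works within a single process; with several worker processes the "in progress" marker would have to live in a shared store such as the database.

```ruby
require "set"

# Hypothetical registry of resources whose PDF is currently being generated.
# Process-local state only; an assumption made to keep the sketch runnable.
class PdfGenerationRegistry
  def initialize
    @mutex = Mutex.new
    @in_progress = Set.new
  end

  # Returns true if the caller claimed the resource, false if a generation
  # for it is already running. Set#add? returns nil when the id is present.
  def acquire(resource_id)
    @mutex.synchronize { !@in_progress.add?(resource_id).nil? }
  end

  def release(resource_id)
    @mutex.synchronize { @in_progress.delete(resource_id) }
    nil
  end
end

# Hypothetical job: exits early when a generation is already under way.
class GeneratePdfJob
  def initialize(registry:, generator:)
    @registry = registry
    @generator = generator # any callable that builds the PDF
  end

  def perform(resource_id)
    return :skipped unless @registry.acquire(resource_id)

    begin
      @generator.call(resource_id)
      :generated
    ensure
      # Always release, so a failed run does not block later attempts.
      @registry.release(resource_id)
    end
  end
end
```

With this shape, a second request for the same resource returns immediately instead of tying up another worker while the first generation is still running.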

Steps:

tpendragon commented 2 years ago

This caused Figgy to crash for 20 minutes on 1/11/22

tpendragon commented 2 years ago

Thoughts from Slack:

  1. PDF Background Job + Form where they fill in an email for a notification.
  2. Dedicate a couple of boxes specifically to downloads so that manifests don't die for everyone when this freezes things up.
  3. Receive a UV alert when a download is ready (takes 20 minutes)
  4. Pre-generate all PDFs (most of these won't ever get downloaded)
  5. Generate in the background and provide a progress bar (see https://groups.google.com/g/samvera-tech/c/yThEBMzA4_o/m/QTZUz-JYCAAJ)

hackartisan commented 2 years ago

Accidentally closed with a bad ref from an unrelated PR.

tpendragon commented 2 years ago

This is a new feature. We're hoping that users won't be broken by this after #4999 is closed.

tpendragon commented 2 years ago

This happened again: something requested a PDF 44 times in a couple of minutes.

hackartisan commented 10 months ago

This issue is causing intermittent downtime on Figgy once again. I uploaded some logs from a Datadog search for "get concern scanned_resources pdf"; they're here: https://drive.google.com/drive/u/2/folders/1CdaMVunK_PZYH2YHMmJHprKFyG_u65xg

You can also see this by looking at APM traces. E.g. right now if I go to figgy's APM trace for ScannedResourceController#pdf and click on that, I see the graphs for latency and errors are correlated.

tpendragon commented 7 months ago

Question to answer:

How do we handle the case where the Sidekiq queue is locked up by long-running ingest jobs? They're not high priority; they've just been started and take a long time.

A special process w/ critical queue only? Spawn a new process?

hackartisan commented 7 months ago

This technique seems reasonable to me: https://github.com/sidekiq/sidekiq/wiki/Advanced-Options#reserved-queues

I think this means we'll have to introduce another Sidekiq service, running a second process, on either one Figgy worker box or all of them. Here are the relevant Ansible files for the current process; I guess we'd just add another one directly in the figgy role.

tpendragon commented 7 months ago

A little research:

I think we're generating a PDF, attaching it to the resource, and redirecting to the download controller for that file attached to the resource. I think this means read-only PDFs aren't working; we're just redirecting them to a 404 page.

I'm wondering if we can do the following to reduce polling and extra db stuff:

  1. Implement a PDF status endpoint.
  2. When a user goes to the resource PDF page, initiate an ActionCable connection and populate its initial state with the initial status. When the status reaches "complete", redirect to the DownloadController.
  3. Generate the PDF in the background. During generation, publish the status every once in a while.
  4. Don't enqueue another PDF generation if one exists. I think @eliotjordan figured out how to do that in the Mosaic improvements code in #6091.
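To make step 3 a bit more concrete, here is a minimal plain-Ruby sketch of "publish the status every once in a while". `PdfProgressReporter` is a hypothetical name, and the block passed to it stands in for whatever would actually broadcast the status (for example an ActionCable broadcast); the message shape (`status`, `percent`) is also an assumption.

```ruby
# Hypothetical helper that throttles progress updates: it publishes after
# every `every` pages and once more when the last page is done.
class PdfProgressReporter
  def initialize(total_pages:, every: 10, &publish)
    raise ArgumentError, "total_pages must be positive" unless total_pages.positive?
    raise ArgumentError, "a publish block is required" unless publish

    @total = total_pages
    @every = every
    @publish = publish # stand-in for the real broadcast mechanism
    @done = 0
  end

  # Call once per page rendered into the PDF.
  def page_done
    @done += 1
    finished = @done >= @total
    return unless finished || (@done % @every).zero?

    @publish.call(
      { status: finished ? "complete" : "generating",
        percent: [(100 * @done) / @total, 100].min }
    )
  end
end
```

Publishing only every few pages keeps the number of broadcasts small for large volumes, while the final "complete" message is always sent so the page knows when to redirect.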

hackartisan commented 7 months ago

Do we need an endpoint? I thought that with ActionCable you can put whatever code you want to run in the channel object. Either way, it might help to make a process diagram.

hackartisan commented 7 months ago

Oh you're saying use a PDF status endpoint as a place for the browser to land? It's basically just a page that says "the pdf is generating" and has a status bar?

I guess I had envisioned that populating the top of the show page, but given the way things are embedded everywhere maybe a dedicated page makes sense.

tpendragon commented 7 months ago

https://github.com/pulibrary/figgy/compare/main...pdf-actioncable-spike has a spike for the notification system. There's a small flaw: if it somehow manages to process the PDF before the user makes it to the progress bar page, they're just stuck there. Still, it's a decent proof of concept.
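One possible way around that flaw, sketched in plain Ruby rather than taken from the spike: remember the last status per resource and replay it to anyone who subscribes late. `PdfStatusFeed` is a hypothetical name and the in-memory hashes are an assumption; in the real app the latest status would more likely be read from the resource or a shared store when the channel is subscribed.

```ruby
require "monitor"

# Hypothetical status feed that remembers the most recent status for each
# resource, so a late subscriber still learns that the PDF is complete.
class PdfStatusFeed
  def initialize
    @lock = Monitor.new # reentrant, so listeners may call back into the feed
    @latest = {}
    @listeners = Hash.new { |hash, key| hash[key] = [] }
  end

  # Record the status and deliver it to the current subscribers.
  def publish(resource_id, status)
    @lock.synchronize do
      @latest[resource_id] = status
      @listeners[resource_id].dup.each { |listener| listener.call(status) }
    end
    nil
  end

  # Register a listener and immediately replay the latest known status.
  def subscribe(resource_id, &listener)
    @lock.synchronize do
      @listeners[resource_id] << listener
      current = @latest[resource_id]
      listener.call(current) if current
    end
    nil
  end
end
```

Because the replay happens inside `subscribe`, a browser that arrives after generation has finished receives "complete" right away and can redirect instead of waiting for a message that was already sent.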

tpendragon commented 7 months ago

Proposal steps:

tpendragon commented 7 months ago

Datadog metrics for PDF requests on our web servers: https://app.datadoghq.com/logs?query=-%22health.json%22%20pdf%20service%3Afiggy%20source%3Anginx%20%22catalog%22%20&analyticsOptions=%5B%22line%22%2C%22dog_classic%22%2Cnull%2Cnull%5D&cols=host%2Cservice&index=%2A&messageDisplay=inline&refresh_mode=sliding&step=1200000&storage=hot&stream_sort=time%2Casc&viz=timeseries&from_ts=1709844222672&to_ts=1711140222672&live=true