ShanaLMoore commented 1 year ago

Summary

⚠️ This temp fix needs to be undone to properly work on this issue:

https://github.com/scientist-softserv/palni-palci/issues/924

ref slack convo: https://assaydepot.slack.com/archives/C0313NK5NMA/p1701391872277629 more context here

tldr; 1hr long videos takes days to process

Nic had a client demo and noticed that the AV was not loading his video, the day before the meeting. He had assumed the 59 minute video had finished processing because he uploaded it the day before.

After looking into it, Kirk and Shana discovered that the video was still processing.

We implemented a hack in the meantime, to help Nic be successful with his demo, however Rob requested we make a ticket for him to look into what appears to be a bug with processing.

Additionally we wrote a script to remove mp4 related jobs, and Nic has asked their customers to not upload mp4s. These jobs have a schedule_at date 6 months from now.

After this work is done, the dev should respawn the mp4 jobs we took out of the queue, to unblock everything else.

https://github.com/scientist-softserv/palni-palci/issues/853

Questions

Is it OK to leave our current hack in place, indefinitely? If not, create a ticket for us to undo our "hack", after this ticket has been addressed.

Screenshot

Testing Instructions

Login as an admin
Open a new tab and navigate to /sidekiq/busy. Click the Live Poll button in the top right (should be green)
Create a new work. Give it a video file a) I (@bkiahstroud) recommend a video of ~5 minutes in length to test first (see Step 7a)
Quickly switch tabs back to the Sidekiq dashboard (step 2). Watch the jobs as they process (i.e. appear and disappear)
Verify that a job called CreateLargeDerivativesJob shows up at some point and that it gets put in the auxilliary queue
Verify that the CreateLargeDerivativesJob does not fail (i.e. it should not go back and forth between the Retries queue and the Busy queue) a) If the same CreateLargeDerivativesJob keeps disappearing and reappearing on the Busy page, check the Retries tab of the Sidekiq dashboard to see if it keeps showing up there. This is an indication that it is failing and retrying over and over
Verify that the CreateLargeDerivativesJob does not take "too long" a) A video between 5-10 minutes should not take longer than 6 hours to process
(Dev only) Verify that the derivative file(s) get created successfully
Repeat steps 3-8 with an audio file instead of a video

Notes

ndroark commented 1 year ago

Not sure if this is related, but some audio file types process and play back well (https://franklin.hykucommons.org/concern/generic_works/b6e70708-0026-47ea-9e7c-f06c809455bc?q=douglas%20gray) while others don't (https://franklin.hykucommons.org/concern/generic_works/c5bee722-7d44-4030-bca4-77b3081ac051?q=judi%20warren)

jeremyf commented 12 months ago

One thing to consider is that Hyrax by default creates two derivatives for a video: webm and mp4. That’s twice the amount of derivative generation.

Maybe we could see about removing one of those format types. The following code is the reference: https://github.com/samvera/hyrax/blob/b8c4fa4c8fddbb4d4d4b89fc4b514bd6d5d83928/app/services/hyrax/file_set_derivatives_service.rb#L98-L103

ShanaLMoore commented 11 months ago

I wrote and ran the this script to unblock PALS in the meantime. It finds the mp4 jobs and schedules them to a later date, to unblock the bottleneck.

bkiahstroud commented 10 months ago

The problem

The underlying issue appears to be a CPU bottleneck. Long-running ffmpeg commands are most likely being throttled.

Proposed fix

When an audio or video derivative job is triggered, we should put in into a separate Sidekiq queue (e.g. "ffmpeg"). After that, at an ops level, have the default 3 workers run all other queues only. Then have a 4th, separate worker that runs all the other queues plus the "ffmpeg" queue. The new, additional worker has a CPU limit of 4 (higher than default) and 1 thread (lower than default).

This effectively creates a powerful "slow lane". "ffmpeg" jobs will slow down the worker while they're running, but they won't bog down all the jobs since the other three workers are still running.

Make the "ffmpeg" queue priority equal to the default queue. If Steve starts processing a PDF at 10am, and Billy starts processing a video at 11am, it makes sense that the video should have to wait for the PDFs to finish. For Billy, the real-life difference between waiting 20 minutes and 60 minutes is negligible.

jillpe commented 9 months ago

SoftServ QA: ✅

(Video created)[https://dev.commons-archive.org/concern/oers/a2283fe4-e88f-4bdd-96a3-e5a8641218a4?locale=en] Took a total of 1 hour and 15ish minutes to run the jobs.

bkiahstroud commented 9 months ago

Blocked

Both worker and workerAuxiliary deployments fail. The pods spawn and immediately get stuck in an infinite loop trying to connect to redis. The GitHub deployment action itself fails (example) with this error:

Error: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: timed out waiting for the condition: no ConfigMap with the name "palni-palci-demo-redis" found

A potential cause of this issue could be that the Hyrax chart version was upgraded (diff). Worth noting is that base Hyku is using the same chart version we are and doesn't seem to have this issue