Open ShanaLMoore opened 1 year ago
Not sure if this is related, but some audio file types process and play back well (https://franklin.hykucommons.org/concern/generic_works/b6e70708-0026-47ea-9e7c-f06c809455bc?q=douglas%20gray) while others don't (https://franklin.hykucommons.org/concern/generic_works/c5bee722-7d44-4030-bca4-77b3081ac051?q=judi%20warren)
One thing to consider is that Hyrax by default creates two derivatives for a video: `webm` and `mp4`. That's twice the amount of derivative generation.
Maybe we could see about removing one of those format types. The following code is the reference: https://github.com/samvera/hyrax/blob/b8c4fa4c8fddbb4d4d4b89fc4b514bd6d5d83928/app/services/hyrax/file_set_derivatives_service.rb#L98-L103
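For illustration only, here is a rough sketch of what trimming the video outputs could look like in an override (the module name is hypothetical, the outputs hash mirrors the linked Hyrax method, and which format to keep is an open question):

```ruby
# Illustrative sketch only (module name is hypothetical; this is not an actual Hyku patch).
# It overrides the video branch of Hyrax::FileSetDerivativesService so that only one
# playback format is generated. The outputs structure mirrors the linked Hyrax method.
module SingleVideoFormatDerivatives
  def create_video_derivatives(filename)
    Hydra::Derivatives::VideoDerivatives.create(
      filename,
      outputs: [
        { label: :thumbnail, format: 'jpg', url: derivative_url('thumbnail') },
        # Keeping only mp4 (or only webm) roughly halves the ffmpeg work per video.
        { label: 'mp4', format: 'mp4', url: derivative_url('mp4') }
      ]
    )
  end
end

Hyrax::FileSetDerivativesService.prepend(SingleVideoFormatDerivatives)
```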
I wrote and ran this script to unblock PALS in the meantime. It finds the mp4 jobs and reschedules them to a later date to relieve the bottleneck.
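For reference, a minimal sketch of what that kind of rescheduling could look like from a Rails console. Assumptions: the mp4 jobs sit in the default Sidekiq queue and can be matched on their arguments; the `'mp4'` string match is illustrative, not the exact filter the real script used.

```ruby
# Minimal sketch of the rescheduling idea (run from a Rails console).
require 'sidekiq/api'

reschedule_at = 6.months.from_now.to_f

Sidekiq::Queue.new('default').each do |job|
  # Illustrative filter: match any queued job whose arguments mention an mp4 file.
  next unless job.args.to_s.include?('mp4')

  payload = job.item.merge('at' => reschedule_at)
  job.delete                    # remove it from the live queue
  Sidekiq::Client.push(payload) # re-enqueue it as a scheduled job ~6 months out
end
```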
The underlying issue appears to be a CPU bottleneck. Long-running `ffmpeg` commands are most likely being throttled.
When an audio or video derivative job is triggered, we should put it into a separate Sidekiq queue (e.g. "ffmpeg"). After that, at an ops level, have the default 3 workers run all the other queues only. Then have a 4th, separate worker that runs all the other queues plus the "ffmpeg" queue. The new, additional worker has a CPU limit of 4 (higher than default) and 1 thread (lower than default).
This effectively creates a powerful "slow lane". "ffmpeg" jobs will slow down the worker while they're running, but they won't bog down all the jobs since the other three workers are still running.
Make the "ffmpeg" queue priority equal to the default queue. If Steve starts processing a PDF at 10am, and Billy starts processing a video at 11am, it makes sense that the video should have to wait for the PDFs to finish. For Billy, the real-life difference between waiting 20 minutes and 60 minutes is negligible.
SoftServ QA: ✅
[Video created](https://dev.commons-archive.org/concern/oers/a2283fe4-e88f-4bdd-96a3-e5a8641218a4?locale=en). Took a total of 1 hour and 15ish minutes to run the jobs.
Both `worker` and `workerAuxiliary` deployments fail. The pods spawn and immediately get stuck in an infinite loop trying to connect to Redis. The GitHub deployment action itself fails (example) with this error:
```
Error: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: timed out waiting for the condition: no ConfigMap with the name "palni-palci-demo-redis" found
```
A potential cause is that the Hyrax chart version was upgraded (diff). Worth noting: base Hyku uses the same chart version we do and doesn't seem to have this issue.
Summary
⚠️ This temporary fix needs to be undone in order to work on this issue properly:
Ref Slack convo (more context here): https://assaydepot.slack.com/archives/C0313NK5NMA/p1701391872277629
TL;DR: 1-hour-long videos take days to process.
Nic had a client demo and, the day before the meeting, noticed that the AV was not loading his video. He had assumed the 59-minute video had finished processing because he had uploaded it the day before.
After looking into it, Kirk and Shana discovered that the video was still processing.
We implemented a hack in the meantime to help Nic be successful with his demo; however, Rob requested that we make a ticket for him to look into what appears to be a bug with processing.
Additionally, we wrote a script to remove mp4-related jobs, and Nic has asked their customers not to upload mp4s. These jobs have a schedule_at date 6 months from now.
After this work is done, the dev should respawn the mp4 jobs we took out of the queue to unblock everything else.
related
Questions
Is it OK to leave our current hack in place indefinitely? If not, create a ticket for us to undo our "hack" after this ticket has been addressed.
Screenshot
Testing Instructions
1. Go to `/sidekiq/busy`. Click the Live Poll button in the top right (should be green)
2. Verify that `CreateLargeDerivativesJob` shows up at some point and that it gets put in the `auxilliary` queue
3. Verify that `CreateLargeDerivativesJob` does not fail (i.e. it should not go back and forth between the Retries queue and the Busy queue)
   a) If the same `CreateLargeDerivativesJob` keeps disappearing and reappearing on the Busy page, check the Retries tab of the Sidekiq dashboard to see if it keeps showing up there. This is an indication that it is failing and retrying over and over (a console sketch for this check is included under Notes below)
4. Verify that `CreateLargeDerivativesJob` does not take "too long"
   a) A video between 5-10 minutes should not take longer than 6 hours to process

Notes
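Supplementary to steps 3 and 4 above: the Retries set can also be checked from a Rails console with Sidekiq's standard API. This is just a convenience sketch; the job class name is taken from the instructions above.

```ruby
# Convenience sketch: list retrying CreateLargeDerivativesJob entries from a Rails console.
require 'sidekiq/api'

Sidekiq::RetrySet.new.each do |job|
  next unless job.display_class == 'CreateLargeDerivativesJob'

  puts "retry ##{job['retry_count']}: #{job['error_message']}"
end
```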
related convo: https://assaydepot.slack.com/archives/C0313NKC08L/p1697141000008749