sciencehistory / scihist_digicoll

Science History Institute Digital Collections

CreateCombinedAudioDerivativesJob uses a lot of memory for longer oral histories #1529

Closed eddierubeiz closed 2 years ago

eddierubeiz commented 2 years ago

CreateCombinedAudioDerivativesJob ran out of memory this morning during the creation of the Maureen Charron oral history. @sunicolita ran it again an hour later and the job succeeded, so we're not facing an immediate problem. Still, these derivatives are straining the memory resources of our current Heroku dynos, and this ticket is about whether we can make these jobs more reliable.

About the problem:

  1. For testing purposes, the code to run is: CombinedAudioDerivativeCreator.new(Work.find_by_friendlier_id('c0lar63')).generate.

  2. The problem occurs in the underlying call to ffmpeg_transformer.rb, and takes the form of a series of Error R14 (Memory quota exceeded) errors, followed in some cases by a TTY::Command::ExitError as the derivative creator job fails.
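For context, a minimal sketch of the kind of call involved. The real ffmpeg_transformer.rb may work quite differently; this just illustrates the likely shape of the failing step, assuming ffmpeg's concat demuxer and the TTY::Command gem (which raises TTY::Command::ExitError when the command exits non-zero). The method name and arguments here are hypothetical.

```ruby
require "tempfile"

# Hypothetical sketch: build an ffmpeg argument list for concatenating
# audio files with the concat demuxer, which reads inputs one at a time
# from a list file rather than loading everything up front.
def ffmpeg_concat_args(input_paths, output_path)
  list = Tempfile.new(["concat", ".txt"])
  input_paths.each { |p| list.puts("file '#{p}'") }
  list.flush

  ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
   "-i", list.path, output_path]
end

args = ffmpeg_concat_args(["a.flac", "b.flac"], "combined.m4a")
# In the app, something like the following would execute it:
#   TTY::Command.new.run(*args)
# raising TTY::Command::ExitError if ffmpeg exits non-zero.
puts args.first  # => "ffmpeg"
puts args.last   # => "combined.m4a"
```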

jrochkind commented 2 years ago

Hypothetically, an Error R14 (Memory quota exceeded) alone won't always result in a problem -- Heroku does NOT shut down your process for an R14. The process will be running on swap and thus very slowly, which may somehow have led ffmpeg to decide it was an error. It's kind of a mystery why it actually failed this time.

So if you only see R14 without the actual error, it probably didn't actually fail. The TTY::Command::ExitError is what meant a failure happened, the R14 alone doesn't mean that.

On the other hand, if you see R15 - Memory quota vastly exceeded (which we haven't knowingly seen for this problem yet) -- that actually means Heroku killed your process, and something probably failed -- and it likely won't even be logged as a further exception, because Heroku killed it hard before it could even log or register an error with Honeybadger etc.

jrochkind commented 2 years ago

One idea... what if the out of memory happens only/mostly when we have multiple busy workers on a worker dyno?

We could totally make a new queue just for combined audio derivatives, set to only have one worker at a time, and have hirefire scale up from 0 for it -- at the cost of waiting longer (possibly up to 2-3 minutes) for your combined audio derivatives to get created.

To ensure there is only ONE worker running at a time on the dyno, not competing for RAM with others. (For cost reasons, we probably still wouldn't want to use a bigger than standard-2x dyno... although we could compute the cost of larger dynos that are only scaled up occasionally for such jobs; maybe it would still be OK?)

Nic confirms it's no big deal to them if they have to wait ~4 minutes longer for the audio deriv to be created. (Waiting for hirefire to notice a scale up is needed, and then the time it takes to start up a heroku dyno)
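A minimal sketch of the dedicated-queue idea, assuming a Resque-style worker started from a Procfile entry (the process type, queue name, and worker command here are hypothetical; the real app would reuse whatever worker setup it already has):

```
# Procfile (hypothetical entry): a separate process type that works ONLY
# the combined-audio queue, so the heavy ffmpeg job never competes for
# RAM with other jobs on the same dyno.
combined_audio_worker: bundle exec rake resque:work QUEUE=combined_audio_derivatives
```

HireFire would then be configured to scale this process type up from 0 when the queue is non-empty, accepting the extra startup latency described above.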

jrochkind commented 2 years ago

We also still could do some investigating into whether there's a way to get ffmpeg to do this task with less RAM. How do we find an ffmpeg expert to ask? Stack Overflow?

Or whether there's some software other than ffmpeg that could do this task with less RAM. But I know ffmpeg is very popular and the industry standard for A/V manipulations.

jrochkind commented 2 years ago

These Heroku R14 out-of-memory error messages from a worker dyno are happening a lot today. (I confirmed combined audio derivs were being created today, so I feel safe in my belief that's what caused the R14s.)

It hasn't interfered with any actual production of audio derivatives, but it's on the edge. @apinkney0696 while not urgent, we should probably consider this somewhat high priority, to take a look at in the next couple of months.

I'm going to just put it at the top of "backlog" for now.

jrochkind commented 2 years ago

OK, we had some more of these today which allowed me to investigate more.

The "problem" is that the process to create a combined derivative can:

[I think the ones that are having problems are very-long (6-12 file and 6-12 hour) interviews, which are also FLACs. (The mp3 sources seem to compute into combined derivatives fairly fast without RAM problems). ]

These facts can result in two classes of "failure" or "problem".

Some evidence

One work that had combined oral history fail was https://digital.sciencehistory.org/works/uzh0kyb.

It has 14 FLAC files, a total of around 12G.

Creating a combined derivative on my MacBook took around 45 minutes -- not sure how much of this time was just pulling down the FLACs from S3; I think that alone could be taking like 10 minutes or more!

If I tried to create the combined derivative in a one-off standard-2x dyno, it generated lots of R14 errors, even though it was the only thing running in that dyno. It didn't ever get to a Heroku-kill R15 in the 40 minutes I waited for it -- but on our actual worker dynos with 3 workers, you could easily see this happening if all three are working on an excessive-RAM task. I killed it before it completed.

I tried again on a performance-m dyno, and no Heroku R14 or R15 errors were generated. It still took 40 minutes to complete. This is a slow process.
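The standard-2x vs. performance-m observation can be checked with back-of-envelope arithmetic, assuming Heroku's published memory quotas at the time (standard-2x = 1 GB, performance-m = 2.5 GB) and a guessed peak resident size for the ffmpeg job that is merely consistent with the observations above:

```ruby
# Assumed Heroku memory quotas (MB) at the time of this issue.
STANDARD_2X_MB   = 1024
PERFORMANCE_M_MB = 2560

# Hypothetical peak resident size of the ffmpeg job: somewhere between
# the two quotas, e.g. ~1.5 GB, would explain R14s on standard-2x but
# silence on performance-m.
peak_mb = 1536

puts peak_mb > STANDARD_2X_MB    # => true  (R14s on standard-2x)
puts peak_mb > PERFORMANCE_M_MB  # => false (no errors on performance-m)
```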

Ameliorations

Short/Medium term

Medium term

Longer term

jrochkind commented 2 years ago

@CHF-IT-Staff I have unchecked “scale-down non-empty queues” in hirefire.

This will make hirefire a bit slower in reducing our dynos when the bg workload goes down -- but it should significantly reduce the number of times it interrupts a long-running bg job (like our combined derivatives). Since our incremental cost from scaled-up bg dynos has been VERY minimal, I am placing a bet this won't increase the cost much either, and it will cause fewer problems with those long-running combined-derivatives jobs.

But we should keep an eye on it. We should get notifications from hirefire if we have more than 2 dynos for more than 2 hours, so hopefully that will also keep the setting change from resulting in surprise bills.

[Screenshot: HireFire settings, 2022-01-26 2:57 PM]

jrochkind commented 2 years ago

I think after tweaks #1560 and #1579, this job is performing faster with less RAM, and may no longer be a RAM issue even running on a standard-1x dyno with other work.

Going to move this back to backlog and see if the problem recurs.

eddierubeiz commented 2 years ago

Very few, if any, "vastly exceeded" (R15) errors recently, but we're still getting plain "Memory quota exceeded" R14 errors. This is more of an annoyance than a real problem, but J. suggests tweaking the HireFire settings as a possible fix:

As of early March the process per worker dyno ratio in HireFire is

We could change the ratio to (e.g.)

jrochkind commented 2 years ago

Closing this one, I think we have responded to it to our satisfaction, including in #1744