sciencehistory / scihist_digicoll

Science History Institute Digital Collections

CreateCombinedAudioDerivativesJob uses a lot of memory for longer oral histories #1529

Closed eddierubeiz closed 2 years ago

eddierubeiz commented 2 years ago

CreateCombinedAudioDerivativesJob ran out of memory this morning during the creation of the Maureen Charron oral history. @sunicolita ran it again an hour later and the job succeeded, so we're not facing an immediate problem. Still, these derivatives are straining the memory resources of our current Heroku dynos, and this ticket is about whether we can make these jobs more reliable.

About the problem:

  1. For testing purposes, the code to run is: CombinedAudioDerivativeCreator.new(Work.find_by_friendlier_id('c0lar63')).generate.

  2. The problem occurs in the underlying call to ffmpeg_transformer.rb, and takes the form of a series of Error R14 (Memory quota exceeded) errors, followed in some cases by a TTY::Command::ExitError as the derivative creator job fails.
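For context, a minimal sketch of the kind of call involved. The real ffmpeg_transformer.rb may work quite differently; this just illustrates the likely shape of the failing step, assuming ffmpeg's concat demuxer and the TTY::Command gem (which raises TTY::Command::ExitError when the command exits non-zero). The method name and arguments here are hypothetical.

```ruby
require "tempfile"

# Hypothetical sketch: build an ffmpeg argument list for concatenating
# audio files with the concat demuxer, which reads inputs one at a time
# from a list file rather than loading everything up front.
def ffmpeg_concat_args(input_paths, output_path)
  list = Tempfile.new(["concat", ".txt"])
  input_paths.each { |p| list.puts("file '#{p}'") }
  list.flush

  ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
   "-i", list.path, output_path]
end

args = ffmpeg_concat_args(["a.flac", "b.flac"], "combined.m4a")
# In the app, something like the following would execute it:
#   TTY::Command.new.run(*args)
# raising TTY::Command::ExitError if ffmpeg exits non-zero.
puts args.first  # => "ffmpeg"
puts args.last   # => "combined.m4a"
```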

jrochkind commented 2 years ago

Hypothetically, an Error R14 (Memory quota exceeded) alone won't always result in a problem -- Heroku does NOT shut down your process for an R14. The process will be running on swap and thus very slowly, which may somehow have led ffmpeg to decide it was an error. It's kind of a mystery why it actually failed this time.

So if you only see R14 without the actual error, it probably didn't actually fail. The TTY::Command::ExitError is what meant a failure happened, the R14 alone doesn't mean that.

On the other hand, if you see R15 - Memory quota vastly exceeded (which we haven't knowingly seen for this problem yet) -- that actually means Heroku killed your process, and something probably failed -- and it likely won't even be logged as a further exception, because Heroku killed it hard before it could even log or register an error with Honeybadger etc.

jrochkind commented 2 years ago

One idea... what if the out of memory happens only/mostly when we have multiple busy workers on a worker dyno?

We could totally make a new queue just for combined audio derivatives, set to only have one worker at a time, and have hirefire scale up from 0 for it -- at the cost of waiting longer (possibly up to 2-3 minutes) for your combined audio derivatives to get created.

To ensure there is only ONE worker running at a time on the dyno, not competing for RAM with others. (For cost reasons, we probably still wouldn't want to use a bigger than standard-2x dyno... although we could compute the cost of larger dynos that are only scaled up occasionally for such jobs; maybe it would still be OK?)

Nic confirms it's no big deal to them if they have to wait ~4 minutes longer for the audio deriv to be created. (Waiting for hirefire to notice a scale up is needed, and then the time it takes to start up a heroku dyno)
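A minimal sketch of the dedicated-queue idea, assuming a Resque-style worker started from a Procfile entry (the process type, queue name, and worker command here are hypothetical; the real app would reuse whatever worker setup it already has):

```
# Procfile (hypothetical entry): a separate process type that works ONLY
# the combined-audio queue, so the heavy ffmpeg job never competes for
# RAM with other jobs on the same dyno.
combined_audio_worker: bundle exec rake resque:work QUEUE=combined_audio_derivatives
```

HireFire would then be configured to scale this process type up from 0 when the queue is non-empty, accepting the extra startup latency described above.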

jrochkind commented 2 years ago

We also still could do some investigating into whether there's a way to get ffmpeg to do this task with less RAM. How do we find an ffmpeg expert to ask? Stack Overflow?

Or whether there's some software other than ffmpeg that could do this task with less RAM. But I know ffmpeg is very popular and the industry standard for A/V manipulations.

jrochkind commented 2 years ago

These Heroku R14 out-of-memory error messages from a worker dyno are happening a lot today. (I confirmed combined audio derivs were being created today, so I feel safe in my belief that's what caused the R14s.)

It hasn't interfered with any actual production of audio derivatives, but it's on the edge. @apinkney0696 while not urgent, we should probably consider this somewhat high priority, to take a look at in the next couple of months.

I'm going to just put it at the top of "backlog" for now.

jrochkind commented 2 years ago

OK, we had some more of these today which allowed me to investigate more.

The "problem" is that the process to create a combined derivative can:

[I think the ones that are having problems are very-long (6-12 file and 6-12 hour) interviews, which are also FLACs. (The mp3 sources seem to compute into combined derivatives fairly fast without RAM problems). ]

These facts can result in two classes of "failure" or "problem".

Some evidence

One work that had combined oral history fail was https://digital.sciencehistory.org/works/uzh0kyb.

It has 14 FLAC files, a total of around 12G.

Creating a combined derivative on my MacBook took around 45 minutes -- not sure how much of this time was just pulling down the FLACs from S3; I think that alone could be taking like 10 minutes or more!

If I tried to create the combined derivative in a one-off standard-2x dyno, it generated lots of R14 errors, even though it was the only thing running in that dyno. It didn't ever get to a Heroku-kill R15 in the 40 minutes I waited for it -- but on our actual worker dynos with 3 workers, you could easily see this happening if all three are working on an excessive-RAM task. I killed it before it completed.

I tried again on a performance-m dyno, and no Heroku R14 or R15 errors were generated. It still took 40 minutes to complete. This is a slow process.
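The standard-2x vs. performance-m observation can be checked with back-of-envelope arithmetic, assuming Heroku's published memory quotas at the time (standard-2x = 1 GB, performance-m = 2.5 GB) and a guessed peak resident size for the ffmpeg job that is merely consistent with the observations above:

```ruby
# Assumed Heroku memory quotas (MB) at the time of this issue.
STANDARD_2X_MB   = 1024
PERFORMANCE_M_MB = 2560

# Hypothetical peak resident size of the ffmpeg job: somewhere between
# the two quotas, e.g. ~1.5 GB, would explain R14s on standard-2x but
# silence on performance-m.
peak_mb = 1536

puts peak_mb > STANDARD_2X_MB    # => true  (R14s on standard-2x)
puts peak_mb > PERFORMANCE_M_MB  # => false (no errors on performance-m)
```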

Ameliorations

Short/Medium term

Medium term

Longer term

jrochkind commented 2 years ago

@CHF-IT-Staff I have unchecked “scale-down non-empty queues” in hirefire.

This will make hirefire a bit slower in reducing our dynos when the bg workload goes down -- but it should significantly reduce the number of times it interrupts a long-running bg job (like our combined derivatives). Since our incremental cost from scaled-up bg dynos has been VERY minimal, I am placing a bet this won't increase the cost much either, and it will cause fewer problems with those long-running combined-derivatives jobs.

But we should keep an eye on it. We should get notifications from hirefire if we have more than 2 dynos for more than 2 hours, so hopefully that will also keep the setting change from resulting in surprise bills.

[Screenshot: HireFire settings, 2022-01-26 2:57 PM]

jrochkind commented 2 years ago

I think after tweaks #1560 and #1579, this job is performing faster with less RAM, and may no longer be a RAM issue even running on a standard-1x dyno with other work.

Going to move this back to backlog and see if the problem recurs.

eddierubeiz commented 2 years ago

Very few, if any, "vastly exceeded" (R15) errors recently, but we're still getting plain "Memory quota exceeded" R14 errors. This is more of an annoyance than a real problem, but J. suggests tweaking the HireFire settings as a possible fix:

As of early March the process per worker dyno ratio in HireFire is

We could change the ratio to (e.g.)

jrochkind commented 2 years ago

Closing this one, I think we have responded to it to our satisfaction, including in #1744