Hypothetically, an `Error R14 (Memory quota exceeded)` alone won't always result in a problem -- heroku does NOT shut down your process for an R14. It will be running on swap and therefore running very slowly, which may have somehow resulted in ffmpeg deciding it was an error. It's kind of a mystery why it actually failed this time.

So if you only see R14 without the actual error, it probably didn't actually fail. The `TTY::Command::ExitError` is what indicates a failure actually happened; an R14 alone doesn't mean that.
On the other hand, if you see `R15 - Memory quota vastly exceeded` (which we haven't knowingly seen for this problem yet) -- that actually means heroku killed your process, and something probably failed -- and it likely won't even be logged as a further exception, because heroku killed it hard before it could even log or register an error with honeybadger etc.
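To make the distinction concrete, here's a minimal sketch of how the failure actually surfaces, assuming the ffmpeg invocation goes through tty-command (arguments and file names are just illustrative):

```ruby
require "tty-command"

cmd = TTY::Command.new

# An R14 in the heroku logs is just the platform reporting that the dyno is
# over its RAM quota and swapping; the ruby process keeps running. The real
# failure signal is ffmpeg itself exiting non-zero, which tty-command turns
# into an exception:
begin
  cmd.run("ffmpeg", "-i", "input.flac", "output.mp3") # illustrative args
rescue TTY::Command::ExitError => e
  # This is the exception we actually see (via HoneyBadger) when derivative
  # creation really did fail.
  warn "ffmpeg failed: #{e.message}"
end
```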
One idea... what if the out of memory happens only/mostly when we have multiple busy workers on a worker dyno?
We could totally make a new queue just for the combined audio derivative, set to only have one worker at a time, and have hirefire scale up from 0 for it -- at the cost of waiting longer (possibly up to 2-3 minutes) for your combined audio derivatives to get created.

This would ensure there is only ONE worker running at a time on the dyno, not competing for RAM with others. (We probably still wouldn't want to use a bigger than standard-2x dyno though, for cost... although we could compute the cost for larger dynos that are only going to be scaled up occasionally for such jobs; maybe it would still be ok?)
Nic confirms it's no big deal to them if they have to wait ~4 minutes longer for the audio deriv to be created. (Waiting for hirefire to notice a scale up is needed, and then the time it takes to start up a heroku dyno)
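A rough sketch of what the dedicated queue might look like, assuming the job goes through ActiveJob on Resque (the class name comes from this ticket; the queue name and `perform` body are illustrative, not necessarily what we'd ship):

```ruby
# app/jobs/create_combined_audio_derivatives_job.rb
class CreateCombinedAudioDerivativesJob < ApplicationJob
  # Route to its own queue instead of the shared default queue, so it can be
  # worked by a dedicated dyno type that HireFire scales up from 0.
  queue_as :combined_audio_derivatives

  def perform(work)
    CombinedAudioDerivativeCreator.new(work).generate
  end
end
```

The Procfile would then get a separate entry running a single Resque worker pinned to just that queue (something like `combined_audio_worker: QUEUE=combined_audio_derivatives bundle exec rake resque:work`), and HireFire would manage that dyno type between 0 and 1 based on queue depth.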
We also still could do some investigation into whether there's a way to get ffmpeg to do this task with less RAM. How do we find an ffmpeg expert to ask? Stackoverflow?

Or whether there's some software other than ffmpeg that could do this task with less RAM. But I know ffmpeg is very popular/industry standard for A/V manipulations.
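One thing that might be worth benchmarking (I have not verified it actually lowers peak RAM): ffmpeg's concat demuxer, which reads inputs sequentially from a list file rather than opening them all as separate `-i` inputs at once. A rough sketch, with made-up local file paths, assuming all the source FLACs share the same audio parameters (the concat demuxer requires that):

```ruby
require "tty-command"
require "tempfile"

# Made-up paths; in real life these would be the FLAC originals already
# pulled down from S3.
flac_paths = ["01.flac", "02.flac", "03.flac"]

# The concat demuxer takes a text file listing the inputs, one per line.
list_file = Tempfile.new(["ffmpeg_concat", ".txt"])
flac_paths.each { |path| list_file.puts "file '#{File.expand_path(path)}'" }
list_file.flush

TTY::Command.new.run(
  "ffmpeg", "-y",
  "-f", "concat", "-safe", "0", "-i", list_file.path,
  "-codec:a", "libmp3lame", "-b:a", "64k",
  "combined.mp3"
)
```

Even if it turns out not to help with RAM, it's a cheap experiment to run against one of the big works.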
These Heroku R14 out of memory error messages from a worker dyno are happening a lot today. (Confirmed combined audio derivs were being created today, so I feel safe in my belief that's what caused the R14s.)

It hasn't interfered with any actual production of audio derivatives, but it's on the edge. @apinkney0696 while not an urgent priority, we should probably consider this somewhat high priority, to take a look at in the next couple of months.
I'm going to just put it at the top of "backlog" for now.
OK, we had some more of these today which allowed me to investigate more.
The "problem" is that the process to create a combined derivative can:
[I think the ones that are having problems are very-long (6-12 file and 6-12 hour) interviews, which are also FLACs. (The mp3 sources seem to compute into combined derivatives fairly fast without RAM problems). ]
These facts can result in two classes of "failure" or "problem"
- `R15 - Memory quota vastly exceeded` error in logs. These are killed so harshly they do not report to our HoneyBadger monitoring as an error, and also don't register themselves in the DB as an error -- a completely silent failure. This condition only happens very occasionally though.
- `R14 - Memory quota exceeded` in the logs. This also means the process is using disk 'swap' for RAM, which will make an already slow process even slower.
- `Resque::PruneDeadWorkerDirtyExit` in HoneyBadger logs. It should register "failed" in the DB, displayed on the staff admin screen -- but it does not look to me like this is happening, not sure why. It should also be re-enqueued to be retried once, as our failed jobs should be retried, but I'm not sure this is happening.

One work that had a combined oral history derivative fail was https://digital.sciencehistory.org/works/uzh0kyb.
It has 14 FLAC files, a total of around 12G.
Creating a combined derivative on my macbook took around 45 minutes -- not sure how much of this time was just pulling down the FLACs from S3; I think that alone could be taking 10 minutes or more!

If I tried to create the combined derivative in a one-off standard-2x dyno, it generated lots of R14 errors, even though it was the only thing running in that dyno. It didn't ever get to a heroku-kill R15 in the 40 minutes I waited -- but on our actual worker dynos with 3 workers, you could easily see this happening if all three are working on an excessive-RAM task. I killed it before it completed.
I tried again on a performance-m dyno (a one-off `heroku run --size=performance-m` dyno), and no heroku R14 or R15 errors were generated. It still took 40 minutes to complete. This is a slow process.

We are currently creating both an `mp3` and a `webm` derivative. We can make the job slightly faster (maybe 20% or so I think, not earth-shaking) by reducing the number of derivatives we create; I don't think we need both of these formats. One option would be a single `.m4a`/`.mp4` derivative (AAC codec in an MP4 audio-only container), which is a newer format that would give us smaller file sizes, and is also playable in virtually all browsers right now.

Some other options:

- We could separate out "combined audio deriv" jobs into their own queue/pool of workers.
- We could stop ingesting FLAC into the repo. Now that we have some more formal digital preservation plans, if we have some other preservation platform for audio, maybe we don't really need the "lossless" copy, and a "lossy" mp3 and/or mp4 copy is enough -- giving us smaller files, which are cheaper to store, quicker to download, and apparently quicker to process.
- We could look into using AWS MediaConvert (or other cloud A/V processors? Not sure any exist) to do the combination, instead of doing it in a heroku worker dyno. (Or we could set up our own EC2, but we're trying to get out of that business. In any event, these processes take too long to be really suitable for a worker process.)
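To put the single-AAC-derivative idea above in concrete terms, the encode itself would be something like this (codec choice and bitrate are illustrative, not tuned; file names are placeholders):

```ruby
require "tty-command"

# Hypothetical: produce one AAC-in-MP4 combined derivative instead of
# separate mp3 and webm ones.
TTY::Command.new.run(
  "ffmpeg", "-y",
  "-i", "combined_source.wav",
  "-codec:a", "aac", "-b:a", "96k",
  "combined.m4a"
)
```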
@CHF-IT-Staff I have unchecked “scale-down non-empty queues” in hirefire.
This will make hirefire a bit slower in scaling down our dynos when the background workload goes down -- but should significantly reduce the number of times it interrupts a long-running bg job (like our combined derivatives). Since our incremental cost from scaled-up bg dynos has been VERY minimal, I am placing a bet this won't increase the cost much either, and will cause fewer problems with those long-running combined-derivatives jobs.
But we should keep an eye on it. We should get notifications from hirefire if we have more than 2 dynos for more than 2 hours, so hopefully that will also keep the setting change from resulting in surprise bills.
I think after tweaks #1560 and #1579, this job is performing faster with less RAM, and may no longer be a RAM issue even running on a standard-1x dyno with other work.

Going to move this back to backlog and see if the problem recurs.
Very few, if any, "vastly exceeded" (R15) errors recently, but we're still getting plain "Memory quota exceeded" R14 errors. This is more of an annoyance than a real problem, but J. suggests tweaking the HireFire settings as a possible fix:
As of early March the process per worker dyno ratio in HireFire is
We could change the ratio to (e.g.)
Closing this one, I think we have responded to it to our satisfaction, including in #1744
`CreateCombinedAudioDerivativesJob` ran out of memory this morning during the creation of the Maureen Charron oral history. @sunicolita ran it again an hour later and the job succeeded, so we're not facing an immediate problem. Still, these derivatives are straining the memory resources of our current Heroku dynos, and this ticket is about whether we can make these jobs more reliable.

About the problem:
For testing purposes, the code to run is `CombinedAudioDerivativeCreator.new(Work.find_by_friendlier_id('c0lar63')).generate`.

The problem occurs in the underlying call to ffmpeg_transformer.rb, and takes the form of a series of `Error R14 (Memory quota exceeded)` errors, followed in some cases by a `TTY::Command::ExitError` as the derivative creator job fails.
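For convenience, the same reproduction as you'd run it from a one-off dyno console (the dyno sizing is the one used in the investigation notes above; the friendlier_id is the one from this ticket):

```ruby
# From a one-off dyno, e.g.:
#   heroku run --size=performance-m rails console
work = Work.find_by_friendlier_id("c0lar63")
CombinedAudioDerivativeCreator.new(work).generate
```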