Open yondonfu opened 1 year ago
Addressed by: https://github.com/livepeer/FFmpeg/commit/92c358e21b1616b07a61c634365ed441ee097e8c
AFAIK no changes to lpms are required.
Committed directly to the branch by accident; I hope it's trivial enough not to mandate a PR in a secondary repo.
@cyberj0g Nice!
I see that https://github.com/livepeer/FFmpeg/commit/92c358e21b1616b07a61c634365ed441ee097e8c updates ff_cuda_check() to return AVERROR(ENOMEM) if CUDA_ERROR_OUT_OF_MEMORY is detected. Since LPMS only marks AVERROR_UNKNOWN as an unrecoverable error, with this change LPMS will no longer mark CUDA OOM errors as unrecoverable.
In the future, it would be nice to have a way to specifically signal the CUDA_ERROR_ILLEGAL_ADDRESS error from within ffmpeg. Even with this change, any CUDA error other than CUDA_ERROR_OUT_OF_MEMORY would still get marked as unrecoverable, since they all reach LPMS as AVERROR_UNKNOWN. The change in your commit as-is is still a useful improvement though!
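To make the flow discussed in this comment concrete, here is a rough Go model of it. This is only a sketch under stated assumptions: cudaCheck and isUnrecoverable are hypothetical stand-ins for ff_cuda_check() and the LPMS check as described in this thread, not the actual C code. The CUDA result values are taken from cuda.h, and AVERROR(ENOMEM) is assumed to be -12 as on Linux.

```go
package main

import "fmt"

// CUDA driver API result codes (values from cuda.h).
const (
	cudaSuccess             = 0
	cudaErrorInvalidValue   = 1
	cudaErrorOutOfMemory    = 2
	cudaErrorIllegalAddress = 700
)

// libav error codes. AVERROR_UNKNOWN is the negated 'UNKN' tag;
// AVERROR(ENOMEM) assumes Linux, where ENOMEM is 12.
const (
	averrorUnknown int = -('U' | 'N'<<8 | 'K'<<16 | 'N'<<24)
	averrorENOMEM  int = -12
)

// cudaCheck is a hypothetical model of ff_cuda_check() after the commit,
// following the description in this thread: OOM maps to AVERROR(ENOMEM),
// every other CUDA failure still surfaces as AVERROR_UNKNOWN.
func cudaCheck(res int) int {
	switch res {
	case cudaSuccess:
		return 0
	case cudaErrorOutOfMemory:
		return averrorENOMEM
	default:
		return averrorUnknown
	}
}

// isUnrecoverable is a hypothetical model of the LPMS rule:
// only AVERROR_UNKNOWN is treated as unrecoverable.
func isUnrecoverable(averr int) bool {
	return averr == averrorUnknown
}

func main() {
	for _, res := range []int{cudaErrorOutOfMemory, cudaErrorIllegalAddress, cudaErrorInvalidValue} {
		averr := cudaCheck(res)
		fmt.Printf("CUDA result %d -> AVERROR %d, unrecoverable=%v\n", res, averr, isUnrecoverable(averr))
	}
}
```

In this model OOM is no longer unrecoverable, while CUDA_ERROR_ILLEGAL_ADDRESS and an unrelated error such as CUDA_ERROR_INVALID_VALUE remain indistinguishable from each other.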
Diederick is working on this (Dec '22).
An internal CUDA function can return CUDA_ERROR_ILLEGAL_ADDRESS during Nvidia transcoding, which means that the process is in an inconsistent state and needs to be restarted. The original context in which we encountered this issue is documented in https://github.com/livepeer/go-livepeer/issues/1921. In https://github.com/livepeer/lpms/pull/267 we implemented a panic whenever an unrecoverable error is encountered. In https://github.com/livepeer/go-livepeer/pull/2057 we bumped the LPMS version to include this update, and then in https://github.com/livepeer/go-livepeer/pull/2094 and https://github.com/livepeer/go-livepeer/pull/2352 we moved the unrecoverable error check into go-livepeer.
The problem is that LPMS will mark any unknown error (indicated by AVERROR_UNKNOWN) as unrecoverable. As a result, some CUDA errors that do not warrant a process restart would be marked as unrecoverable and go-livepeer would panic for those errors.
For example, a CUDA OOM error is also treated as an unknown error by the libav code.
We should only mark CUDA_ERROR_ILLEGAL_ADDRESS errors as unrecoverable so that go-livepeer only panics for those errors.