imageio / imageio-ffmpeg

FFMPEG wrapper for Python
BSD 2-Clause "Simplified" License

Race condition: We had to kill ffmpeg to stop it. #13

Closed TinyTheBrontosaurus closed 5 years ago

TinyTheBrontosaurus commented 5 years ago

I'm trying to encode several mp4s at a time, and all of the larger ones give me the message "We had to kill ffmpeg to stop it." Most of the files for which this message is printed are corrupted. Looking at the code, there appears to be a race condition: ffmpeg is only given 2.5 s to clean up before it is killed. Is this needed? Or is there a way to disable it?

almarklein commented 5 years ago

It is a safety check to prevent runaway ffmpeg processes. The 2.5 seconds is somewhat arbitrary; it was assumed to be enough. Is it likely that in your case ffmpeg would finish by itself, and if so, could we perhaps consider a more reasonable timeout?

TinyTheBrontosaurus commented 5 years ago

In my test case I had about 500 videos to encode, and about 400 of them were corrupted and printed this message. I went in and disabled the timeout, and all 500 of them succeeded. I'm running in a scenario where the CPU frequently becomes starved by other work, which is why the timeout gets hit.

If this runaway-ffmpeg check is based on a timeout, it is by definition a race condition, and one that will frequently be lost on resource-constrained systems. But I guess I don't fully understand in which cases ffmpeg actually runs out of control, so I can't give a better idea of how to remove the race condition.

TinyTheBrontosaurus commented 5 years ago

As a workaround, I'm currently adding this to my code:

with mock.patch("imageio_ffmpeg._io.subprocess.Popen.kill"):
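
Spelled out a bit more fully, the workaround looks roughly like this. This is only a sketch: the patched path assumes the current imageio_ffmpeg module layout, and the output path, frame size and frame data are placeholders.

from unittest import mock
import imageio_ffmpeg

frames = [bytes(256 * 256 * 3)] * 10  # placeholder: 10 black rgb24 frames

# While the patch is active, the library's call to Popen.kill() is a no-op,
# so ffmpeg is only waited on, never killed.
with mock.patch("imageio_ffmpeg._io.subprocess.Popen.kill"):
    gen = imageio_ffmpeg.write_frames("output.mp4", (256, 256))
    gen.send(None)  # seed the generator
    for frame in frames:
        gen.send(frame)
    gen.close()
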
TinyTheBrontosaurus commented 5 years ago

More info: my latest run (before I implemented the workaround above) was on my MacBook Pro (2015, High Sierra, Python 3.7) and it still corrupted the video. It created a 1600x2000 video with 3600 frames. Nothing else significant was running on my computer.

almarklein commented 5 years ago

Ok, so this applies to writing a video, correct? I suppose what's happening is that Python fills up the buffer (stdin for ffmpeg) and ffmpeg is unable to process all frames in time. Perhaps reducing the size of this buffer will help ensure that ffmpeg can (nearly) always process the frames in time.

An additional/alternative option could be to add an option to skip this kill step.
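
For the first option, the knob would presumably be the bufsize argument of subprocess.Popen, which controls the Python-side buffering of the stdin pipe (the OS pipe buffer adds its own, separate buffering). A generic sketch follows; this is not imageio-ffmpeg's actual code, and the command line, frame size and output path are made up:

import subprocess

frame_nbytes = 1920 * 1080 * 3  # one rgb24 frame

# Hypothetical ffmpeg invocation; the real command is assembled inside imageio-ffmpeg.
cmd = [
    "ffmpeg", "-y",
    "-f", "rawvideo", "-pix_fmt", "rgb24", "-s", "1920x1080",
    "-i", "-",  # read raw frames from stdin
    "output.mp4",
]

p = subprocess.Popen(
    cmd,
    stdin=subprocess.PIPE,
    bufsize=frame_nbytes,  # Python-side stdin buffer of about one frame
)
# ... write frames to p.stdin, then p.stdin.close() and p.wait()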

TinyTheBrontosaurus commented 5 years ago

Yes, it is only for writing. I'm having no issues on the read side.

jmetz commented 5 years ago

I've just run into this issue in a pretty lightweight scenario (a low frame count of 35, but a large input frame size of 2048x2048).

I upped the timeout to 10s, during which ffmpeg closed normally.

This is on the following machine

Linux 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3 (2019-02-02) x86_64 GNU/Linux

It is not a particularly low-spec machine (8 cores, 32GB RAM).

almarklein commented 5 years ago

If someone wants to have a go at this before I find the time to do so: I think we should try to lower the buffer size of stdin when we open the subprocess, and see if this helps.

jmetz commented 5 years ago

If someone wants to have a go at this before I find the time to do so: I think we should try to lower the buffer size of stdin when we open the subprocess, and see if this helps.

Does that make sense, given that increasing the timeout allows it to close? Or is it that the extra time simply allows the already-buffered data to be flushed?

almarklein commented 5 years ago

Python fills up the buffer with frames, and ffmpeg consumes them. Python will slow down if ffmpeg consumes frames slower than Python adds them. When Python is done, it waits for ffmpeg to finish. So if the buffer is considerably smaller (but large enough to fit at least say 5 frames) then the wait for Python should be negligible (well that's my theory).
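
To put rough numbers on that theory (my own back-of-envelope, assuming 2048x2048 rgb24 frames as reported above):

# One 2048x2048 rgb24 frame:
frame_nbytes = 2048 * 2048 * 3      # 12,582,912 bytes, ~12.6 MB
# Data that may still be waiting for ffmpeg to drain and encode after
# Python has sent its last frame, if the buffer holds ~5 such frames:
print(5 * frame_nbytes / 1e6)       # ~62.9 MB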

jmetz commented 5 years ago

Ah I think I see what you mean - should be a fairly easy thing to test anyway. Might have time this weekend if no-one has a stab at it.

TinyTheBrontosaurus commented 5 years ago

I don't have a great understanding of the issue that makes the kill command needed at all. Can anyone give more info about how to repro that? I've been able to encode thousands of videos without issue by removing the kill.

almarklein commented 5 years ago

We've had reports in earlier versions (in imageio) of daemon ffmpeg processes. Although the chance of this happening is way larger with reading from ffmpeg, I still very much like the idea that one can rely on the fact that ffmpeg has stopped when the generator exits. Even if ffmpeg finishes eventually, you may end up with many ffmpeg processes that compete for resources.

TinyTheBrontosaurus commented 5 years ago

Are there any repros for the read issue? I'm doing a lot of reading as well and haven't seen the message printed.

almarklein commented 5 years ago

I think that this issue applies mostly to writing. When reading, Python first asks ffmpeg to stop and then waits for it to do so; there is no buffer to process first. It's good to hear that you've not seen the message: it means that the way we tell ffmpeg to stop is working well :)

jmetz commented 5 years ago

I can confirm that for me this only happens for very large input frames. I tested with original size frames (read from jpg images) of size 2048x2048, and the issue was present and required a larger timeout.

Then I tested for half size images (1024x1024) and the issue disappears.

Just gonna see if I can narrow in on the size that this first crops up at.

jmetz commented 5 years ago

On my machine, the approximate size at which this occurs (it's not 100% consistent) is between 1864x1864 and 1884x1884, both of which also produce swscaler warnings as well as an automatic alignment to the 16-bit block size. These warnings also occur at smaller sizes that don't result in the "We had to kill ffmpeg to stop it." issue.

IMO, at least as far as it pertains to my situation, this issue is entirely due to ffmpeg just not being able to finalize very large frames within the allocated 2.5 s window.

Given that the process is polled at a 0.01-second interval anyway, the simplest solution would be to increase the timeout to e.g. 5 s in order to accommodate situations with very large frame sizes.

There should be no effect on smaller frame sizes, and at worst, if an actual hang occurs, the process is just killed a little later.

jmetz commented 5 years ago

Simple modification to the included imageio-ffmpeg tests to reproduce:

import numpy as np

...

def test_write_big_frames():

    sizes = []
    for pixfmt, bpp in [("gray", 1), ("rgb24", 3), ("rgba", 4)]:
        # Prepare for writing
        gen = imageio_ffmpeg.write_frames(
            test_file2, (2048, 2048), pix_fmt_in=pixfmt)
        gen.send(None)  # seed
        for i in range(9):
            data = bytes((255 * np.random.rand(2048 * 2048 * bpp)).astype(int))
            gen.send(data)
        gen.close()
        with open(test_file2, "rb") as f:
            sizes.append(len(f.read()))
        # Check nframes
        nframes, nsecs = imageio_ffmpeg.count_frames_and_secs(test_file2)
        assert nframes == 9

NB: Only included numpy because generating the required random bytes with random in a list comprehension was much too slow!

This also highlights that the issue depends on having non-uniform data in the frames, which presumably causes ffmpeg to spend longer in the encoding...

djhoese commented 5 years ago

I've been running in to this a lot lately when trying to generate mp4s of satellite instrument data. Is there currently a workaround for this? If not, would a keyword argument be possible?

jmetz commented 5 years ago

I've been running in to this a lot lately when trying to generate mp4s of satellite instrument data. Is there currently a workaround for this? If not, would a keyword argument be possible?

Yes - you can set the timeout to a higher value: at line 424 in _io.py (https://github.com/imageio/imageio-ffmpeg/blob/master/imageio_ffmpeg/_io.py#L424), increase the hard-coded 2.5 to 10.0 or similar.

As a proposed fix, I would suggest setting this value via a kwarg such that write_frames can be specifically called with a higher timeout to avoid this issue.

djhoese commented 5 years ago

Ok I can add this and make a PR. Would timeout be a little too generic? Maybe ffmpeg_timeout?

jmetz commented 5 years ago

Ok I can add this and make a PR. Would timeout be a little too generic? Maybe ffmpeg_timeout?

Yes - I'd also go with ffmpeg_timeout I think :+1:

djhoese commented 5 years ago

@jmetz I tried adding your test to test_io.py and it is telling me that it's producing 72 frames instead of 9. Any ideas?

Edit: Also, on my mid-2015 MacBook I had to increase the timeout to 10 seconds; 5 seconds still hit the timeout.

djhoese commented 5 years ago

Nevermind, I got the test to pass (see #14) by forcing the dtype to uint8.

TinyTheBrontosaurus commented 5 years ago

Re the workaround: what I posted above has worked fine for me (though it does clutter the code): https://github.com/imageio/imageio-ffmpeg/issues/13#issuecomment-476032276

with mock.patch("imageio_ffmpeg._io.subprocess.Popen.kill"):

That completely eliminates the race condition by removing the kill.

djhoese commented 5 years ago

@TinyTheBrontosaurus IMO this is not a sustainable solution. Users should not have to mock a library to use it as advertised. Mocking should be reserved for testing. I think my pull request linked above is the best long term solution.

TinyTheBrontosaurus commented 5 years ago

Agreed.

I re-posted it because someone requested a workaround, and workarounds typically aren't long-term solutions. It can be applied without waiting for a PR to be approved, merged, released and posted.

almarklein commented 5 years ago

Fixed by #14 and #15.

almarklein commented 5 years ago

A new release is now available on PyPI.

almarklein commented 5 years ago

(and I just updated the conda-forge feedstock too)

imranux commented 5 years ago

I have also faced this problem. After about 67 percent of my video had been encoded, this error message was shown. How can I fix it? The error is: "We had to kill ffmpeg to stop it. Consider increasing ffmpeg_timeout."

TinyTheBrontosaurus commented 5 years ago

Did you try increasing ffmpeg_timeout? It's an argument to imageio_ffmpeg.write_frames. I simply set mine to 0 every time so that the timeout is ignored; I've never hit a situation where that timeout actually helped me.
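
For reference, a minimal sketch of that call; the output path, frame size and frame data are placeholders, and ffmpeg_timeout=0 means the timeout is ignored, as described above:

import imageio_ffmpeg

frames = [bytes(1280 * 720 * 3)] * 30  # placeholder: 30 black rgb24 frames

gen = imageio_ffmpeg.write_frames(
    "output.mp4", (1280, 720), ffmpeg_timeout=0  # 0 = wait for ffmpeg, never kill it
)
gen.send(None)  # seed the generator
for frame in frames:
    gen.send(frame)
gen.close()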

djhoese commented 5 years ago

@mdimranuddin What version of imageio are you using? If you are using anything older than 2.5.0 then this error/warning is much more likely to happen. Otherwise, try ffmpeg_timeout as @TinyTheBrontosaurus has suggested.

imranux commented 5 years ago

@djhoese @TinyTheBrontosaurus Thank you both. I'm using an Anaconda Navigator environment with the Spyder IDE, and the latest version of imageio (2.5.0). After increasing ffmpeg_timeout it works smoothly, although slowly because of my RAM. But it works; thanks again to both of you.

OrganicIrradiation commented 4 years ago

I'm receiving this warning when using imageio.get_writer() and writer.append_data() for a large video being saved to a network share. How would I go about passing ffmpeg_timeout to increase (or ignore) the timeout?

djhoese commented 4 years ago

@OrganicIrradiation What version of imageio are you using? Are you writing excessively large frames? I believe ffmpeg_timeout can be passed to append_data, IIRC.

OrganicIrradiation commented 4 years ago

@djhoese v2.5.0 from PyPI.

I just took a look at https://github.com/imageio/imageio/blob/master/imageio/plugins/ffmpeg.py: _initialize() does not parse or pass through ffmpeg_timeout, and neither does _append_data(). I confirmed this by adding ffmpeg_timeout=0 to the imageio.get_writer() call, the writer.append_data() call, and writer.close(); all gave me an unexpected keyword argument 'ffmpeg_timeout' exception.

djhoese commented 4 years ago

You are right. The new default timeout just happened to be good enough for most people. I'm curious what size data you are writing that is hitting that timeout?

Anyway, if you'd like to make a pull request I think you would have to add the option to _open and then pass it along after that: https://github.com/imageio/imageio/blob/master/imageio/plugins/ffmpeg.py#L247-L258

OrganicIrradiation commented 4 years ago

@djhoese I've temporarily changed my imageio.* calls to imageio_ffmpeg.* to allow setting the timeout. Regarding the data size, I'm not sure where the problem is arising, since the timeout is not raising an exception and the warning is shown in the middle of my script. I'll try to investigate further, but this is a temporary workaround.
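
In case it helps anyone else, the swap looks roughly like this; paths, sizes, fps and the timeout value are placeholders, and it assumes the keyword arguments write_frames currently accepts:

import numpy as np
import imageio_ffmpeg

frames = [np.zeros((720, 1280, 3), dtype=np.uint8)] * 30  # placeholder frames

# Before: writer = imageio.get_writer("out.mp4", fps=25)
#         writer.append_data(frame)  # one uint8 ndarray per frame
#         writer.close()

# Temporary replacement, calling imageio_ffmpeg directly so the timeout can be set:
gen = imageio_ffmpeg.write_frames("out.mp4", (1280, 720), fps=25, ffmpeg_timeout=60)
gen.send(None)  # seed the generator
for frame in frames:
    gen.send(frame.tobytes())  # write_frames takes raw bytes rather than ndarrays
gen.close()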