microsoft / FFmpegInterop

This is a code sample to make it easier to use FFmpeg in Windows applications.

End of stream situations not handled correctly (last frames are dropped) #217

Open lukasf opened 6 years ago

lukasf commented 6 years ago

We currently do not handle end of stream correctly for uncompressed sample providers. The decoders might contain buffered data at end of stream, which gets discarded right now.

Please check the following from ffmpeg docs:

End of stream situations. These require "flushing" (aka draining) the codec, as the codec might buffer multiple frames or packets internally for performance or out of necessity (consider B-frames). This is handled as follows:

Instead of valid input, send NULL to the avcodec_send_packet() (decoding) or avcodec_send_frame() (encoding) functions. This will enter draining mode.

Call avcodec_receive_frame() (decoding) or avcodec_receive_packet() (encoding) in a loop until AVERROR_EOF is returned. The functions will not return AVERROR(EAGAIN), unless you forgot to enter draining mode.

Before decoding can be resumed again, the codec has to be reset with avcodec_flush_buffers().

https://www.ffmpeg.org/doxygen/3.4/group__lavc__encdec.html

Currently, we just stop decoding when we do not get a packet anymore. Instead, we have to send a NULL packet to the decoder and then drain it. Only then are we really at end of stream.
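
In code, the drain sequence from the docs looks roughly like this (a minimal sketch of the documented protocol, not our current code; ctx and frame stand for a valid AVCodecContext* and AVFrame*):

    int ret = avcodec_send_packet(ctx, NULL);   // a NULL packet enters draining mode
    while (ret >= 0)
    {
        ret = avcodec_receive_frame(ctx, frame);
        if (ret >= 0)
        {
            // ...create and deliver a sample from the buffered frame...
            av_frame_unref(frame);
        }
    }
    // ret is now AVERROR_EOF (fully drained) or a real decode error.
    avcodec_flush_buffers(ctx);                 // required before decoding can resume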

brabebhin commented 6 years ago

Hmm. Regarding this, we also need to take into account the encoder trailing padding. Most files end with a few seconds of silence.

lukasf commented 6 years ago

True, but that should be handled separately. It is also only relevant for a few audio codecs, while this problem here is relevant for most codecs and results in loss of output data.

brabebhin commented 6 years ago

Ok. How shall we proceed? Do you want to do the honours in your branch or maybe I should do it? I am unavailable for dev until the weekend.

lukasf commented 6 years ago

After looking into the code, I think we need to refactor the MediaSampleProvider class quite a bit, more than I expected. Right now, it tries to provide a lot of common infrastructure for both compressed and uncompressed streams. But the thing is, compressed and uncompressed streams have very different ways to process data, and adding in the decoder draining could get pretty messy. It would probably be a lot better to remove most of the processing methods there and make GetNextSample abstract. Then add a CompressedSampleProvider base class for all compressed streams, which does a simple processing loop based on packets. And there is already the UncompressedSampleProvider base class, which needs to get a different processing loop based on frames, not on packets, including proper end of stream handling.
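
Roughly the shape I have in mind (just an illustrative sketch of the hierarchy, not working code):

    class MediaSampleProvider
    {
    public:
        virtual MediaStreamSample^ GetNextSample() = 0; // now abstract
        // ...shared infrastructure (stream info, packet queue, etc.) stays here...
    };

    // Simple packet-based loop: one AVPacket becomes one sample.
    class CompressedSampleProvider : public MediaSampleProvider
    {
    public:
        MediaStreamSample^ GetNextSample() override;
    };

    // Frame-based loop: feed packets on demand, drain the decoder at end of stream.
    class UncompressedSampleProvider : public MediaSampleProvider
    {
    public:
        MediaStreamSample^ GetNextSample() override;
    };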

And while doing that, I would also propose to remove DataWriter from all base methods. If a sample provider needs it, it can just create one. But there is no need to force that into every stream. Even for most CompressedSampleProviders, we could use the direct buffer approach quite easily: AVPackets are ref counted, so we can just hand out their buffer and defer the unref until our NativeBuffer is destroyed.
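
For illustration (sketch only - a NativeBufferFactory overload taking a cleanup callback is an assumption here, not an existing API):

    // Bump the packet's refcount instead of copying its data.
    AVPacket* pPacket = av_packet_clone(avPacket);
    auto buffer = NativeBufferFactory::CreateNativeBuffer(
        pPacket->data, pPacket->size,
        [pPacket]() mutable
        {
            // Deferred unref: runs only when the WinRT buffer is destroyed.
            av_packet_free(&pPacket);
        });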

But this is a bigger change. I am not sure if we should do it in the already large direct-buffer branch. Also, there are at least two PRs which improve presentation time handling. They should be merged before we refactor these things.

brabebhin commented 6 years ago

If I am not mistaken, the uncompressed sample providers get their frame from the MediaSampleProvider above them. I do not think we really need to alter that much.

But before that, we could use some files, preferably encoded with different properties, to test this. How did you discover that we do not feed the last samples of the stream? Simply by reading the docs, or did you find out via file playback?

lukasf commented 6 years ago

Exactly that is the problem: UncompressedSampleProvider gets an AVPacket from MediaSampleProvider, feeds it to the decoder, and then gets a frame and creates the resulting MediaStreamSample. But at end of stream, we'd need to first send a null packet to the decoder. Then we'd need to get multiple frames from the decoder and create multiple samples, without feeding a single packet to the decoder or down the pipeline. MediaSampleProvider does not even do anything when its packet queue is empty. We'd need to tell it to first pass down a NULL packet (which would probably cause errors in the pipeline), and after that, subsequent GetNextSample calls should not pass any AVPacket to the decoder, yet we'd still need to get frames from the decoder and create samples. This does not fit together at all. It would be a very dirty hack if we somehow tricked this into the current pipeline. At least, I wouldn't want to do that. The only really clean solution is to separate the compressed and uncompressed processing loops.

I discovered it by reading the docs, but I also saw myself that we need to feed lots of packets into the decoder before the first frame can be read. After that, we usually get one frame per packet (for video codecs). So it is obvious that the decoder has multiple packets in its buffer. There are even technical reasons why it must buffer frames (B-frames), but also performance reasons (preventing delays due to HDD reads, and multi-threaded decoding will decode multiple frames ahead of time). If we just stop playback on the last AVPacket, all remaining packets in the decoder's buffer are discarded and the partially decoded frames are never read out. I am pretty sure that this will happen with more or less every file and every codec. Also, the ffmpeg docs are very clear about this. This is how end of stream must be handled. There is no other way to get the remaining frames out of the decoder.

brabebhin commented 6 years ago

Or maybe we can create a separate class for this. Call it a "FrameProvider". This should expose a new method called "GetNextFrame()" which abstracts away the code necessary for this, and we can handle the end of stream separately. GetNextFrame will return an AVFrame ready to be transformed into a MediaStreamSample.

Then, when the existing sample providers call "GetFrameFromFFMPEG" or whatever the name of that method is, we call into this frame provider instead. Each SampleProvider should then know what to do with the frame (if compressed, it simply passes its data on; if uncompressed, it proceeds to decode it).

This way we also separate the FFmpeg reading part from the actual decoding.

GetNextFrame would work the same way MediaSampleProvider currently works when getting the next frame; the only difference is that when it encounters a null from the packet queue, it goes a different route and continues to provide samples from the decoder buffer until it is all over.
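
Roughly this shape (names and members are illustrative only):

    #include <queue>
    extern "C"
    {
    #include <libavcodec/avcodec.h>
    }

    class FrameProvider
    {
    public:
        // Returns the next decoded AVFrame, ready to be transformed into a
        // MediaStreamSample, or nullptr once the decoder is fully drained.
        AVFrame* GetNextFrame();

    private:
        AVCodecContext* m_pAvCodecCtx;        // decoder to pull frames from
        std::queue<AVPacket*> m_packetQueue;  // packets read ahead from the file
        bool m_isDraining = false;            // set once the packet queue hands out null
    };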

I think this is the least intrusive way of changing it. And the next time we find something weird about the ffmpeg readers, we would only need to modify one class instead. We can go as far as easily adding audio or video effects at this stage.

What do you think?

I do not think it is best to separate the uncompressed from the compressed sample providers. The current infrastructure allows us to do transcoding of compressed samples, even though it is not yet ready for this. For example, you may want to convert a FLAC compressed sample to MP3, or the other way around, depending on what kind of hardware acceleration is possible. IIRC, the difference in handling compressed versus uncompressed frames is that compressed frames are fed "as is" to the media pipeline, whereas uncompressed ones go through the additional step of "DecodeAvFrame", a method which is present in MediaSampleProvider but skipped in the case of compressed samples (and overridden by the uncompressed providers). I think the current approach offers more flexibility, and this is probably part of the reason Microsoft made it this way, aside from reusing code.

lukasf commented 6 years ago

Maybe I was not very clear about how I imagine it. There would still be the base class MediaSampleProvider, it would still have GetNextSample, plus some base methods that will be used by both compressed and uncompressed streams. Just the processing loop will be moved into the more specific base classes CompressedSampleProvider and UncompressedSampleProvider. So I do not think you will lose any functionality you might require for transcoding. You can still get data from every stream through MediaSampleProvider::GetNextSample, no matter if it is internally a compressed or uncompressed sample provider.

You also need to differentiate between frames and packets. Compressed sample providers do not use frames at all; for them it's all about packets. They read a packet, some slightly modify it, then they write the packet into the output sample. One packet, one sample. For uncompressed sample providers, it's all about frames. You read a frame from the decoder and create a sample from it. Only if the decoder returns EAGAIN do you need to read a packet and push it to the decoder, then you try to read the frame again. If you do not have any packets anymore, push NULL to the decoder, then get the next frame. Only when the decoder returns end of stream are you finished. Sometimes you need a packet to produce a frame, sometimes you need multiple packets, sometimes you need a NULL packet, sometimes you need no packet at all.
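
A minimal sketch of that frame-driven loop (ReadNextPacket is a hypothetical helper returning the next AVPacket or NULL at end of stream; the HRESULT codes are illustrative):

    HRESULT GetNextFrame(AVCodecContext* ctx, AVFrame* frame)
    {
        for (;;)
        {
            int ret = avcodec_receive_frame(ctx, frame);
            if (ret == 0)
                return S_OK;                  // got a frame, create a sample from it
            if (ret == AVERROR_EOF)
                return MF_E_END_OF_STREAM;    // decoder fully drained, stream is done
            if (ret != AVERROR(EAGAIN))
                return E_FAIL;                // real decode error

            // The decoder wants input: feed the next packet, or NULL to start draining.
            AVPacket* packet = ReadNextPacket();
            int sendResult = avcodec_send_packet(ctx, packet);
            av_packet_free(&packet);          // no-op if packet is NULL
            if (sendResult < 0 && sendResult != AVERROR_EOF)
                return E_FAIL;
        }
    }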

The processing loop in MediaSampleProvider is currently packet based. It always starts with GetNextPacket. Only if one is found, it calls DecodeAVPacket, then it calls WriteAVPacketToStream. This works well for compressed streams, but it is very hard to fake the frame-based uncompressed loop into this packet based loop.

There is even one more possible bug in the current UncompressedSampleProvider; I noticed it earlier already. When pushing a packet into the decoder, the decoder can return EAGAIN if its buffer is full. This is not handled and would abort the stream. A comment says the decoder should have been drained, but we only ensure that in the audio decoder. For video decoders, it is more or less by chance that this does not lead to errors. This is again caused by forcing a packet-based loop. It would be better to always focus on frames for uncompressed streams and only ever read and push a packet when requested by the decoder.

brabebhin commented 6 years ago

Regarding the EAGAIN, I am pretty sure that was taken into account.

        if (decodeFrame == AVERROR(EAGAIN))
        {
            // The decoder doesn't have enough data to produce a frame,
            // return S_FALSE to indicate a partial frame
            hr = S_FALSE;
            av_frame_unref(pFrame);
            av_frame_free(&pFrame);
        }
        else if (decodeFrame < 0)
        {
            hr = E_FAIL;
            av_frame_unref(pFrame);
            av_frame_free(&pFrame);
            DebugMessage(L"Failed to get a frame from the decoder\n");
        }

I do not think we will ever be in the case of filling its buffer completely. We DO consume frames; according to the docs, EAGAIN is raised only if you do not consume frames but keep attempting to send packets into the decoder.

For example, you can call avcodec_send_packet() repeatedly without calling avcodec_receive_frame(). In this case, avcodec_send_packet() will succeed until the codec's internal buffer has been filled up (which is typically of size 1 per output frame, after initial input), and then reject input with AVERROR(EAGAIN). Once it starts rejecting input, you have no choice but to read at least some output.

For the moment, I do not know for sure which is the best approach. At first glance you may be right, but I really need to test some stuff with the debugger attached and see exactly how it behaves. I know the logic is split between several layers, so some cases are not always handled at the same level of abstraction, which can be a bit confusing. I am still not 100% convinced we really need to make huge changes to how the sample providers work. So far, I have never had files finishing in the middle of the stream, nor users complaining that they are missing frames. The only issue I did encounter was something with real-time playback skipping frames, but I am 99.99999% sure this was simply a bug in mobile MF. And I have been using this framework in production for 2 years now.

Obvious bugs such as #198 , #214 , #195 , #121 were visible quite fast on the other hand...

brabebhin commented 6 years ago

I still think this can be done by just adding to the current structure: when the FFmpeg reader returns null, which currently signals EOF and causes the last nullptr sample in the stream, we set the SampleProviders to enter "flush mode". The AVPacket used to send data to the ffmpeg decoder can safely be unrefed and disposed of at this point, since it will not be used anymore. In "flush mode", GetFrameFromFFmpegDecoder will call avcodec_send_packet with a NULL avPacket instead.

I wonder if avcodec_free_context would also unref packets buffered inside it.

brabebhin commented 6 years ago

Something along the lines of this


while (SUCCEEDED(hr) && !frameComplete)
{
    // Continue reading until there is an appropriate packet in the stream
    while (m_packetQueue.empty() && !m_isInFlushMode)
    {
        if (m_pReader->ReadPacket() < 0)
        {
            DebugMessage(L"GetNextSample reaching EOF\n");
            m_isInFlushMode = true;
            break;
        }
    }

    if (!m_packetQueue.empty() || m_isInFlushMode)
    {
        if (m_isInFlushMode)
        {
            // Drain the decoder: pass NULL instead of a real packet
            hr = DecodeAVPacket(writer, NULL, framePts, frameDuration);
        }
        else
        {
            // Pick the packets from the queue one at a time
            avPacket = PopPacket();
            framePts = avPacket.pts;
            frameDuration = avPacket.duration;

            // Decode the packet if necessary, it will update the presentation time if necessary
            hr = DecodeAVPacket(writer, &avPacket, framePts, frameDuration);
        }
        frameComplete = (hr == S_OK);

        if (!frameComplete)
        {
            m_isDiscontinuous = true;
            if (allowSkip && errorCount++ < 10)
            {
                // skip a few broken packets (maybe make this configurable later)
                DebugMessage(L"Skipping broken packet\n");
                hr = S_OK;
            }
        }
    }
}

lukasf commented 6 years ago

I was talking about this:

        int sendPacketResult = avcodec_send_packet(m_pAvCodecCtx, avPacket);
        if (sendPacketResult == AVERROR(EAGAIN))
        {
            // The decoder should have been drained and always ready to access input
            _ASSERT(FALSE);
            hr = E_UNEXPECTED;
        }
        else if (sendPacketResult < 0)
        {
            // We failed to send the packet
            hr = E_FAIL;
            DebugMessage(L"Decoder failed on the sample\n");
        }

I did not say that this causes problems. It probably works and I have not seen it fail. But here we just assume that we can always push a new packet to the decoder, only because we have read one frame before. That is probably true, but it is not guaranteed anywhere, and there is no good way to recover from this (because we cannot just discard the packet). The ffmpeg docs say that if we get EAGAIN here, we must read frames from the decoder; only then may we push the next packets. We cannot do that here, so we just abort with an error and hope it will never occur. Having a different decoding loop would solve this in a safe way, instead of relying on implicit assumptions about decoders.

Sure, the end of stream handling can be hacked into the current solution somehow. But even the current solution is already pretty messy, if you ask me. Too many levels of indirection and too many places to hook into it to satisfy the very different requirements. And it will get even messier if we add more special cases and workarounds. The question is why we try to force this into one loop when there are so many differences and things that obviously do not fit together. Reuse is good, but only if there is really common functionality. If the processing loops are very different, we do not gain anything by forcing them into the same base loop and then adding lots of workarounds and complicated corner cases to make it work somehow. The code would be much easier to read and understand if we just had two different implementations for the two very different ways to process data. [Plus, it would allow us to go for direct buffer even for audio streams, which is currently not possible due to the enforced packet-based loop.]

brabebhin commented 6 years ago

AH, different code snippets for different things 📦

Regarding the EAGAIN, we can surely get a frame there, we do it just a few lines further down (it is simply the role of that method, after all), like so:


        int sendPacketResult = avcodec_send_packet(m_pAvCodecCtx, avPacket);
        if (sendPacketResult == AVERROR(EAGAIN))
        {
            // The decoder's input buffer is full: read a frame out first
            AVFrame *pFrame = av_frame_alloc();
            decodeFrame = avcodec_receive_frame(m_pAvCodecCtx, pFrame);
            // ...hand the frame on, free it, then retry sending the packet...
            av_frame_free(&pFrame);
        }
        else if (sendPacketResult < 0)
        {
            // We failed to send the packet
            hr = E_FAIL;
            DebugMessage(L"Decoder failed on the sample\n");
        }

In the end, I guess it is up to @khouzam to decide how this goes on, because this change is a little deeper than any other PR so far. But the way I see it, we already have a good infrastructure for what we need.

I agree with your assertion on the levels of indirection, it is a bit on the hard side, but I grew to like it since it is easy to extend with new features. As I said, @khouzam is probably the best one to decide on this. I had a lot of weird moments the first time I had to debug something here, but my guess is that the old-style C++/.h file structure and the lack of the "override" keyword also make it hard to understand how this works.

At most we can create some new methods or classes to handle some stuff separately, and reduce some coupling, but I do not think the class hierarchy needs a change. You will never be able to find a "one size fits all" formula anyway.

lukasf commented 6 years ago

But what do you do with the packet then? Sure, just add some more workarounds...

I am not afraid to refactor things if I see a need for it. There is no reason to keep old code just because it is there. And it is not like this is a lot of code or a very difficult refactoring. In my job we do much bigger refactorings (if we get time to do it - which is usually not the case ;) ). Of course, we'd need to agree on whether it makes sense and how the new design should look. Especially @khouzam would need to agree.

If I find some time, I will make a proposal in a branch. Then we can discuss based on actual code, if it is an improvement or not. I just see a lot of reasons to refactor this. There is the broken end of stream handling, there is the EAGAIN issue we discussed, there is the currently forced, inefficient DataWriter, there is the already difficult call hierarchy, there is the impossibility to add direct buffer for audio... Lots of reasons to change this and make a new clean solution, instead of making it even more complicated.

brabebhin commented 6 years ago

The way I see it, there is no way to avoid these "corner cases". It is simply the way FFMPEG works and I don't see how you can avoid it.

End of stream and buffer saturation can be treated the same way incomplete frames are treated: using some HRESULT from the methods encountering the problem and solving it further up the hierarchy. The way I see it, the current structure is pretty smart: MediaSampleProvider does the work at the packet level (and also handles the compressed cases, which happen to work at the packet level too), and UncompressedSampleProvider works at the frame level (where one packet can contain 0 or more complete frames).

Sure, I will summon @khouzam once again so he can come in and say what he thinks about it :P after all, mail spam isn't all that bad.