radek-k / FFMediaToolkit

FFMediaToolkit is a cross-platform video decoder/encoder library for .NET that uses FFmpeg native libraries. It supports video frame extraction, reading stream metadata, and creating videos from bitmaps in any format supported by FFmpeg.
MIT License

Audio #48

Closed · kskalski closed this 3 years ago

kskalski commented 3 years ago

Implements reading audio streams for https://github.com/radek-k/FFMediaToolkit/issues/33

radek-k commented 3 years ago

Thank you very much for the audio implementation. I'm going to release it next week (as v4.0).

kskalski commented 3 years ago

Note that this implementation is quite limited: handling codecs, reading video and audio at the same time, and writing are still missing.

IsaMorphic commented 3 years ago

@kskalski Hello! Would you be able to give more details about what remains to be implemented? Above you mention that "handling codecs, reading video and audio at the same time, writing is missing".

I am unsure what you mean by the first two. If you could clarify what exactly these mean, I may be able to implement the rest, as I need this feature for a project I want to start working on.
Thanks in advance.

kskalski commented 3 years ago

About codecs: right now the implementation assumes that the input (or, more precisely, the frames read from FFmpeg) uses the AV_SAMPLE_FMT_FLT format (see https://github.com/radek-k/FFMediaToolkit/blob/b097b974a9193e48591165f155cfd7e1597c9036/FFMediaToolkit/Common/Internal/AudioFrame.cs#L77 - it just assumes we can read a series of floats). I'm not really sure if that is always the case, or if it has to be (e.g. if the user wants the data in another format, can they specify so?) - maybe we need to support other sample formats here. I'm also not sure whether it's possible to support scenarios where we read still-encoded data, say getting the data in FLAC encoding instead of raw floats, possibly even when the input file/stream doesn't use that encoding. That seems like a separate / extra feature though.
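For illustration, here is a minimal sketch (not the actual AudioFrame code) of what that assumption means in practice when copying samples out of a decoded FFmpeg.AutoGen frame:

```csharp
using System;
using FFmpeg.AutoGen;

public static class PackedFloatSamples
{
    public static unsafe float[][] Read(AVFrame* frame)
    {
        // The current implementation effectively assumes this; planar or integer
        // formats (FLTP, S16, ...) would first need a libswresample conversion.
        if ((AVSampleFormat)frame->format != AVSampleFormat.AV_SAMPLE_FMT_FLT)
            throw new NotSupportedException("Only packed 32-bit float samples are handled here.");

        int channels = frame->channels;
        int sampleCount = frame->nb_samples;

        // Packed (non-planar) float audio keeps all channels interleaved in data[0].
        float* samples = (float*)frame->data[0];

        var result = new float[channels][];
        for (int ch = 0; ch < channels; ch++)
        {
            result[ch] = new float[sampleCount];
            for (int i = 0; i < sampleCount; i++)
                result[ch][i] = samples[i * channels + ch];
        }

        return result;
    }
}
```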

About reading video and audio at the same time: it's a problem with the current implementation in https://github.com/radek-k/FFMediaToolkit/blob/develop/FFMediaToolkit/Decoding/Internal/InputContainer.cs, which has a kind of optimization (?) using private readonly MediaPacket packet; - right now this field is "shared" between the audio and video decoders. ;-) I thought about changing it or abstracting out the "reused packet" feature, but decided it wasn't necessary for my use case and would probably require a larger refactoring. You would also need to look at how exactly reading both video and audio should be exposed to the user, since I suppose that for best performance the readers need to follow the interleaved way video and audio are laid out in the input stream.
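A rough sketch of what I mean (names are illustrative, not the actual InputContainer API): each packet read from the format context is routed to the decoder owning its stream instead of going through one shared MediaPacket.

```csharp
using FFmpeg.AutoGen;

public static class PacketDispatcher
{
    public static unsafe bool ReadAndDispatch(
        AVFormatContext* format,
        AVCodecContext* videoCodec, int videoStreamIndex,
        AVCodecContext* audioCodec, int audioStreamIndex)
    {
        AVPacket* packet = ffmpeg.av_packet_alloc();
        try
        {
            if (ffmpeg.av_read_frame(format, packet) < 0)
                return false; // end of stream (or read error)

            // av_read_frame returns packets in the order they are interleaved in the
            // file, so each call may produce a packet for either stream.
            if (packet->stream_index == videoStreamIndex)
                ffmpeg.avcodec_send_packet(videoCodec, packet);
            else if (packet->stream_index == audioStreamIndex)
                ffmpeg.avcodec_send_packet(audioCodec, packet);

            return true;
        }
        finally
        {
            // avcodec_send_packet keeps its own reference, so the caller's packet
            // can be released right away.
            ffmpeg.av_packet_unref(packet);
            ffmpeg.av_packet_free(&packet);
        }
    }
}
```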

So in general there are a couple of things to investigate:

- whether the decoded frames are always in AV_SAMPLE_FMT_FLT, and whether other sample formats (or a user-selected output format) should be supported,
- whether reading still-encoded audio data (e.g. FLAC instead of raw floats) makes sense as a separate feature,
- how to read video and audio at the same time: replacing or abstracting the shared MediaPacket, and how interleaved reading should be exposed to the user.

IsaMorphic commented 3 years ago

Thank you for the swift and detailed response.
I did some looking into FFmpeg's API, and from what I can gather it appears that the data FFmpeg gives back is always raw 32-bit floating-point PCM (much like how all video codecs give you back an image in some form or another). So it seems like this would mostly be a non-issue. However, I do think you have a valid point that users should be able to accept both raw input (maybe as a typical byte stream) and decoded input.

To handle interleaving, I think it would make the most sense to create an AVStream class that would essentially allow the user to get the "next" audio and video frames wrapped together in a frame object. FFmpeg doesn't appear to have a specific API surface for reading interleaved data - the next frame you get from the demuxer is either a video OR an audio frame - so some kind of buffering would have to be done to keep frames in memory until you come across a frame of the right "type".
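Roughly something like this is what I have in mind (just a sketch with made-up names, not an existing API):

```csharp
// Every frame coming out of the demuxer is either video or audio, so the wrapper
// tags it with its kind and lets the caller buffer frames by type.
public enum FrameKind { Video, Audio }

public sealed class DemuxedFrame
{
    public FrameKind Kind { get; }
    public byte[] VideoPixels { get; }     // set when Kind == Video
    public float[] AudioSamples { get; }   // set when Kind == Audio

    public DemuxedFrame(byte[] videoPixels)
    {
        Kind = FrameKind.Video;
        VideoPixels = videoPixels;
    }

    public DemuxedFrame(float[] audioSamples)
    {
        Kind = FrameKind.Audio;
        AudioSamples = audioSamples;
    }
}
```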

I probably have some of the specifics of this incorrect, but I'm sure misunderstandings will reveal themselves as I encounter issues.

I'll start a pull request as soon as possible and reference this PR and the associated issue in it.

kskalski commented 3 years ago

Sounds right. For the interleaved reading, I suppose we could expose the two streams similarly to how it works now, but internally keep track of the position in each stream and provide some reasonable buffering, so that when the user asks for several new video frames but we encounter audio frames (and audio was ever requested), we keep the audio frames on the side (up to some limit).
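Something along these lines (just a sketch with placeholder types and a placeholder demux callback, not a proposal for the actual API):

```csharp
using System;
using System.Collections.Generic;

public sealed class InterleavedReader<TVideo, TAudio>
{
    private const int MaxBuffered = 64; // arbitrary cap so a side buffer cannot grow forever

    private readonly Queue<TVideo> videoBuffer = new Queue<TVideo>();
    private readonly Queue<TAudio> audioBuffer = new Queue<TAudio>();

    // Stands in for "read one packet and decode it"; returns which stream it came from.
    private readonly Func<(bool isVideo, TVideo video, TAudio audio)> demuxNext;

    public InterleavedReader(Func<(bool isVideo, TVideo video, TAudio audio)> demuxNext)
        => this.demuxNext = demuxNext;

    // Returns the next video frame; audio frames met on the way are parked for later.
    public TVideo NextVideoFrame()
    {
        if (videoBuffer.Count > 0)
            return videoBuffer.Dequeue();

        while (true)
        {
            var (isVideo, video, audio) = demuxNext();
            if (isVideo)
                return video;
            if (audioBuffer.Count < MaxBuffered)
                audioBuffer.Enqueue(audio);
        }
    }

    // Symmetric: returns the next audio frame, parking video frames that arrive first.
    public TAudio NextAudioFrame()
    {
        if (audioBuffer.Count > 0)
            return audioBuffer.Dequeue();

        while (true)
        {
            var (isVideo, video, audio) = demuxNext();
            if (!isVideo)
                return audio;
            if (videoBuffer.Count < MaxBuffered)
                videoBuffer.Enqueue(video);
        }
    }
}
```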