mbebenita / Broadway

A JavaScript H.264 decoder.

How do I extract frames from a video #241

Closed. AnilSonix closed this issue 1 year ago

AnilSonix commented 2 years ago

Can anyone provide a code sample to extract all the frames from a video? I'm not able to get it done via Decoder.js. (I couldn't find it in the docs.)

J7a4s0m5ine commented 1 year ago

This would be fairly easy if you understand what an H.264 stream "looks like" and its format/structure.

See this SO question.

A start code, either 0x000001 or 0x00000001, is placed at the beginning of each NAL unit.

To extract the frames you would read the stream until you find the beginning of the next NAL unit. So you have a start byte where you identified the end or start of a frame, let's say it's byte 256. You then continue reading the stream until you find the next 0x000001 or 0x00000001, which signifies the beginning of the next frame. Let's say this start code is found at byte 512. You now know there is a fully encapsulated frame between bytes 256 and 512 in the stream, and the next frame starts at byte 512.

From this point it's all data and memory management on where and how you want to save the extracted frames.
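
A minimal sketch of the scan described above, assuming the whole stream is already in memory as a Uint8Array. The names findStartCodes and splitNalUnits are hypothetical helpers made up for this illustration; they are not part of Broadway.

```javascript
// Hypothetical helper: returns the offset of every start code in `bytes`.
// A 4-byte code (00 00 00 01) is reported at its first zero byte.
function findStartCodes(bytes) {
  const offsets = [];
  let i = 0;
  while (i + 2 < bytes.length) {
    if (bytes[i] === 0 && bytes[i + 1] === 0) {
      if (bytes[i + 2] === 1) {                          // 00 00 01
        offsets.push(i);
        i += 3;
        continue;
      }
      if (bytes[i + 2] === 0 && bytes[i + 3] === 1) {    // 00 00 00 01
        offsets.push(i);
        i += 4;
        continue;
      }
    }
    i += 1;
  }
  return offsets;
}

// Hypothetical helper: cuts the stream into NAL units, start code included.
// Each unit runs from one start code up to the next (or to the end of the data).
function splitNalUnits(bytes) {
  const offsets = findStartCodes(bytes);
  return offsets.map((start, n) => {
    const end = n + 1 < offsets.length ? offsets[n + 1] : bytes.length;
    return bytes.subarray(start, end);
  });
}

// Made-up test data: three NAL units with mixed start-code lengths.
const stream = new Uint8Array([
  0, 0, 0, 1, 0x67, 0xaa,        // 4-byte start code + 2 bytes
  0, 0, 1, 0x68, 0xbb,           // 3-byte start code + 2 bytes
  0, 0, 0, 1, 0x65, 0xcc, 0xdd,  // 4-byte start code + 3 bytes
]);

console.log(findStartCodes(stream));                     // [ 0, 6, 11 ]
console.log(splitNalUnits(stream).map((u) => u.length)); // [ 6, 5, 7 ]
```

subarray returns views on the original buffer rather than copies, so copy a unit (for example with slice) if you need to keep it after the source buffer is reused.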

AnilSonix commented 1 year ago

Thanks for replying. Looks like I'm not smart enough ☺️ to understand this fully. Could you point me to where I can get started with video processing, codecs, etc. in general? This is very new to me.

soliton4 commented 1 year ago

https://github.com/soliton4/nodeMirror/blob/cf7db884e61919f1efec92fb3601585c8e3c8f12/src/avc/Wgt.js#L291

This is an example where I decode a raw H.264 stream and split it on the 0x00000001 markers. It is best to feed complete NALs to the decoder; however, I believe the latest version does NAL splitting internally.

J7a4s0m5ine commented 1 year ago

@AnilSonix

Thanks for replying. Looks like I'm not smart enough ☺️ to understand this fully

Sorry, I didn't mean it in that manner. When I first approached video streams, and encoding/decoding, it all looked alien. My point was more that if you understand the underlying streams you'll be able to understand what's going on with this mess of media coding and containerization. It's a fairly complex topic, and as our needs of bandwidth savings increase, it will continue to become more complex. But let's forget about that for now and I'll get back on topic:

Finding frames - your original question

Finding the H.264 NALUs (frames) is essentially searching through an array buffer for a pattern; in this case the pattern is the three-byte or four-byte start code. If you don't find the pattern in the current stream data, you append that data to an intermediary buffer and continue reading the source stream. This is one pattern for frame parsing. The frame separator can also differ depending on the H.264 stream format (for example Annex B vs AVCC); see the links and information below.

https://github.com/soliton4/nodeMirror/blob/cf7db884e61919f1efec92fb3601585c8e3c8f12/src/avc/Wgt.js#L291 this is an example where i decode a raw h264 stream and split it on the 0x00000001 markers

@soliton4's linked code is a great example of parsing a stream for NALUs.

1. Loop over the incoming data stream to find a start sequence (0x00, 0x00, 0x00, 0x01)

To map their code onto what I said above: they get the data from the media source and loop through that array (https://github.com/soliton4/nodeMirror/blob/cf7db884e61919f1efec92fb3601585c8e3c8f12/src/avc/Wgt.js#L305) to find the start sequence. When no start code is found, they append the data to the temp buffer and continue reading from the stream.

I'm oversimplifying here a bit; their code is multithreaded, possibly using web workers, and there's a little more going on than what I alluded to. There's also some logic to handle the case where nothing currently exists in the temp buffer and we find a NALU, which would mean it's either the first frame, or we already sent the previous frame to the decoder by the time the start sequence was found. But for all intents and purposes I can simplify the explanation a bit.

I've added some comments for clarity and explanation:

  var b = 0;
  var l = data.length;       // length of the incoming chunk
  var zeroCnt = 0;           // number of contiguous zero bytes seen so far
  for (b; b < l; ++b) {
    if (data[b] === 0) {
      zeroCnt++;
    } else {
      if (data[b] === 1) {
        if (zeroCnt >= 3) {  // at least 3 contiguous zeros followed by a 1: a 4-byte start code
          hit(b - 3);        // pass the offset of the start code so "hit" can combine the temp buffer into a frame
          break;
        }
      }
      zeroCnt = 0;
    }
  }
  // foundHit is set to true inside the "hit" function shown in step 2
  if (!foundHit) {
    this.bufferAr.push(data);  // no start code was found, keep pushing data to the temp buffer
  }

2. So a start code was found while we were looping

When a start code is found, we note the exact position where it occurs (the offset in the data buffer) and create a subarray with everything leading up to that offset. Everything before the offset is a part of the previous frame, so it is concatenated with the existing temp buffer (bufferAr) and sent to the decoder as a whole frame. The temp buffer is then cleared, and everything from the offset onward in the original stream buffer is pushed to the temp buffer to start the loop process over again.

var hit = function(offset){
  foundHit = true;

  // push everything before the start code; it belongs to the previous frame
  self.bufferAr.push(data.subarray(0, offset));
  // concatenate the buffered pieces and send them to the decoder
  self.decode(concatUint8(self.bufferAr));
  // clear the temp buffer
  self.bufferAr = [];
  // push the rest of the chunk, beginning at the start code, to the temp buffer
  self.bufferAr.push(data.subarray(offset));
};
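
As excerpted here, the loop stops at the first start code in a chunk and only matches the 4-byte form. Below is a rough, self-contained sketch of the same buffering idea that handles both start-code lengths, several NAL units per chunk, and start codes split across chunk boundaries. The NalSplitter class and its method names are invented for this illustration; this is not soliton4's code or a Broadway API. It re-scans the pending bytes on every push, which keeps it simple at the cost of some speed.

```javascript
// Hypothetical streaming splitter: feed it chunks, get complete NAL units back.
class NalSplitter {
  constructor(onNal) {
    this.onNal = onNal;                // called with each complete unit (start code included)
    this.pending = new Uint8Array(0);  // bytes not yet known to form a complete unit
  }

  // Offsets of every 00 00 01 / 00 00 00 01 start code in `bytes`.
  static startCodes(bytes) {
    const offsets = [];
    let i = 0;
    while (i + 2 < bytes.length) {
      if (bytes[i] === 0 && bytes[i + 1] === 0) {
        if (bytes[i + 2] === 1) { offsets.push(i); i += 3; continue; }
        if (bytes[i + 2] === 0 && bytes[i + 3] === 1) { offsets.push(i); i += 4; continue; }
      }
      i += 1;
    }
    return offsets;
  }

  push(chunk) {
    // Join the leftover bytes with the new chunk so a start code that
    // straddles the chunk boundary is still found.
    const data = new Uint8Array(this.pending.length + chunk.length);
    data.set(this.pending, 0);
    data.set(chunk, this.pending.length);

    const offsets = NalSplitter.startCodes(data);
    if (offsets.length === 0) {
      this.pending = data;  // no start code yet, keep buffering
      return;
    }
    // Every unit that is followed by another start code is complete.
    // Bytes before the first start code are dropped.
    for (let n = 0; n + 1 < offsets.length; n++) {
      this.onNal(data.subarray(offsets[n], offsets[n + 1]));
    }
    // The last unit may continue in the next chunk.
    this.pending = data.subarray(offsets[offsets.length - 1]);
  }

  // Call at end of stream to emit the final unit.
  flush() {
    if (NalSplitter.startCodes(this.pending).length > 0) this.onNal(this.pending);
    this.pending = new Uint8Array(0);
  }
}

// Usage with made-up data: the chunk boundary falls inside the second start code.
const input = new Uint8Array([
  0, 0, 0, 1, 0x67, 0xaa,
  0, 0, 1, 0x68, 0xbb,
  0, 0, 0, 1, 0x65, 0xcc, 0xdd,
]);
const received = [];
const splitter = new NalSplitter((unit) => received.push(unit.length));
splitter.push(input.subarray(0, 7));
splitter.push(input.subarray(7));
splitter.flush();
console.log(received); // [ 6, 5, 7 ]
```

In a real pipeline the onNal callback is where you would hand each unit to the decoder.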

Other implementations that might be helpful to see

@OllieJones has a library (https://github.com/OllieJones/h264-interp-utils#nalustream) that has a bunch of H.264 functionality, including searching arrays/streams for frames. Take a look at that repo as a whole, and definitely read the README. I linked to a specific portion of their README because it explains the nuances of H.264 streams and how they sometimes have different formats for frame separators.

In Ollie's repo they are converting from one media container format (WebM) to another (MP4) by extracting raw NALUs (video frames + extra data + stream info) from WebM and "boxing" those NALUs in MP4's container format.

Another implementation in Java (https://github.com/twilightdema/h264j/blob/3dd2cc2e65e653ecbba247ed95a0bff901c98007/h264j/src/main/java/com/twilight/h264/player/H264Player.java)

This is a port of the original FFmpeg code back in 2012.

I'm using this example because it's a completely different thought pattern on how a frame parser could be architected. They're heavily using bit shifting while looking for the start sequence.

The frame parsing in this example starts at this try block (https://github.com/twilightdema/h264j/blob/3dd2cc2e65e653ecbba247ed95a0bff901c98007/h264j/src/main/java/com/twilight/h264/player/H264Player.java#L137-244). You can see they start reading in the file on L139 and walk back and forth through several while loops to do the decoding, using isEndOfFrame as a decision-making point and bit shifting to find the sequence.

private boolean isEndOfFrame(int code) {
    int nal = code & 0x1F;

    if (nal == NAL_AUD) {
        foundFrameStart = false;
        return true;
    }

    boolean foundFrame = foundFrameStart;
    if (nal == NAL_SLICE || nal == NAL_IDR_SLICE) {
        if (foundFrameStart) {
            return true;
        }
        foundFrameStart = true;
    } else {
        foundFrameStart = false;
    }

    return foundFrame;
}
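
To connect this back to the JavaScript side, here is a loose sketch of the same idea: read the NAL unit type from the low 5 bits of the header byte and use it to group NAL units into frames. The type numbers (1 = non-IDR slice, 5 = IDR slice, 7 = SPS, 8 = PPS, 9 = access unit delimiter) are the standard H.264 values. The function names are hypothetical, and the grouping rule is a simplification that assumes one slice per frame, so treat it as a starting point rather than a correct access-unit detector.

```javascript
const NAL_SLICE = 1;
const NAL_IDR_SLICE = 5;
const NAL_AUD = 9;

// Hypothetical helper: type of a NAL unit that still carries its start code.
function nalType(unit) {
  // The header byte follows the start code: index 3 after 00 00 01, else index 4.
  const headerIndex = unit[2] === 1 ? 3 : 4;
  return unit[headerIndex] & 0x1f;
}

// Hypothetical helper: groups NAL units into frames.
// A group is closed when an access unit delimiter arrives, or when any NAL
// arrives after the group already holds a slice (roughly what isEndOfFrame does).
function groupIntoFrames(units) {
  const frames = [];
  let current = [];
  let hasSlice = false;
  for (const unit of units) {
    const type = nalType(unit);
    if (current.length > 0 && (type === NAL_AUD || hasSlice)) {
      frames.push(current);
      current = [];
      hasSlice = false;
    }
    current.push(unit);
    if (type === NAL_SLICE || type === NAL_IDR_SLICE) hasSlice = true;
  }
  if (current.length > 0) frames.push(current);
  return frames;
}

// Made-up units: SPS, PPS, IDR slice, then two non-IDR slices.
const makeNal = (header) => new Uint8Array([0, 0, 0, 1, header, 0xff]);
const units = [0x67, 0x68, 0x65, 0x41, 0x41].map(makeNal);

console.log(groupIntoFrames(units).map((f) => f.map(nalType)));
// [ [ 7, 8, 5 ], [ 1 ], [ 1 ] ]
```

Keeping the SPS and PPS in the same group as the IDR slice that follows them is convenient, since a decoder needs those parameter sets before it can decode the slice.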

Lastly, a project that was inspired by Broadway (https://github.com/oneam/h264bsd)

It has WASM, iOS, C++ and Java H.264 decoder variants, plus some extra goodies.

Decent Wikipedia/articles/documentation

I have to cut this short for now and step away from the computer for a bit; if you have more questions, feel free to ask. Here's some reading to catch you up on H.264 formats and the like. There's more to decoding than identifying the frames. For example, depending on the decoder, you may need to "prime" the input buffer with a sequence of SPS + PPS + I-frame to initialize it so it can determine the video size.

@soliton4 I'm not sure if this is true for this decoder. I used it a long time ago and heavily modified it for a specific purpose. I don't even have that code anymore to reference.
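
For what it's worth, the priming idea can be sketched like this, under the assumption that you already have the stream split into NAL units with their start codes attached. Whether Broadway actually needs this is not confirmed anywhere in this thread, and buildPrimingBuffer is a made-up name, not a decoder API.

```javascript
// Hypothetical helper: builds an SPS + PPS + IDR buffer from a list of NAL
// units (each a Uint8Array that starts with its start code).
// Returns null if any of the three is missing.
function buildPrimingBuffer(units) {
  // NAL type = low 5 bits of the byte that follows the start code.
  const typeOf = (u) => u[u[2] === 1 ? 3 : 4] & 0x1f;

  const sps = units.find((u) => typeOf(u) === 7);  // sequence parameter set
  const pps = units.find((u) => typeOf(u) === 8);  // picture parameter set
  const idr = units.find((u) => typeOf(u) === 5);  // IDR slice (keyframe)
  if (!sps || !pps || !idr) return null;

  // Copy the three units back to back in the order the decoder expects.
  const out = new Uint8Array(sps.length + pps.length + idr.length);
  out.set(sps, 0);
  out.set(pps, sps.length);
  out.set(idr, sps.length + pps.length);
  return out;
}

// Made-up units: a non-IDR slice arrives first, then SPS, PPS and an IDR slice.
const nal = (header) => new Uint8Array([0, 0, 0, 1, header, 0xff]);
const incoming = [nal(0x41), nal(0x67), nal(0x68), nal(0x65)];

const primer = buildPrimingBuffer(incoming);
console.log(primer.length);           // 18
console.log(primer[4].toString(16));  // 67
```

The leading non-IDR slice is skipped because a decoder that has not yet seen the parameter sets and a keyframe has nothing to decode it against.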


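Anyway, here are some resources on the H.264 video codec and MP4 containers:

https://stackoverflow.com/a/24890903 - This is an amazing write-up on the H.264 formats (Annex B vs AVCC), how they store information, and how they differ.

Bitmovin's ultimate guide to container formats: https://3411032.fs1.hubspotusercontent-na1.net/hubfs/3411032/Bitmovin_UltimateGuidetoContainerFormats_Whitepaper.pdf

Very, very simple frame parser implementation (https://stackoverflow.com/a/74040912)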

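const soi = Buffer.from([0x00, 0x00, 0x00, 0x01]);  // 4-byte start code

// Returns the offset of the next start code that is followed by an SPS
// (NAL type 7), searching after position i, or -1 if there is none.
function findStartFrame(buffer, i = -1) {
    while ((i = buffer.indexOf(soi, i + 1)) !== -1) {
        if ((buffer[i + 4] & 0x1F) === 7) return i;
    }
    return -1;
}

soliton4 commented 1 year ago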

That's the most detailed answer ever. Is there an Oscars for thread replies? 'Cause you're nominated.

AnilSonix commented 1 year ago

Thanks for the detailed answer. I will check this out to learn and understand better.

J7a4s0m5ine commented 1 year ago

That's the most detailed answer ever. Is there an Oscars for thread replies? 'Cause you're nominated.

Haha, this is one of those fields that's difficult to understand. If I can help some poor soul along I will.

@AnilSonix No problem, and good luck!