This would be fairly easy if you understand what an h264 stream "looks like" and its format/structure.
0x000001 or 0x00000001 is placed at the beginning of each NAL unit.
To extract the frames you would read the stream until you find the beginning of the next NAL unit. Say you have a start byte where you identified the end or start of a frame, let's say it's byte 256. You then continue reading the stream until you find the next 0x000001 or 0x00000001, which signifies the beginning of the next frame. Let's say this header is found at byte 512. You now know there is a fully encapsulated frame between bytes 256 and 512 in the stream, and the next frame starts at byte 512.
From this point it's all data and memory management: where and how you want to save the extracted frames.
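As a rough illustration of that description, here is a minimal sketch (my own example, not code from this thread) that scans a buffer for Annex B start codes and slices out the data between them; the function and variable names are placeholders.
// Minimal sketch: split an Annex B h264 buffer into NAL units by scanning
// for the 0x000001 / 0x00000001 start codes described above.
function splitNalUnits(buf) {
  const starts = [];
  for (let i = 0; i + 2 < buf.length; i++) {
    // three-byte start code 00 00 01 (a four-byte 00 00 00 01 also matches
    // here one byte later; its extra leading zero stays with the previous unit)
    if (buf[i] === 0 && buf[i + 1] === 0 && buf[i + 2] === 1) {
      starts.push(i);
      i += 2; // skip past the start code
    }
  }
  // slice out everything between consecutive start codes (start codes stripped)
  return starts.map((s, n) =>
    buf.slice(s + 3, n + 1 < starts.length ? starts[n + 1] : buf.length)
  );
}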
Thanks for replying. Looks like I'm not smart enough ☺️ to understand this fully. Could you point me where to get started video processing and codecs etc in general. This is very new to me.
This is an example where I decode a raw h264 stream and split it on the 0x00000001 markers: https://github.com/soliton4/nodeMirror/blob/cf7db884e61919f1efec92fb3601585c8e3c8f12/src/avc/Wgt.js#L291. It is best to feed complete NALs to the decoder; however, I believe the latest version does the NAL splitting internally.
@AnilSonix
Thanks for replying. Looks like I'm not smart enough ☺️ to understand this fully
Sorry, I didn't mean it in that manner. When I first approached video streams and encoding/decoding, it all looked alien to me too. My point was more that if you understand the underlying streams, you'll be able to understand what's going on with this mess of media coding and containerization. It's a fairly complex topic, and as the need for bandwidth savings grows it will only get more complex. But let's forget about that for now and get back on topic: finding frames, your original question.
Finding the h264 NALUs (frames) is essentially searching through an array buffer for a pattern, in this case the three-byte or four-byte start code. If you don't find the pattern in the current stream data, you append it to an intermediary buffer and continue to read the source stream. This is one pattern for achieving frame parsing. There are also times when the frame separator is different depending on the h264 format/profile; see the links and information below.
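To make the "different frame separators" point concrete: Annex B streams delimit NAL units with start codes, while AVCC/MP4-style streams prefix each NAL unit with its length instead. Below is a rough heuristic sketch (my own illustration, not code from the linked answers) of telling the two apart from the first few bytes; real code should rely on the container/extradata.
// Heuristic sketch only: guess whether a buffer is Annex B (start codes)
// or AVCC-style (4-byte big-endian length prefixes).
function guessH264Framing(buf) {
  if (buf.length < 5) return "unknown";
  // Annex B begins with 00 00 01 or 00 00 00 01
  if (buf[0] === 0 && buf[1] === 0 && (buf[2] === 1 || (buf[2] === 0 && buf[3] === 1))) {
    return "annexb";
  }
  // AVCC: the first 4 bytes hold the length of the first NAL unit
  const len = ((buf[0] << 24) | (buf[1] << 16) | (buf[2] << 8) | buf[3]) >>> 0;
  if (len > 0 && len + 4 <= buf.length) return "avcc";
  return "unknown";
}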
https://github.com/soliton4/nodeMirror/blob/cf7db884e61919f1efec92fb3601585c8e3c8f12/src/avc/Wgt.js#L291 this is an example where i decode a raw h264 stream and split it on the 0x00000001 markers
@soliton4's linked code is a great example of parsing a stream for NALUs.
Loop the incoming data stream to find a start sequence (0x00, 0x00, 0x00, 0x01)
To extrapolate their code against what I said above: they get the data from the media source and loop through that array (https://github.com/soliton4/nodeMirror/blob/cf7db884e61919f1efec92fb3601585c8e3c8f12/src/avc/Wgt.js#L305) to find the start sequence. When no start code is found, they append the data to the temp buffer and continue to read from the stream.
I'm oversimplifying a bit here; their code is multithreaded (possibly using web workers) and there's a little more going on than what I alluded to. There's also some logic to handle the case where nothing currently exists in the temp buffer when we find a NALU, which means it's either the first frame, or we already sent the previous frame to the decoder by the time the start sequence was found. But for all intents and purposes the simplified explanation holds.
I've added some comments for clarity and explanation:
var b = 0;
var l = data.length; // get length of the incoming data
var zeroCnt = 0;
for (b; b < l; ++b) { // loop that uses zeroCnt to keep track of contiguous zeros
  if (data[b] === 0) {
    zeroCnt++;
  } else {
    if (data[b] == 1) {
      if (zeroCnt >= 3) { // at least 3 contiguous zeros were found!
        hit(b - 3); // send the offset to the "hit" function so it can process the current temp buffer and combine the frame data
        break;
      }
    }
    zeroCnt = 0;
  }
}
if (!foundHit) {
  this.bufferAr.push(data); // no start code was found, continue pushing data to the temp buffer
}
So a start code was found while we were looping
In the case a start code is found, we note the exact position where it occurs (the offset position in the data buffer) and create a subarray with everything leading up to the offset; everything before the offset position is part of the previous frame and is concatenated together with the existing temp buffer (bufferAr) and sent to the decoder as a whole frame. The temp buffer is then cleared, and everything that followed the offset in the original stream buffer is pushed to the temp buffer to start the loop process over again.
var hit = function(offset){
  foundHit = true;
  // pass subarray at the offset where the start code was found
  self.bufferAr.push(data.subarray(0, offset));
  // concat the two arrays and push to the decoder
  self.decode( concatUint8(self.bufferAr) );
  // clear the temp buffer
  self.bufferAr = [];
  // push the second portion of the sliced array to the temp buffer
  self.bufferAr.push(data.subarray(offset));
};
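concatUint8 isn't shown in the excerpt above; presumably it just flattens the array of chunks in bufferAr into one contiguous buffer before decoding. A minimal sketch of what such a helper could look like (an assumption about its behavior, not the original implementation):
// Hypothetical stand-in for the concatUint8 helper referenced above:
// flatten an array of Uint8Array chunks into a single Uint8Array.
function concatUint8(chunks) {
  let total = 0;
  for (const c of chunks) total += c.length;
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}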
Other implementations that might be helpful to see
@OllieJones has a library (https://github.com/OllieJones/h264-interp-utils#nalustream) with a bunch of H264 functionality, including searching arrays/streams for frames. Take a look at that repo as a whole and definitely read the README; I linked to the specific portion of it that explains the nuances of H264 streams and how they sometimes use different formats for frame separators.
In Ollie's repo they convert from one media container format (webm) to another (mp4) by extracting the raw NALUs (video frames + extra data + stream info) from webm and "boxing" those NALUs in mp4's container format.
Another implementation, in Java: https://github.com/twilightdema/h264j/blob/3dd2cc2e65e653ecbba247ed95a0bff901c98007/h264j/src/main/java/com/twilight/h264/player/H264Player.java
This is a port of the original FFmpeg code, from back in 2012.
I'm using this example because it's a completely different thought pattern for how a frame parser could be architected; they lean heavily on bit shifting while looking for the start sequence.
The frame parsing in this example starts at this try block: https://github.com/twilightdema/h264j/blob/3dd2cc2e65e653ecbba247ed95a0bff901c98007/h264j/src/main/java/com/twilight/h264/player/H264Player.java#L137-244. You can see they start reading in the file at L139 and walk back and forth through several while loops to do the decoding, using isEndOfFrame as a decision-making point and bit shifting to find the sequence.
private boolean isEndOfFrame(int code) {
  int nal = code & 0x1F;
  if (nal == NAL_AUD) {
    foundFrameStart = false;
    return true;
  }
  boolean foundFrame = foundFrameStart;
  if (nal == NAL_SLICE || nal == NAL_IDR_SLICE) {
    if (foundFrameStart) {
      return true;
    }
    foundFrameStart = true;
  } else {
    foundFrameStart = false;
  }
  return foundFrame;
}
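The bit-shifting idea is essentially a rolling 32-bit window over the stream: shift each new byte into a 32-bit value and check whether that value equals the start code. Here's a small JavaScript sketch of the same pattern (my own illustration, not a translation of the Java above):
// Rolling 32-bit window: shift each byte into `code` and record the offsets
// right after every 0x00000001 start sequence.
function findStartCodeOffsets(buf) {
  const offsets = [];
  let code = 0xffffffff; // seeded so the first bytes can't spuriously match
  for (let i = 0; i < buf.length; i++) {
    code = ((code << 8) | buf[i]) >>> 0; // keep it an unsigned 32-bit value
    if (code === 0x00000001) {
      offsets.push(i + 1); // the NAL unit payload begins at the next byte
    }
  }
  return offsets;
}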
Lastly, a project that was inspired by Broadway: https://github.com/oneam/h264bsd
It has WASM, iOS, C++ and Java h264 decoder variants, plus some extra goodies.
I have to cut this short for now and step away from the computer for a bit; if you have more questions or anything, feel free to ask. Here's some reading to catch you up on H264 formats and the like. There's more to decoding than identifying the frames; for example, depending on the decoder you may need to "prime" the input buffer with a sequence of SPS + PPS + I-frame to initialize it so it can determine the video size.
@soliton4 I'm not sure if this is true for this decoder. I used it a long time ago and heavily modified it for a specific purpose. I don't even have that code anymore to reference.
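To illustrate the "priming" idea in code terms: one common approach for a generic Annex B decoder (an assumption for illustration, not a specific Broadway API) is to hold back the SPS (NAL type 7), PPS (type 8) and first IDR slice (type 5), and submit them together before anything else.
// Sketch: collect SPS (7), PPS (8) and the first IDR slice (5) from a list of
// NAL units (start codes already stripped) and hand them to some decoder.
// `decoder.decode` and `nalUnits` are placeholders, not a real Broadway API.
function primeDecoder(decoder, nalUnits) {
  let sps = null, pps = null, idr = null;
  for (const nal of nalUnits) {
    const type = nal[0] & 0x1f; // NAL unit type lives in the low 5 bits of the first byte
    if (type === 7) sps = nal;
    else if (type === 8) pps = nal;
    else if (type === 5) idr = nal;
    if (sps && pps && idr) break;
  }
  if (!(sps && pps && idr)) return false; // not enough data yet, keep buffering
  decoder.decode(sps);
  decoder.decode(pps);
  decoder.decode(idr);
  return true;
}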
Anyways, here are some resources on the H264 video codec and MP4 containers:
https://stackoverflow.com/a/24890903 - an amazing write-up on the H264 formats (Annex B vs AVCC), how they store information, and how they differ
Bitmovin's ultimate guide to container formats (https://3411032.fs1.hubspotusercontent-na1.net/hubfs/3411032/Bitmovin_UltimateGuidetoContainerFormats_Whitepaper.pdf) - not specifically about H264, but great info about media container formats
ITU-T H264 profile spec PDF - https://www.itu.int/rec/T-REC-H.264-202108-I/en/wp_h264_31669_en_0803_lo.pdf
H264 profile list - https://en.wikipedia.org/wiki/Advanced_Video_Coding#Profiles
Video codecs - Mozilla MDN - https://developer.mozilla.org/en-US/docs/Web/Media/Formats/Video_codecs
Media container formats - Mozilla MDN - https://developer.mozilla.org/en-US/docs/Web/Media/Formats/Containers
Network Abstraction Layer (NALU) - https://en.wikipedia.org/wiki/Network_Abstraction_Layer
Parameter sets (SPS/PPS) - https://en.wikipedia.org/wiki/Network_Abstraction_Layer#Parameter_Sets
Getting frames from an RTSP source explained - https://stackoverflow.com/a/7668578
Very, very simple frame parser implementation (https://stackoverflow.com/a/74040912):
const soi = Buffer.from([0x00, 0x00, 0x00, 0x01]); // 4-byte Annex B start code
function findStartFrame(buffer, i = -1) {
  // find a start code that is immediately followed by an SPS NAL unit (type 7),
  // i.e. roughly the start of a new coded sequence/keyframe
  while ((i = buffer.indexOf(soi, i + 1)) !== -1) {
    if ((buffer[i + 4] & 0x1F) === 7) return i;
  }
  return -1;
}
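One possible way to use that parser (an illustrative guess, since the linked answer only shows the function itself) is to call it twice and slice out everything between two consecutive SPS-led start codes:
// Illustrative usage of findStartFrame: return the buffer slice between one
// SPS-led start code and the next, plus where to resume searching.
function nextChunk(buffer, from = -1) {
  const start = findStartFrame(buffer, from);
  if (start === -1) return null;
  const end = findStartFrame(buffer, start);
  return {
    chunk: buffer.subarray(start, end === -1 ? buffer.length : end),
    next: end,
  };
}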
That's the most detailed answer ever. Is there an Oscars for thread replies? Cause you're nominated.
Thanks for the detailed answer. I will check this out to learn and understand better.
That's the most detailed answer ever. Is there an Oscars for thread replies? Cause you're nominated …
Haha, this is one of those fields that's difficult to understand. If I can help some poor soul along I will.
@AnilSonix No problem, and good luck!
Can anyone provide a code sample to extract all the frames from a video? I'm not able to get it done via Decoder.js (couldn't find it in the docs).