kaitai-io / kaitai_struct_formats

Kaitai Struct: library of binary file formats (.ksy)
http://formats.kaitai.io

Add MPEG-TS format #147

Open kalidasya opened 5 years ago

kalidasya commented 5 years ago

Basic parsing is possible (@pavja2 has a POC), but a more effective parser depends on https://github.com/kaitai-io/kaitai_struct/issues/196.

KOLANICH commented 5 years ago

Please use the form described in https://github.com/kaitai-io/kaitai_struct/issues/134 : 2 yaml blocks and a free-form text. For referencing an issue use

blocked_by:
  - 196

in the second block.

kalidasya commented 5 years ago

I spent the last couple of weeks on my first Kaitai Struct file, for mpeg2 video. I found it really hard to implement, and ultimately I failed. Since I deal with mpegts files, I first wrote an mpeg-ts demuxer in kaitai (https://gist.github.com/kalidasya/ef5a7349aaf53073d2e0e16d3588f751), which was easy. But mpeg2, and potentially h264, will not be possible. Here are the obstacles I encountered (in random order, sorry).

Some useful links about mpeg2 video: http://dvd.sourceforge.net/dvdinfo/mpeghdrs.html#ext

The kaitai struct file: https://gist.github.com/kalidasya/f7ded118a8145b6f47a441bfc780de50

  1. lack of peek. In a lot of cases you need to synchronise with the data stream: mpeg2 video is built up of sequences which are delimited by 0x000001. Without peek I had to use weird structures like:

    rest:
      seq:
        - id: data
          type: magic
          repeat: until
          # next line is ugly. next to the generic exit condition (prefix is 1 or eos reached)
          # we have to make sure we call the next_start_code for each magic otherwise it will not be available later
          # but we cannot do it when the eos was close so the data was not read
          repeat-until: (_.prefix_code == 1 or _io.size - _io.pos < 4) and (_io.size - _io.pos < 4 or _.next_start_code > -1)
      instances:
        next_start_code:
          value: data.last.next_start_code

    start_code:
      seq:
        - id: sync
          contents: [0x00, 0x00, 0x01]
        - id: start_code
          type: u1

    magic:
      seq:
        - id: stuff
          #contents: [0]
          type: u1
      instances:
        prefix_code:
          type: b24
          pos: _io.pos
          if: _io.size - _io.pos > 4
        next_start_code:
          type: u1
          io: _root._io
          pos: _io.pos + 3
          if: _io.size - _io.pos > 4

    so that I can simulate some peek-like behaviour. The goal is that you call the rest type if you just want to consume the remaining data in the stream. It will fail if there is no data remaining, and it will consume the next sequence. Lack of peek caused a lot of other trouble: mpeg2 video sequences have a pre-defined order, which means that while parsing one sequence I have to read to its end and then parse another one if it is present. These kinds of things are not possible (in the attached gist I ignored the sequence order completely).

  2. switch-on does not support ranges. mpeg2 sequence start codes go from 00 to FF in 4 big ranges, and describing those currently would be really ugly. It would be nice to have something like:

    types:
      sequence:
        seq:
          - id: start_code
            type: start_code
            if: _io.size - _io.pos > 4
          - id: data
            type:
              switch-on: start_code.start_code
              cases:
                0x00: picture_header
                0x01...0xAF: slice
                0xb2: user_data
                0xb3: sequence_header
                0xb5: extension_data

  3. keeping bit alignment. some data structures depend on a value that was read at bit level, like:

    seq:
      - id: start_code_id
        type: b4
      - id: data
        type:
          switch-on: start_code_id
          cases:
            0b0001: sequence_extension
            0b0010: sequence_display_extension
            0b1000: picture_coding_extension

    currently you have to merge these structures together: after reading the start_code_id you are not byte aligned, but kaitai generates an align_to_byte call before the switch.

  4. probably related to the lack of peek, but mpeg2 video has structures like: read 1 bit, if it is true, read 8 bits. check the next bit, if it is 1 repeat this cycle. if it is 0 read all 0 bits (but not the 0x000001 separator)

  5. cached instances, they caused a little headache in the peek simulation, it is not impossible to work around but caused a few extra conditions in ifs

  6. in some cases it might have been easier to add some hack if the bits_left IO attribute were exposed in the structure. I do not feel it is harmful.

  7. some generic ignore eos, my struct file is full of weird ifs just to not fail if the input data is not fully finished.

  8. web ide: it's extremely useful, but the error messages are fully obfuscated (ksy errors as well, but mostly parsing errors). I had to compile with python all the time to figure out what went wrong. It would be super great if it could display the objects parsed so far (might not be possible).

  9. in python exceptions it would be great to have the position in the root io printed out. I had to change the source all the time to see how much we had parsed and how the error happened.

  10. it is slow. While the mpeg-ts demuxer is faster than my original bytearray based solution (which I wanted to replace), the mpeg2 video parser was significantly slower (on my test video it went from 2 sec to 10). Partly this is because it is parsing more, but not that much more. I suspect the many bit reads, which are not efficient. Maybe kaitai can combine bit reads into one struct.unpack command?

  11. contents is not supported for bit level types like:

    - id: sync_word
      contents: [true]

    or

    - id: sync_word
      contents: [0b1111111]
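
For comparison, this is roughly what points 1 and 2 look like in plain Python rather than in a .ksy file. It is only a minimal sketch, assuming the whole elementary stream is available as a seekable stream. The helper names (`peek`, `classify`, `split_units`) are made up for illustration and are not part of Kaitai Struct; the unit names come from the switch-on example above. It scans byte by byte for clarity, so it is not fast.

```python
import io

START_CODE_PREFIX = b"\x00\x00\x01"


def peek(stream, n):
    """Return up to n bytes from the current position without consuming them."""
    pos = stream.tell()
    data = stream.read(n)
    stream.seek(pos)
    return data


def classify(code):
    """Map a start code byte to a unit name; slices cover a whole range."""
    if code == 0x00:
        return "picture_header"
    if 0x01 <= code <= 0xAF:
        return "slice"
    return {
        0xB2: "user_data",
        0xB3: "sequence_header",
        0xB5: "extension_data",
    }.get(code, "other")


def split_units(stream):
    """Yield (start_code, payload) pairs, one per 0x000001-delimited unit."""
    # Skip anything in front of the first start code prefix.
    while peek(stream, 3) != START_CODE_PREFIX:
        if not stream.read(1):
            return
    while True:
        stream.read(3)  # consume the prefix we just peeked
        code_byte = stream.read(1)
        if not code_byte:
            return
        payload = bytearray()
        # Collect bytes until the next prefix is visible or the data ends.
        while peek(stream, 3) != START_CODE_PREFIX:
            chunk = stream.read(1)
            if not chunk:
                break
            payload += chunk
        yield code_byte[0], bytes(payload)
        if peek(stream, 3) != START_CODE_PREFIX:
            return
```

Calling `list(split_units(io.BytesIO(data)))` returns the units in stream order, and `classify` can then be applied to each start code. Because the prefix is only peeked, it is still in the stream when the next unit starts, which is the behaviour the `rest`/`magic` types above try to emulate.
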

KOLANICH commented 5 years ago

Each of these points deserves its own issue in some kaitai-io repo.

BTW synalysis repo has some grammars for mpeg, you may find them useful.

GreyCat commented 5 years ago

@kalidasya Thanks for bringing this all together. A few comments of mine on these:

> 1. lack of peek

A relatively complex problem. To the extent you've mentioned in this point, seeking the next sync point this way would be implemented using pluggable scanning algorithms, as per https://github.com/kaitai-io/kaitai_struct/issues/538. There is even a proof-of-concept PR branch by @tinrodriguez8 (https://github.com/kaitai-io/kaitai_struct_compiler/pull/166), but, unfortunately, it kind of stalled lately.

> 2. switch-on does not support ranges

https://github.com/kaitai-io/kaitai_struct/issues/130

> 3. keeping bit alignment

Yup, should be implemented in https://github.com/kaitai-io/kaitai_struct/issues/12 with align: bit

> 4. probably related to the lack of peek, but mpeg2 video has structures like: read 1 bit, if it is true, read 8 bits. check the next bit, if it is 1 repeat this cycle. if it is 0 read all 0 bits (but not the 0x000001 separator)

Again, likely scanning algorithms would help, but just to clarify: what are you supposed to do with these bits afterwards? Do they form some kind of a value?

> 5. cached instances, they caused a little headache in the peek simulation, it is not impossible to work around but caused a few extra conditions in ifs

Cached instances are there for a reason. Just to clarify: you don't need/want them to be re-evaluated anywhere outside peeking scenario?

> 6. in some cases it might have been easier to add some hack if the bits_left IO attribute were exposed in the structure. I do not feel it is harmful.

Yeah, given that we have _io.pos, it would totally make sense to have a similar thing for bit-level positioning. Just added it as https://github.com/kaitai-io/kaitai_struct/issues/596

> 7. some generic ignore eos, my struct file is full of weird ifs just to not fail if the input data is not fully finished.

That's pretty vague. The problem with "ignore EOS" is in defining what exactly happens when we hit an error, i.e. how do we recover. Do we stop this branch / all further processing to some extent, or do we continue, and, if we do, to what extent?

This is somewhat discussed in https://github.com/kaitai-io/kaitai_struct/issues/280, but I don't think we even have any solid proposals for an exception/recovery system so far :(

GreyCat commented 5 years ago

> 8. web ide: it's extremely useful, but the error messages are fully obfuscated (ksy errors as well, but mostly parsing errors). I had to compile with python all the time to figure out what went wrong. It would be super great if it could display the objects parsed so far (might not be possible).

It is possible, but, unfortunately, the situation with the WebIDE is somewhat bad lately :( It asks for a major build system revamp, but nobody is interested in sitting down and redoing it, as it's not really a fun project. There is a stuck PR by @fudgepop01: https://github.com/kaitai-io/kaitai_struct_webide/pull/84. I wasn't able to get it to build fully, and, alas, it looks like @fudgepop01 has lost interest in it too, and now concentrates on his VSCode extension project (which might be a great next step).

> 9. in python exceptions it would be great to have the position in the root io printed out. I had to change the source all the time to see how much we had parsed and how the error happened.

I'm not sure about root IO (and you're probably not asking for _root._io.pos, right?), but it's actually good feedback for a future KaitaiStruct-specific exception system, i.e. every exception raised should probably include the IO position as a baseline requirement?

> 10. it is slow. While the mpeg-ts demuxer is faster than my original bytearray based solution (which I wanted to replace), the mpeg2 video parser was significantly slower (on my test video it went from 2 sec to 10). Partly this is because it is parsing more, but not that much more. I suspect the many bit reads, which are not efficient.

This is actually interesting, as it should not be. I suspect that implementation of "scanning" / "peeking" via creation of tons of objects in memory might be to blame. Can we investigate it more somehow?

> Maybe kaitai can combine bit reads into one struct.unpack command?

I don't think that struct.unpack has any means to extract individual bits, and the current implementation kind of does that already, i.e. it reads bytes and then splits them into bits. Maybe we can optimize it for fixed byte locations, so we won't have that many reads and/or condition checks, but I'm not sure if that's the main culprit.

Again, we'll probably need to investigate/benchmark/profile it more. Are there any good tools out there for Python to do that?

> 11. contents is not supported for bit level types like:

We'll likely have that as part of https://github.com/kaitai-io/kaitai_struct/issues/435, i.e. as

    - id: sync_word
      type: b7
      valid: 0b1111111
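
As an illustration of the struct.unpack point: it only unpacks whole bytes, so bit fields still have to be split out with shifts and masks afterwards. The following is a minimal sketch, assuming the standard 4-byte MPEG-TS packet header layout; `parse_ts_header` is a hypothetical helper, not something the Kaitai runtime generates.

```python
import struct


def parse_ts_header(packet):
    """Decode the 4-byte MPEG-TS packet header with a single unpack call."""
    # One read of four whole bytes as a big-endian 32-bit integer.
    (word,) = struct.unpack(">I", packet[:4])
    # The individual bit fields are then carved out of that integer.
    return {
        "sync_byte": (word >> 24) & 0xFF,
        "transport_error_indicator": (word >> 23) & 0x1,
        "payload_unit_start_indicator": (word >> 22) & 0x1,
        "transport_priority": (word >> 21) & 0x1,
        "pid": (word >> 8) & 0x1FFF,
        "transport_scrambling_control": (word >> 6) & 0x3,
        "adaptation_field_control": (word >> 4) & 0x3,
        "continuity_counter": word & 0xF,
    }
```

This only works because the field positions are fixed. Whether a similar trick would help the generated parsers depends on where the time actually goes, which is why profiling comes first.
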

kalidasya commented 5 years ago

> @kalidasya Thanks for bringing this all together. A few comments of mine on these:
>
> > 1. lack of peek
>
> A relatively complex problem. To the extent you've mentioned in this point, seeking the next sync point this way would be implemented using pluggable scanning algorithms, as per kaitai-io/kaitai_struct#538. There is even a proof-of-concept PR branch by @tinrodriguez8 (kaitai-io/kaitai_struct_compiler#166), but, unfortunately, it kind of stalled lately.

I think that ticket handles it.

> > 2. switch-on does not support ranges
>
> kaitai-io/kaitai_struct#130
>
> > 3. keeping bit alignment
>
> Yup, should be implemented in kaitai-io/kaitai_struct#12 with align: bit
>
> > 4. probably related to the lack of peek, but mpeg2 video has structures like: read 1 bit, if it is true, read 8 bits. check the next bit, if it is 1 repeat this cycle. if it is 0 read all 0 bits (but not the 0x000001 separator)
>
> Again, likely scanning algorithms would help, but just to clarify: what are you supposed to do with these bits afterwards? Do they form some kind of a value?

In this particular case the data can be dropped; it's more that the structure exists. (I have not attempted full mpeg2 parsing, only meta information, so no picture data for example.)

> > 5. cached instances, they caused a little headache in the peek simulation, it is not impossible to work around but caused a few extra conditions in ifs
>
> Cached instances are there for a reason. Just to clarify: you don't need/want them to be re-evaluated anywhere outside peeking scenario?

I think my issue was a combination of using pos: _io.pos together with io: _root._io and evaluating it from an outer scope. In order to have the data available later, I needed to hack the repeat condition to include a boolean expression which is always evaluated, so that the data is cached immediately. Actually, now that I have described it, in this case it was good to have the cache. Maybe we can ignore this point until I come up with a better use case.

> > 6. in some cases it might have been easier to add some hack if the bits_left IO attribute were exposed in the structure. I do not feel it is harmful.
>
> Yeah, given that we have _io.pos, it would totally make sense to have a similar thing for bit-level positioning. Just added it as kaitai-io/kaitai_struct#596

great!

> > 7. some generic ignore eos, my struct file is full of weird ifs just to not fail if the input data is not fully finished.
>
> That's pretty vague. The problem with "ignore EOS" is in defining what exactly happens when we hit an error, i.e. how do we recover. Do we stop this branch / all further processing to some extent, or do we continue, and, if we do, to what extent?

sorry, here I meant only eos of the _root._io, so basically a try-catch in the _read of the main class (this is how it looks in python).

> This is somewhat discussed in kaitai-io/kaitai_struct#280, but I don't think we even have any solid proposals for an exception/recovery system so far :(
>
> > 8. web ide: it's extremely useful, but the error messages are fully obfuscated (ksy errors as well, but mostly parsing errors). I had to compile with python all the time to figure out what went wrong. It would be super great if it could display the objects parsed so far (might not be possible).
>
> It is possible, but, unfortunately, the situation with the WebIDE is somewhat bad lately :( It asks for a major build system revamp, but nobody is interested in sitting down and redoing it, as it's not really a fun project. There is a stuck PR by @fudgepop01: kaitai-io/kaitai_struct_webide#84. I wasn't able to get it to build fully, and, alas, it looks like @fudgepop01 has lost interest in it too, and now concentrates on his VSCode extension project (which might be a great next step).

ok, that's understandable, the web ide still provides a lot of value

> > 9. in python exceptions it would be great to have the position in the root io printed out. I had to change the source all the time to see how much we had parsed and how the error happened.
>
> I'm not sure about root IO (and you're probably not asking for _root._io.pos, right?), but it's actually good feedback for a future KaitaiStruct-specific exception system, i.e. every exception raised should probably include the IO position as a baseline requirement?

in my case I was always interested in the root io.pos and often in the bits_left attribute, but indeed giving some clue about where in the bytestream the error happened would help a lot. I am not sure whether the position in a non-root io would help me identify the bytes I failed to parse. But that was just me, maybe my heuristics were suboptimal.

> > 10. it is slow. While the mpeg-ts demuxer is faster than my original bytearray based solution (which I wanted to replace), the mpeg2 video parser was significantly slower (on my test video it went from 2 sec to 10). Partly this is because it is parsing more, but not that much more. I suspect the many bit reads, which are not efficient.
>
> This is actually interesting, as it should not be. I suspect that implementation of "scanning" / "peeking" via creation of tons of objects in memory might be to blame. Can we investigate it more somehow?

True, that can be the other culprit. I will try to investigate it more. In general it is just reading byte after byte, as the other implementation did, so I ruled that out, but maybe I was wrong.

> > Maybe kaitai can combine bit reads into one struct.unpack command?
>
> I don't think that struct.unpack has any means to extract individual bits, and the current implementation kind of does that already, i.e. it reads bytes and then splits them into bits. Maybe we can optimize it for fixed byte locations, so we won't have that many reads and/or condition checks, but I'm not sure if that's the main culprit.

indeed, it's a heatwave here, I hallucinated a bit level struct; yes, that's for bytes only. I will try to figure out where we spend our time by profiling the execution.

> Again, we'll probably need to investigate/benchmark/profile it more. Are there any good tools out there for Python to do that?

yes there are, I used them long ago so I have to refresh my knowledge

> > 11. contents is not supported for bit level types like:
>
> We'll likely have that as part of kaitai-io/kaitai_struct#435, i.e. as
>
>     - id: sync_word
>       type: b7
>       valid: 0b1111111
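
The "ignore eos" idea from this reply (a try/except around the top-level read) might look roughly like this in plain Python. This is only a sketch: the record layout (1-byte tag, 2-byte big-endian length, payload) and the names `read_exact` and `parse_records` are invented for illustration and are unrelated to the generated Kaitai code.

```python
import io
import struct


def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError, like a strict parser would."""
    data = stream.read(n)
    if len(data) < n:
        raise EOFError(f"requested {n} bytes, but only {len(data)} available")
    return data


def parse_records(stream):
    """Parse records until the data ends; return (records, complete).

    complete is False when the input stopped in the middle of a record,
    in which case the records parsed so far are still returned.
    """
    records = []
    try:
        while True:
            start = stream.tell()
            if not stream.read(1):
                return records, True  # clean end between two records
            stream.seek(start)
            tag, length = struct.unpack(">BH", read_exact(stream, 3))
            records.append((tag, read_exact(stream, length)))
    except EOFError:
        # Truncated input: keep what was parsed instead of failing.
        return records, False
```

The single try/except at the top level replaces the per-field `if: _io.size - _io.pos > 4` guards, at the cost of only knowing that the last record was cut off, not which field.
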

kalidasya commented 5 years ago

So what does not have an issue linked:

  4. probably related to the lack of peek, but mpeg2 video has structures like: read 1 bit, if it is true, read 8 bits. check the next bit, if it is 1 repeat this cycle. if it is 0 read all 0 bits (but not the 0x000001 separator)
  7. some generic ignore eos, my struct file is full of weird ifs just to not fail if the input data is not fully finished.
  9. in python exceptions it would be great to have the position in the root io printed out. I had to change the source all the time to see how much we had parsed and how the error happened.
  10. it is slow. While the mpeg-ts demuxer is faster than my original bytearray based solution (which I wanted to replace), the mpeg2 video parser was significantly slower (on my test video it went from 2 sec to 10). Partly this is because it is parsing more, but not that much more. I suspect the many bit reads, which are not efficient.

I will take 10 and post my findings with the profiler here.
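
The structure from point 4 (a flag bit, 8 data bits while the flag is 1, then zero padding up to the next 0x000001 prefix) can be sketched with a small bit reader. `BitReader` and `read_extra_information` are hypothetical names, and the byte-alignment step before skipping zero bytes is an assumption about how the padding is laid out.

```python
class BitReader:
    """Minimal MSB-first bit reader over a bytes object."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current position in bits

    def bits_left(self):
        return len(self.data) * 8 - self.pos

    def read(self, n):
        if n > self.bits_left():
            raise EOFError("not enough bits left")
        value = 0
        for _ in range(n):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

    def peek(self, n):
        """Read n bits and rewind, so nothing is consumed."""
        saved = self.pos
        try:
            return self.read(n)
        finally:
            self.pos = saved


def read_extra_information(reader):
    """Read the flag/value pairs, then skip padding up to the next prefix."""
    values = []
    # A 1 flag means 8 data bits follow; a 0 flag ends the loop.
    while reader.read(1) == 1:
        values.append(reader.read(8))
    # Skip zero bits up to the next byte boundary (assumed padding layout).
    while reader.pos % 8 != 0 and reader.bits_left() > 0 and reader.peek(1) == 0:
        reader.read(1)
    # Skip zero bytes, but stop in front of the 0x000001 prefix.
    while (
        reader.bits_left() >= 24
        and reader.peek(24) != 0x000001
        and reader.peek(8) == 0
    ):
        reader.read(8)
    return values
```

After the call, the reader is positioned on the 0x000001 prefix, so the next unit can be parsed from there. The `peek` method is what makes the final loop possible without consuming the separator.
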

kalidasya commented 5 years ago

it seems @GreyCat is right: out of 15 sec cumulative, 7 sec was spent in the prefix_code property. So a different way of seeking might help a lot. I will try to figure out if I can narrow it down more.

Next is 3 sec cumulative in kaitaistruct.read_bits_int (called more times than prefix_code), and after that comes kaitaistruct.size with 2 sec cumulative.

In read_bits_int it seems that out of the 2.4 sec cumulative, 1.8 is in the function itself (read_bytes is only 0.4 and isinstance only 0.2). Maybe some optimisation can happen there, but I guess the whole seeking should be addressed first, then we will see the performance.
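
Cumulative timings like the ones above can be collected with Python's built-in cProfile and pstats modules. The following is a minimal sketch: `profile_call` and the stand-in workload `busy` are hypothetical, and in practice the profiled callable would be whatever function parses the test file.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, limit=10):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time, so callers that dominate the run come first.
    stats.sort_stats("cumulative").print_stats(limit)
    return result, out.getvalue()


def busy(n):
    """Stand-in workload: count the odd numbers below n."""
    return sum(i & 1 for i in range(n))


if __name__ == "__main__":
    result, report = profile_call(busy, 100_000)
    print(report)
```

The report lists call counts, time spent in each function itself (tottime) and time including callees (cumtime), which is the split quoted above for read_bits_int.
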

avi-techno commented 4 years ago

Sir, I am new to Kaitai. I want to parse a pcap file which has TCP/IP flow info, and I am using Java for programming. I have seen your video on YouTube regarding media parsing. Kindly let me know how to parse a pcap file.

kalidasya commented 4 years ago

@avi-techno what does it have to do with mpeg-ts?

SarotecK commented 1 month ago

Is mpeg working nowadays?