kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Can't find sub-sections in elf.ksy _debug metadata #880

Open hello-adam opened 3 years ago

hello-adam commented 3 years ago

I'm using the Kaitai Python runtime in https://github.com/Mahlet-Inc/hobbits to make a Kaitai runner and viewer plugin. A user pointed out that when running kaitai's executable/elf.ksy on something like libc.so, my viewer is missing the programHeaders, sectionHeaders, and strings parts of the header that show up in the kaitai web IDE

Screenshot from 2021-04-24 12-24-03

My issue is that I can't seem to find those fields anywhere in the _debug metadata produced by the parser:

The root parsed object:

{ '_debug': defaultdict(<class 'dict'>,
                        { 'abi': {'end': 8, 'start': 7},
                          'abi_version': {'end': 9, 'start': 8},
                          'bits': {'end': 5, 'start': 4},
                          'ei_version': {'end': 7, 'start': 6},
                          'endian': {'end': 6, 'start': 5},
                          'header': {'end': 52, 'start': 16},
                          'magic': {'end': 4, 'start': 0},
                          'pad': {'end': 16, 'start': 9}}),
  '_io': <kaitaistruct.KaitaiStream object at 0x7fff98288070>,
  '_parent': None,
  '_root': <elf.Elf object at 0x7fff985f1c10>,
  'abi': <OsAbi.gnu: 3>,
  'abi_version': 0,
  'bits': <Bits.b32: 1>,
  'ei_version': 1,
  'endian': <Endian.le: 1>,
  'header': <elf.Elf.EndianElf object at 0x7fff6297ab20>,
  'magic': b'\x7fELF',
  'pad': b'\x00\x00\x00\x00\x00\x00\x00'}

The header object:

{ '_debug': defaultdict(<class 'dict'>,
                        { 'e_ehsize': {'end': 42, 'start': 40},
                          'e_type': {'end': 18, 'start': 16},
                          'e_version': {'end': 24, 'start': 20},
                          'entry_point': {'end': 28, 'start': 24},
                          'flags': {'end': 40, 'start': 36},
                          'machine': {'end': 20, 'start': 18},
                          'program_header_entry_size': {'end': 44, 'start': 42},
                          'program_header_offset': {'end': 32, 'start': 28},
                          'qty_program_header': {'end': 46, 'start': 44},
                          'qty_section_header': {'end': 50, 'start': 48},
                          'section_header_entry_size': {'end': 48, 'start': 46},
                          'section_header_offset': {'end': 36, 'start': 32},
                          'section_names_idx': {'end': 52, 'start': 50}}),
  '_io': <kaitaistruct.KaitaiStream object at 0x7fff98288070>,
  '_is_le': True,
  '_parent': <elf.Elf object at 0x7fff985f1c10>,
  '_root': <elf.Elf object at 0x7fff985f1c10>,
  'e_ehsize': 52,
  'e_type': <ObjType.shared: 3>,
  'e_version': 1,
  'entry_point': 127584,
  'flags': b'\x00\x00\x00\x00',
  'machine': <Machine.x86: 3>,
  'program_header_entry_size': 32,
  'program_header_offset': 52,
  'qty_program_header': 13,
  'qty_section_header': 67,
  'section_header_entry_size': 40,
  'section_header_offset': 2955588,
  'section_names_idx': 66}

Am I missing something somewhere?

generalmimon commented 3 years ago

@hello-adam First of all, the reason why you don't see values of instances or their _debug info is simply that they haven't been parsed at all. Unlike seq fields that all get parsed just by calling the _read method (which is done automatically by default), one of the fundamental properties of instances is that they are lazy. See https://doc.kaitai.io/user_guide.html#_instances_data_beyond_the_sequence:

Another very important difference between the seq attribute and the instances attribute is that instances are lazy by default. What does that mean? Unless someone would call that body getter method programmatically, no actual parsing of body would be done.

So if you want to read the values of all of them, you eventually need to invoke them all. I guess the easiest method for Python is to use reflection on the generated parser classes to get the instance names for each subtype so that you can access them afterwards. I suppose that this is going to be quite easy to do, just find out how to use reflection in Python - there is probably a single function that gives you all property names when you call it with the struct object as an argument. You will need to read all instances recursively, though - start with the top-level object, and while you iterate over the properties, check whether the value of each one is a nested KaitaiStruct object (i.e. something like isinstance(struct[property], kaitaistruct.KaitaiStruct)); if it is, you need to recurse into it and do the same for the properties of this nested object.
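A minimal sketch of that recursive walk, not taken from hobbits or from the Kaitai runtime. The names property_names, read_all, FakeStruct, Leaf and Root are hypothetical; in real use the base class passed in would be kaitaistruct.KaitaiStruct, and the sketch assumes the generated classes expose instances as @property members that cache their value under '_m_<name>', as the elf.py in this thread does.

```python
import inspect


def property_names(obj):
    """Names of all @property members on obj's class (the lazy instances)."""
    is_prop = lambda member: isinstance(member, property)
    return [name for name, _ in inspect.getmembers(type(obj), is_prop)]


def read_all(obj, struct_base, seen=None):
    """Invoke every property on obj, then recurse into nested structs."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # do not walk the same object twice
        return
    seen.add(id(obj))

    for name in property_names(obj):
        try:
            getattr(obj, name)  # reading the property triggers the lazy parse
        except Exception as exc:
            print(f"failed to read {name}: {exc!r}")

    for key, value in list(vars(obj).items()):
        # Skip _io, _parent, _root, _debug..., but keep cached '_m_' values.
        if key.startswith('_') and not key.startswith('_m_'):
            continue
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, struct_base):
                read_all(child, struct_base, seen)


class FakeStruct:
    """Hypothetical stand-in for kaitaistruct.KaitaiStruct."""


class Leaf(FakeStruct):
    @property
    def doubled(self):
        if not hasattr(self, '_m_doubled'):
            self._m_doubled = 2
        return self._m_doubled


class Root(FakeStruct):
    def __init__(self):
        self.magic = b'\x7fELF'  # plays the role of a seq field

    @property
    def leaves(self):
        if not hasattr(self, '_m_leaves'):
            self._m_leaves = [Leaf(), Leaf()]
        return self._m_leaves


root = Root()
read_all(root, FakeStruct)
```

After the call, every instance reachable from root has been evaluated, so its cached value (and, in a real debug-mode parser, its _debug entry) should exist.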


Second, if you want to get _debug info for instances, check the generated code to see how the instance program_headers is parsed and what _debug info it stores - I marked the lines saving info to the _debug map with an asterisk * (https://github.com/Mahlet-Inc/hobbits/blob/c51f39f/src/hobbits-plugins/analyzers/KaitaiStruct/ksy_py/executable/elf.py#L1572-L1610):

        @property
        def program_headers(self):
            if hasattr(self, '_m_program_headers'):
                return self._m_program_headers if hasattr(self, '_m_program_headers') else None

            _pos = self._io.pos()
            self._io.seek(self.program_header_offset)
        *   self._debug['_m_program_headers']['start'] = self._io.pos()
            if self._is_le:
                self._raw__m_program_headers = [None] * (self.qty_program_header)
                self._m_program_headers = [None] * (self.qty_program_header)
                for i in range(self.qty_program_header):
        *           if not 'arr' in self._debug['_m_program_headers']:
        *               self._debug['_m_program_headers']['arr'] = []
        *           self._debug['_m_program_headers']['arr'].append({'start': self._io.pos()})
                    self._raw__m_program_headers[i] = self._io.read_bytes(self.program_header_entry_size)
                    _io__raw__m_program_headers = KaitaiStream(BytesIO(self._raw__m_program_headers[i]))
                    _t__m_program_headers = Elf.EndianElf.ProgramHeader(_io__raw__m_program_headers, self, self._root, self._is_le)
                    _t__m_program_headers._read()
                    self._m_program_headers[i] = _t__m_program_headers
        *           self._debug['_m_program_headers']['arr'][i]['end'] = self._io.pos()

            else:
                # duplicate code from `if self._is_le` branch - I know the compiler
                # could do a better job of eliminating this, but it's not anywhere
                # high on our priorities I'd say, as long as the code works

        *   self._debug['_m_program_headers']['end'] = self._io.pos()
            self._io.seek(_pos)
            return self._m_program_headers if hasattr(self, '_m_program_headers') else None

The notable change from seq fields is that the key in the _debug map has the _m_ prefix, which is something you'll need to adapt your code to.
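A small sketch of how application code might look up a _debug record under either key. The helper debug_entry and the FakeHeader class are hypothetical; the offsets are the ones from the dumps in this thread (the 'end' of 468 follows from 13 entries of 32 bytes starting at 52).

```python
from collections import defaultdict


def debug_entry(obj, name):
    """Return the _debug record for a seq field or instance, or None."""
    debug = getattr(obj, '_debug', {})
    # seq fields are keyed by their plain name, instances by '_m_' + name
    for key in (name, '_m_' + name):
        if key in debug:  # `in` does not create entries in a defaultdict
            return debug[key]
    return None


class FakeHeader:
    """Hypothetical stand-in for a parsed struct."""
    def __init__(self):
        self._debug = defaultdict(dict)
        self._debug['e_type'] = {'start': 16, 'end': 18}
        self._debug['_m_program_headers'] = {
            'start': 52,
            'end': 468,
            'arr': [{'start': 52, 'end': 84}],  # first of 13 entries
        }


header = FakeHeader()
print(debug_entry(header, 'e_type'))
print(debug_entry(header, 'program_headers'))
```

Checking the plain name first means the lookup would keep working if the compiler later drops the _m_ prefix from the key.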

KOLANICH commented 3 years ago

I suppose that this is going to be quite easy to do, just find how to use reflection in Python - probably there is some single function that gives you all property names when you call it with the struct object as an argument.

Not quite - one also has to filter out all the builtin and inherited methods. It would be nice to generate an explicit tuple of all instances and also a method invoking parsing of them all.
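One way that filtering could be sketched, as an assumption rather than anything the runtime provides. The function own_public_names and the Base and Demo classes are hypothetical; Base stands in for kaitaistruct.KaitaiStruct.

```python
import inspect


def own_public_names(obj, base):
    """Names of seq fields and instances on obj, without evaluating any."""
    inherited = set(dir(base))  # builtins plus whatever the base class adds
    names = []
    for name in dir(obj):  # dir() lists names without calling properties
        if name.startswith('_') or name in inherited:
            continue
        if inspect.isclass(getattr(type(obj), name, None)):
            continue  # nested type or enum class, not a field
        names.append(name)
    return names


class Base:
    """Hypothetical stand-in for kaitaistruct.KaitaiStruct."""
    def close(self):
        pass


class Demo(Base):
    class Inner(Base):
        pass

    def __init__(self):
        self.magic = b'\x7fELF'
        self._io = None

    @property
    def lazy(self):
        raise RuntimeError("must not be evaluated while listing names")


print(own_public_names(Demo(), Base))
```

The class-level getattr returns the property object itself, so listing the names does not trigger any parsing.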

You will need to read all instances recursively, though - start with the top-level object, and while you iterate over the properties, check whether the value of each one is a nested KaitaiStruct object (i.e. something like isinstance(struct[property], kaitaistruct.KaitaiStruct)); if it is, you need to recurse into it and do the same for the properties of this nested object.

And it is possible to get into infinite recursion when there are 2 types having instances referring each other.

The notable change from seq fields is that the key in the _debug map has the _m_ prefix, which is something you'll need to adapt your code to.

I guess it may be better to fix the compiler.

generalmimon commented 3 years ago

@KOLANICH:

And it is possible to get into infinite recursion when there are 2 types having instances referring each other.

Um, you mean like

meta:
  id: test
seq:
  - id: top
    type: foo
types:
  foo:
    instances:
      recursive_ref:
        type: bar
  bar:
    instances:
      recursive_ref:
        type: foo

...? I'd say that this is recursion by design, since it can be a perfectly legitimate thing to do (see the example in https://doc.kaitai.io/user_guide.html#_replacing_parent with a type node that references itself), and the onus is on the KSY author to add some ifs to ensure that it won't run indefinitely when you try to read all instances recursively. I don't know what you are getting at. I can't think of a special measure that would need to be taken on the side of the application code - for each KaitaiStruct object, you request the property names, filter out the builtin and inherited symbols as you've correctly pointed out so that you end up only with seq fields and instances, check whether the value of each one is an instance of another KaitaiStruct object, and recurse into it if so. If a recursive descendant finally decides to end the chain, it will end up with if: false on the instance that usually holds the nested struct, so that getting its value will yield None, which is not an instance of KaitaiStruct, so the recursion will also stop there.
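A toy illustration of that termination argument, with no Kaitai code involved. The Node class and chain_length are hypothetical; the depth counter plays the role of the if condition that the KSY author is expected to put on the instance.

```python
class Node:
    """Stand-in for a generated type whose instance refers to its own type."""

    def __init__(self, depth):
        self.depth = depth

    @property
    def child(self):
        # Mirrors an `if:` on the instance: once depth reaches 0 the
        # instance is absent and its value is None.
        if not hasattr(self, '_m_child'):
            self._m_child = Node(self.depth - 1) if self.depth > 0 else None
        return self._m_child


def chain_length(node):
    """Follow `child` until the value is no longer a Node."""
    count = 0
    while isinstance(node, Node):
        count += 1
        node = node.child
    return count


print(chain_length(Node(3)))  # nodes at depth 3, 2, 1, 0 -> 4
```

The walk needs no special guard of its own here; it stops because None fails the isinstance check.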


The notable change from seq fields is that the key in the _debug map has the _m_ prefix, which is something you'll need to adapt your code to.

I guess it may be better to fix the compiler.

I agree - I've noticed this for the first time and it probably isn't intentional, as I can't think of any logical reason for this. The actual _m_-prefixed properties are internal and are not meant to be exposed (they should be private or protected in languages that support access modifiers), so I can't see why the _debug key should be this internal property name. I was just describing the current behavior as I found it, because people tend to be more interested in the present than the possible future.

This will be a potential breaking change for users making use of the _debug map, though...

hello-adam commented 3 years ago

wow, thanks for all of the info - I'll probably be able to get something working with this. I'll comment again if it's solved or if I get stuck.

KOLANICH commented 3 years ago

and the onus is on the KSY author to add some ifs to ensure that it won't run indefinitely

  1. for example, a PRNG would run, if not indefinitely, then for a very long time, so no one really takes measures to keep them from running indefinitely
  2. again, cross-references for records. There are 2 arrays, one after another (they may even be in completely different places), and each record in one array has a 1-to-1 correspondence to a record in the other one

In fact, to generate the positions we need only pos-instances, but we have to use pos-instances as a workaround for the unavailability of typed value-instances ...

So, again, for me it looks like that for proper solution

hello-adam commented 3 years ago

I think I got this working better. Still getting some weird errors when I access those properties that do the lazy parsing though. One example:

Parsing <elf.Elf.EndianElf object at 0x7fff6d85c070> at 'header'
Parsing <elf.Elf.EndianElf.ProgramHeader object at 0x7fff6d85c160> at 'header._m_program_headers[0]'
Failed when getting property flags_obj: Traceback (most recent call last):
  File "/tmp/HobbitsPythonUffbRt/thescript.py", line 96, in parse_struct
    getattr(struct, attr)
  File "/tmp/hobbits-wolAVX/elf.py", line 1033, in flags_obj
    self._m_flags_obj = Elf.PhdrTypeFlags((self.flags64 | self.flags32), self._io, self, self._root)
AttributeError: 'ProgramHeader' object has no attribute 'flags64'

The elf.py there is the same one referenced previously.