kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

python: explicit typecast has no effect #1017

Closed milahu closed 1 year ago

milahu commented 1 year ago

i'm trying to use sqlite3.ksy in Python

this breaks when i access cell.body

import parser.sqlite3 as parser_sqlite3
db = parser_sqlite3.Sqlite3.from_file("test.db")
cell = db.root_page.cells[0]
cell.body
#     if (self.serial_type.code.value >= 1) and (
# AttributeError: 'VlqBase128Be' object has no attribute 'code'. Did you mean: 'close'?

```py
# create test database
import sqlite3

database = "test.db"  # same file that is parsed below
con = sqlite3.connect(database)
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS lang(name, first_appeared)")
data = [
    ("C++", 1985),
    ("Objective-C", 1984),
]
cur.executemany("INSERT INTO lang(name, first_appeared) VALUES(?, ?)", data)
con.commit()
con.close()

# read database
# sqlite3.py is generated by kaitai-struct-compiler
import parser.sqlite3 as parser_sqlite3

db = parser_sqlite3.Sqlite3.from_file("test.db")
cell = db.root_page.cells[0]
cell.body
#     if (self.serial_type.code.value >= 1) and (
# AttributeError: 'VlqBase128Be' object has no attribute 'code'. Did you mean: 'close'?
```

problem: value: ser.as<serial> does not produce a typecast

self._m_serial_type = self.ser

expected: something like

_pos = self.ser._io.pos() # FIXME wrong position
ser_len = 1 # TODO dynamic
self.ser._io.seek(_pos - ser_len)
self._m_serial_type = Sqlite3.Serial(self.ser._io, self.ser, self.ser._root)
#self.ser._io.seek(_pos)

... but i would need the original IO position of self.ser to read the original bytes, or i would need a to_bytes method: Sqlite3.Serial.from_bytes(self.ser.to_bytes())

self.ser is the raw value, for example self.ser = 23

self.serial_type is the interpreted value, for example self.serial_type.is_blob = True
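
To make the raw/interpreted distinction concrete, here is a minimal sketch of the interpretation step. It mirrors the `is_blob`, `is_string` and `len_content` properties of the generated `Serial` class shown later in this thread, but `SerialInfo` is a hypothetical stand-in that takes a plain integer rather than a parsed `VlqBase128Be` object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SerialInfo:
    """Hypothetical stand-in: interprets a raw serial type code (plain int)."""

    code: int

    @property
    def is_blob(self) -> bool:
        # Even codes from 12 upwards describe blobs.
        return self.code >= 12 and self.code % 2 == 0

    @property
    def is_string(self) -> bool:
        # Odd codes from 13 upwards describe strings.
        return self.code >= 13 and self.code % 2 == 1

    @property
    def len_content(self) -> Optional[int]:
        # Only defined for blob/string codes, as in the generated code.
        if self.code >= 12:
            return (self.code - 12) // 2
        return None


for code in (1, 7, 18, 19):
    info = SerialInfo(code)
    print(code, info.is_blob, info.is_string, info.len_content)
```

The point is only that the interpreted view is computed from the raw code; it does not require reading any bytes a second time.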

sqlite3.ksy

  column_content:
    params:
      - id: ser
        type: struct
    seq:
      - id: as_int
        type:
          switch-on: serial_type.code.value
          cases:
            1: u1
            2: u2
            # ...
        if: serial_type.code.value >= 1 and serial_type.code.value <= 6
      - id: as_float
        type: f8
        if: serial_type.code.value == 7
      # ...
    instances:
      serial_type:
        value: ser.as<serial>

sqlite3.py

    class ColumnContent(KaitaiStruct):
        # ...
        @property
        def serial_type(self):
            if hasattr(self, "_m_serial_type"):
                return self._m_serial_type

            self._m_serial_type = self.ser
            return getattr(self, "_m_serial_type", None)

the typecast seems to be working in java: this.serialType = ((Sqlite3.Serial) (ser()));

Sqlite3.java

    public static class ColumnContent extends KaitaiStruct {
        // ...
        private Sqlite3.Serial serialType;
        public Sqlite3.Serial serialType() {
            if (this.serialType != null)
                return this.serialType;
            this.serialType = ((Sqlite3.Serial) (ser()));
            return this.serialType;
        }

i'm using kaitai-struct-compiler version 0.10 to generate sqlite3.py

sqlite3.py

```py
# This is a generated file! Please edit source .ksy file and use kaitai-struct-compiler to rebuild

import kaitaistruct
from kaitaistruct import KaitaiStruct, KaitaiStream, BytesIO
from enum import Enum


if getattr(kaitaistruct, "API_VERSION", (0, 9)) < (0, 9):
    raise Exception(
        "Incompatible Kaitai Struct Python API: 0.9 or later is required, but you have %s"
        % (kaitaistruct.__version__)
    )

from . import vlq_base128_be


class Sqlite3(KaitaiStruct):
    """SQLite3 is a popular serverless SQL engine, implemented as a library
    to be used within other applications. It keeps its databases as
    regular disk files.

    Every database file is segmented into pages. First page (starting at
    the very beginning) is special: it contains a file-global header
    which specifies some data relevant to proper parsing (i.e. format
    versions, size of page, etc). After the header, normal contents of
    the first page follow.

    Each page would be of some type, and generally, they would be
    reached via the links starting from the first page. First page type
    (`root_page`) is always "btree_page".

    .. seealso::
       Source - https://www.sqlite.org/fileformat.html
    """

    class Versions(Enum):
        legacy = 1
        wal = 2

    class Encodings(Enum):
        utf_8 = 1
        utf_16le = 2
        utf_16be = 3

    def __init__(self, _io, _parent=None, _root=None):
        self._io = _io
        self._parent = _parent
        self._root = _root if _root else self
        self._read()

    def _read(self):
        self.magic = self._io.read_bytes(16)
        if (
            not self.magic
            == b"\x53\x51\x4C\x69\x74\x65\x20\x66\x6F\x72\x6D\x61\x74\x20\x33\x00"
        ):
            raise kaitaistruct.ValidationNotEqualError(
                b"\x53\x51\x4C\x69\x74\x65\x20\x66\x6F\x72\x6D\x61\x74\x20\x33\x00",
                self.magic,
                self._io,
                "/seq/0",
            )
        self.len_page_mod = self._io.read_u2be()
        self.write_version = KaitaiStream.resolve_enum(
            Sqlite3.Versions, self._io.read_u1()
        )
        self.read_version = KaitaiStream.resolve_enum(
            Sqlite3.Versions, self._io.read_u1()
        )
        self.reserved_space = self._io.read_u1()
        self.max_payload_frac = self._io.read_u1()
        self.min_payload_frac = self._io.read_u1()
        self.leaf_payload_frac = self._io.read_u1()
        self.file_change_counter = self._io.read_u4be()
        self.num_pages = self._io.read_u4be()
        self.first_freelist_trunk_page = self._io.read_u4be()
        self.num_freelist_pages = self._io.read_u4be()
        self.schema_cookie = self._io.read_u4be()
        self.schema_format = self._io.read_u4be()
        self.def_page_cache_size = self._io.read_u4be()
        self.largest_root_page = self._io.read_u4be()
        self.text_encoding = KaitaiStream.resolve_enum(
            Sqlite3.Encodings, self._io.read_u4be()
        )
        self.user_version = self._io.read_u4be()
        self.is_incremental_vacuum = self._io.read_u4be()
        self.application_id = self._io.read_u4be()
        self.reserved = self._io.read_bytes(20)
        self.version_valid_for = self._io.read_u4be()
        self.sqlite_version_number = self._io.read_u4be()
        self.root_page = Sqlite3.BtreePage(self._io, self, self._root)

    class Serial(KaitaiStruct):
        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self._read()

        def _read(self):
            self.code = vlq_base128_be.VlqBase128Be(self._io)

        @property
        def is_blob(self):
            if hasattr(self, "_m_is_blob"):
                return self._m_is_blob

            self._m_is_blob = (self.code.value >= 12) and ((self.code.value % 2) == 0)
            return getattr(self, "_m_is_blob", None)

        @property
        def is_string(self):
            if hasattr(self, "_m_is_string"):
                return self._m_is_string

            self._m_is_string = (self.code.value >= 13) and (
                (self.code.value % 2) == 1
            )
            return getattr(self, "_m_is_string", None)

        @property
        def len_content(self):
            if hasattr(self, "_m_len_content"):
                return self._m_len_content

            if self.code.value >= 12:
                self._m_len_content = (self.code.value - 12) // 2

            return getattr(self, "_m_len_content", None)

    class BtreePage(KaitaiStruct):
        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self._read()

        def _read(self):
            self.page_type = self._io.read_u1()
            self.first_freeblock = self._io.read_u2be()
            self.num_cells = self._io.read_u2be()
            self.ofs_cells = self._io.read_u2be()
            self.num_frag_free_bytes = self._io.read_u1()
            if (self.page_type == 2) or (self.page_type == 5):
                self.right_ptr = self._io.read_u4be()

            self.cells = []
            for i in range(self.num_cells):
                self.cells.append(Sqlite3.RefCell(self._io, self, self._root))

    class CellIndexLeaf(KaitaiStruct):
        """
        .. seealso::
           Source - https://www.sqlite.org/fileformat.html#b_tree_pages
        """

        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self._read()

        def _read(self):
            self.len_payload = vlq_base128_be.VlqBase128Be(self._io)
            self._raw_payload = self._io.read_bytes(self.len_payload.value)
            _io__raw_payload = KaitaiStream(BytesIO(self._raw_payload))
            self.payload = Sqlite3.CellPayload(_io__raw_payload, self, self._root)

    class Serials(KaitaiStruct):
        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self._read()

        def _read(self):
            self.entries = []
            i = 0
            while not self._io.is_eof():
                self.entries.append(vlq_base128_be.VlqBase128Be(self._io))
                i += 1

    class CellTableLeaf(KaitaiStruct):
        """
        .. seealso::
           Source - https://www.sqlite.org/fileformat.html#b_tree_pages
        """

        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self._read()

        def _read(self):
            self.len_payload = vlq_base128_be.VlqBase128Be(self._io)
            self.row_id = vlq_base128_be.VlqBase128Be(self._io)
            self._raw_payload = self._io.read_bytes(self.len_payload.value)
            _io__raw_payload = KaitaiStream(BytesIO(self._raw_payload))
            self.payload = Sqlite3.CellPayload(_io__raw_payload, self, self._root)

    class CellPayload(KaitaiStruct):
        """
        .. seealso::
           Source - https://sqlite.org/fileformat2.html#record_format
        """

        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self._read()

        def _read(self):
            self.len_header_and_len = vlq_base128_be.VlqBase128Be(self._io)
            self._raw_column_serials = self._io.read_bytes(
                (self.len_header_and_len.value - 1)
            )
            _io__raw_column_serials = KaitaiStream(BytesIO(self._raw_column_serials))
            self.column_serials = Sqlite3.Serials(
                _io__raw_column_serials, self, self._root
            )
            self.column_contents = []
            for i in range(len(self.column_serials.entries)):
                self.column_contents.append(
                    Sqlite3.ColumnContent(
                        self.column_serials.entries[i], self._io, self, self._root
                    )
                )

    class CellTableInterior(KaitaiStruct):
        """
        .. seealso::
           Source - https://www.sqlite.org/fileformat.html#b_tree_pages
        """

        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self._read()

        def _read(self):
            self.left_child_page = self._io.read_u4be()
            self.row_id = vlq_base128_be.VlqBase128Be(self._io)

    class CellIndexInterior(KaitaiStruct):
        """
        .. seealso::
           Source - https://www.sqlite.org/fileformat.html#b_tree_pages
        """

        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self._read()

        def _read(self):
            self.left_child_page = self._io.read_u4be()
            self.len_payload = vlq_base128_be.VlqBase128Be(self._io)
            self._raw_payload = self._io.read_bytes(self.len_payload.value)
            _io__raw_payload = KaitaiStream(BytesIO(self._raw_payload))
            self.payload = Sqlite3.CellPayload(_io__raw_payload, self, self._root)

    class ColumnContent(KaitaiStruct):
        def __init__(self, ser, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self.ser = ser
            self._read()

        def _read(self):
            if (self.serial_type.code.value >= 1) and (
                self.serial_type.code.value <= 6
            ):
                _on = self.serial_type.code.value
                if _on == 4:
                    self.as_int = self._io.read_u4be()
                elif _on == 6:
                    self.as_int = self._io.read_u8be()
                elif _on == 1:
                    self.as_int = self._io.read_u1()
                elif _on == 3:
                    self.as_int = self._io.read_bits_int_be(24)
                elif _on == 5:
                    self.as_int = self._io.read_bits_int_be(48)
                elif _on == 2:
                    self.as_int = self._io.read_u2be()

            if self.serial_type.code.value == 7:
                self.as_float = self._io.read_f8be()

            if self.serial_type.is_blob:
                self.as_blob = self._io.read_bytes(self.serial_type.len_content)

            self.as_str = (self._io.read_bytes(self.serial_type.len_content)).decode(
                "UTF-8"
            )

        @property
        def serial_type(self):
            if hasattr(self, "_m_serial_type"):
                return self._m_serial_type

            self._m_serial_type = self.ser
            return getattr(self, "_m_serial_type", None)

    class RefCell(KaitaiStruct):
        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self._read()

        def _read(self):
            self.ofs_body = self._io.read_u2be()

        @property
        def body(self):
            if hasattr(self, "_m_body"):
                return self._m_body

            _pos = self._io.pos()
            self._io.seek(self.ofs_body)
            _on = self._parent.page_type
            if _on == 13:
                self._m_body = Sqlite3.CellTableLeaf(self._io, self, self._root)
            elif _on == 5:
                self._m_body = Sqlite3.CellTableInterior(self._io, self, self._root)
            elif _on == 10:
                self._m_body = Sqlite3.CellIndexLeaf(self._io, self, self._root)
            elif _on == 2:
                self._m_body = Sqlite3.CellIndexInterior(self._io, self, self._root)
            self._io.seek(_pos)
            return getattr(self, "_m_body", None)

    @property
    def len_page(self):
        if hasattr(self, "_m_len_page"):
            return self._m_len_page

        self._m_len_page = 65536 if self.len_page_mod == 1 else self.len_page_mod
        return getattr(self, "_m_len_page", None)
```

generalmimon commented 1 year ago

@milahu:

problem: value: ser.as<serial> does not produce a typecast

self._m_serial_type = self.ser

This is correct, it's working as expected. Python is dynamically typed, so if you believe that you have an object of a particular type in a variable, you can immediately access the properties specific to that type. This is in contrast to statically typed languages like Java: there, if you know that a variable of the general KaitaiStruct type currently holds an object of a more specific type on which you would like to access a property, you must first do the type conversion (see https://docs.oracle.com/javase/specs/jls/se8/html/jls-5.html), otherwise the Java compiler will give you a compile error.
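
As a rough illustration of why a cast has no runtime effect in Python, here is a small sketch using the standard library's `typing.cast`. The two classes are simplified stand-ins invented for this example, not the generated Kaitai classes.

```python
from typing import cast


class VlqBase128Be:
    """Stand-in for the imported VLQ type; it only has `value`."""

    def __init__(self, value: int) -> None:
        self.value = value


class Serial:
    """Stand-in for the `serial` type; it wraps a VLQ in `code`."""

    def __init__(self, code: VlqBase128Be) -> None:
        self.code = code


ser = VlqBase128Be(23)

# typing.cast only informs static type checkers; at runtime it returns
# its argument unchanged, just like the generated `self.ser` assignment.
serial_type = cast(Serial, ser)
same_object = serial_type is ser

try:
    serial_type.code  # the object is still a VlqBase128Be
    error = None
except AttributeError as exc:
    error = exc

print(same_object, type(error).__name__)
```

The failure only surfaces at the attribute access, which matches the traceback in the original report.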

expected: something like

_pos = self.ser._io.pos()
ser_len = 1 # TODO dynamic
self.ser._io.seek(_pos - ser_len)
self._m_serial_type = Sqlite3.Serial(self.ser._io, self.ser, self.ser._root)
#self.ser._io.seek(_pos)

This is a misunderstanding of the type cast operation - a type cast should never do anything like this. You think that:

the typecast seems to be working in java: this.serialType = ((Sqlite3.Serial) (ser()));

and in a way, yes, the generated code also looks as I'd expect (as in Python), but if you actually run the Java code, you'll get a ClassCastException for the same reason you got the AttributeError in Python (just the error is thrown on the type cast already, not on the attribute access) - the code expected that the real type of ser is Serial, but it is actually VlqBase128Be and thus the type conversion failed.
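
For comparison, the Java behaviour can be imitated in Python with an explicit runtime check. This is a sketch only: `checked_cast` is a hypothetical helper written for this example and is not part of the Kaitai Struct runtime, and the two classes are simplified stand-ins.

```python
class VlqBase128Be:
    """Stand-in for the imported VLQ type."""

    def __init__(self, value):
        self.value = value


class Serial:
    """Stand-in for the `serial` type."""

    def __init__(self, code):
        self.code = code


def checked_cast(expected_type, obj):
    """Hypothetical helper: fail at the cast site, as Java's checked cast does."""
    if not isinstance(obj, expected_type):
        raise TypeError(
            f"cannot cast {type(obj).__name__} to {expected_type.__name__}"
        )
    return obj  # same object, nothing is reparsed or converted


ser = VlqBase128Be(23)
try:
    checked_cast(Serial, ser)
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

Either way the cast never changes the object; the only difference between the two languages is where the mismatch gets reported.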

The real issue is that sqlite3.ksy is wrong and needs to be fixed (thanks for discovering and reporting this).

For starters, this mistake is hidden by the fact that the ser parameter is declared as type: struct, see sqlite3.ksy:207-210:

  column_content:
    params:
      - id: ser
        type: struct

struct means any user-defined type. This is a problem, because the compiler (correctly) allows passing any user type in there - in this case the type will be vlq_base128_be (sqlite3.ksy:183-189):

      - id: column_serials
        size: len_header_and_len.value - 1
        type: serials
      - id: column_contents
        repeat: expr
        repeat-expr: column_serials.entries.size
        type: column_content(column_serials.entries[_index])

sqlite3.ksy:190-194

  serials:
    seq:
      - id: entries
        type: vlq_base128_be
        repeat: eos

But although the actual type of ser is always vlq_base128_be, as we've just seen, the spec thinks it's serial, which it is not (sqlite3.ksy:234-236):

    instances:
      serial_type:
        value: ser.as<serial>

So this is basically guaranteed to fail at runtime. But there's not much the Kaitai Struct compiler can do about this - strictly speaking, all operations here are valid and are correctly translated (it's just that the .ksy spec is badly written).

It's much better to require a specific user type when defining the parameter:

   column_content:
     params:
       - id: ser
-        type: struct
+        type: serial

Now, KS compiler will not allow passing the vlq_base128_be type to the ser parameter and will raise a compile error.


Unfortunately, if you declare vlq_base128_be as the parameter type instead, you'll be allowed to pass column_serials.entries[_index] there (as expected), but I don't think you'll get a compile error for the ser.as<serial> operation. That cast can never succeed, and the compiler could detect this automatically and throw a compile error, but sadly that is not implemented:

   column_content:
     params:
       - id: ser
-        type: struct
+        type: vlq_base128_be

The KS compiler is very dumb when it comes to type casting: AFAIK it allows absolutely any type cast you write and doesn't check whether it makes any sense (this is tracked in https://github.com/kaitai-io/kaitai_struct/issues/696). So using type casting may be dangerous if you don't know what you're doing. I recommend using it sparingly and thinking carefully about whether it is valid (in many cases, people use it in a way they shouldn't and it causes problems).

milahu commented 1 year ago

It's much better to require a specific user type when defining the parameter:

   column_content:
     params:
       - id: ser
-        type: struct
+        type: serial

Now, KS compiler will not allow passing the vlq_base128_be type to the ser parameter and will raise a compile error.

yes, i also had to patch serials to make this work

   serials:
     seq:
       - id: entries
-        type: vlq_base128_be
+        type: serial

diff sqlite3.ksy

```diff
--- a/sqlite3.ksy
+++ b/sqlite3.ksy
@@ -190,7 +190,7 @@ types:
   serials:
     seq:
       - id: entries
-        type: vlq_base128_be
+        type: serial
         repeat: eos
   serial:
     seq:
@@ -207,11 +207,11 @@ types:
   column_content:
     params:
       - id: ser
-        type: struct
+        type: serial
     seq:
       - id: as_int
         type:
-          switch-on: serial_type.code.value
+          switch-on: ser.code.value
           cases:
             1: u1
             2: u2
@@ -219,21 +219,18 @@ types:
             4: u4
             5: b48
             6: u8
-        if: serial_type.code.value >= 1 and serial_type.code.value <= 6
+        if: ser.code.value >= 1 and ser.code.value <= 6
       - id: as_float
         type: f8
-        if: serial_type.code.value == 7
+        if: ser.code.value == 7
       - id: as_blob
-        size: serial_type.len_content
-        if: serial_type.is_blob
+        size: ser.len_content
+        if: ser.is_blob
       - id: as_str
         type: str
-        size: serial_type.len_content
+        size: ser.len_content
         encoding: UTF-8
-#        if: _root.text_encoding == encodings::utf_8 and serial_type.is_string
-    instances:
-      serial_type:
-        value: ser.as<serial>
+#        if: _root.text_encoding == encodings::utf_8 and ser.is_string
 enums:
   versions:
     1: legacy
```

The real issue is that sqlite3.ksy is wrong

this looks like a micro-optimization, trying to defer the evaluation of serial

... but i would need the original IO position of self.ser to read the original bytes

implemented in https://github.com/milahu/pysqlite3/tree/fix-typecast-with-io-init-pos

milahu commented 1 year ago

closing in favor of https://github.com/kaitai-io/kaitai_struct_formats/pull/640

generalmimon commented 1 year ago

@milahu:

... but i would need the original IO position of self.ser to read the original bytes

implemented in https://github.com/milahu/pysqlite3/tree/fix-typecast-with-io-init-pos

Again, this is not a type cast, this is reparsing the bytes of what was originally one structure as another structure (but it was never needed, the actual problem was how sqlite3.ksy was written). I tried to explain it in my last comment (https://github.com/kaitai-io/kaitai_struct/issues/1017#issuecomment-1492988496), I recommend reading it, I wrote it for you.
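
The difference between a cast and a reparse can be sketched with the standard `struct` module. The two classes here are made up for the illustration and have nothing to do with the SQLite format: both read the same two bytes, but with a different layout.

```python
import struct


class TwoBytes:
    """Hypothetical structure: two separate unsigned bytes."""

    def __init__(self, data: bytes) -> None:
        self.first, self.second = struct.unpack(">BB", data)


class OneShort:
    """Hypothetical structure: one big-endian unsigned 16-bit integer."""

    def __init__(self, data: bytes) -> None:
        (self.value,) = struct.unpack(">H", data)


raw = bytes([0x01, 0x02])
original = TwoBytes(raw)

# A cast: the very same object, only viewed under another declared type.
casted = original

# A reparse: a new object built by reading the original bytes again.
reparsed = OneShort(raw)

print(casted is original, reparsed is original, reparsed.value)
```

A cast can only succeed if the object already is of the target type, whereas a reparse needs access to the original bytes, which is exactly what the workaround branch had to add.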

yes, i also had to patch serials to make this work

   serials:
     seq:
       - id: entries
-        type: vlq_base128_be
+        type: serial

This patch looks quite legit, so why don't you use kaitai-struct-compiler to regenerate the generated sqlite3.py? Then there will be no type cast and no serial_type, so your patch in https://github.com/milahu/pysqlite3/commit/d3aa20c664cc6bd39eb60c92f744e3ca6d8369f9 will also be meaningless. I don't understand how the second half of your comment https://github.com/kaitai-io/kaitai_struct/issues/1017#issuecomment-1493000218 can follow after the first one.
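
To illustrate what the patched spec amounts to, here is a plain-Python sketch that does not use the Kaitai runtime. The decoder is a generic big-endian base-128 VLQ reader written for this example, `Serial` holds the code as a plain integer instead of a `VlqBase128Be` object, and `read_serials` is a hypothetical helper standing in for the `serials` type with `repeat: eos`.

```python
import io


def read_vlq_base128_be(stream) -> int:
    """Read one big-endian base-128 VLQ: 7 payload bits per byte,
    with the high bit set on every byte except the last."""
    value = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside a VLQ")
        byte = chunk[0]
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:
            return value


class Serial:
    """Simplified stand-in for the `serial` type (code kept as an int)."""

    def __init__(self, stream) -> None:
        self.code = read_vlq_base128_be(stream)

    @property
    def is_blob(self) -> bool:
        return self.code >= 12 and self.code % 2 == 0

    @property
    def is_string(self) -> bool:
        return self.code >= 13 and self.code % 2 == 1


def read_serials(data: bytes) -> list:
    """Hypothetical helper: parse entries as Serial until end of stream."""
    stream = io.BytesIO(data)
    entries = []
    while stream.tell() < len(data):
        entries.append(Serial(stream))
    return entries


entries = read_serials(bytes([0x13, 0x81, 0x00]))
print([(e.code, e.is_blob, e.is_string) for e in entries])
```

Because every entry is parsed as a `Serial` from the start, whatever consumes the entries can use `is_blob` and `is_string` directly, with no cast and no second pass over the bytes.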