kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Bit field data types #9

Closed. tyilo closed this issue 7 years ago.

tyilo commented 8 years ago

Sometimes a file format doesn't use whole bytes for every field, so it would be nice to support bit fields. Something like b[1-64]{le,be}.

GreyCat commented 8 years ago

Yeah, there are already plans for it :)

rpavlik commented 7 years ago

Subscribing to this - I was hoping to make a non-spaghetti-code EDID parser using ksy, but it's bitfields galore, as well as crazy string formats (PNPID packs 3 A-Z ASCII characters into two bytes)

GreyCat commented 7 years ago

@rpavlik The funny thing is that I was digging through EDID format just a week ago or so. I guess I'll publish what I've got so far, so we won't be duplicating the effort at the very least. Here it is.

ams-tschoening commented 7 years ago

I would like to add the current use case I have to deal with to this discussion, because you might find it interesting when deciding on the design. I'm not sure if it's best described here or in the discussion about alignment in #12, but my data is always aligned at byte boundaries, so I guess it's better placed here, because ultimately I need to access bits of some bytes. I hope it's not totally off-topic...

What I'm talking about are the types DIF and DIFE from the standard EN 13757-3:2012, Communication systems for meters and remote reading of meters - Part 3: Dedicated application layer, which is sadly not publicly available. Both types are defined as follows:

DIF

| Bit   | 7                 | 6                     | 5 4            | 3 2 1 0                                |
|-------|-------------------|-----------------------|----------------|----------------------------------------|
| Value | Extension bit (E) | LSB of storage number | Function field | Data field: length and coding of data  |

DIFE

| Bit   | 7                 | 6                | 5 4    | 3 2 1 0        |
|-------|-------------------|------------------|--------|----------------|
| Value | Extension bit (E) | (Device) Subunit | Tariff | Storage number |

The extension bit of DIF tells us whether a DIFE is present, and the same bit of each DIFE tells whether another DIFE follows, until the bit of the last DIFE is cleared. What I'm interested in are the storage number, tariff and subunit. For all three of these, ALL bits of ALL DIFs and DIFEs need to be collected individually to form the resulting values one can work with. Unneeded bits of those three subparts are simply 0, so one won't ever get alignment issues, even if only one additional DIFE is needed to e.g. completely describe a storage number, but no subunit or tariff. How many DIFs and DIFEs are available is not known beforehand; one needs to parse those fields byte by byte in order and check their extension bit.
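To illustrate the combining step just described, here is a minimal Java sketch (all names invented, not generated code), assuming the bit layout from the tables above: each DIFE contributes 4 storage-number bits, 2 tariff bits and 1 subunit bit, from least to most significant.

// Hypothetical combiner for one DIF plus its chain of DIFEs (names invented).
// difLsbStrgNr is DIF bit 6 (the LSB of the storage number); the arrays hold
// the corresponding sub-byte fields of each DIFE, in parsing order.
static long[] combineDifChain(int difLsbStrgNr, int[] difeStorageNr,
                              int[] difeTariff, int[] difeSubUnit) {
    long storageNr = difLsbStrgNr;  // becomes bit 0 of the result
    long tariff = 0, subUnit = 0;
    for (int i = 0; i < difeStorageNr.length; i++) {
        storageNr |= (long) difeStorageNr[i] << (1 + 4 * i); // 4 bits per DIFE
        tariff    |= (long) difeTariff[i]    << (2 * i);     // 2 bits per DIFE
        subUnit   |= (long) difeSubUnit[i]   << i;           // 1 bit per DIFE
    }
    return new long[] { storageNr, tariff, subUnit };
}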

I see two questions here: first, one would need to be able to access the needed bits within the bytes at all, e.g. with the suggested array syntax. Second, the accessed bits would need to be collected somehow. The latter might be something very specific to my use case, of course, and might simply be ignored in KS and left to some higher level in a user application.

Besides the array syntax approach to accessing bits, I would like to suggest thinking about whether it's possible to handle bits the same way bytes are currently handled. One would simply understand a bit field as an array of bits of some length, which is iterated bit by bit in order, and for which a seq could be defined describing the purpose of each bit or group of bits. Just like one can parse 2 bytes as one type using e.g. u2, one could use b2 to describe that 2 bits logically form one type. You have already mentioned this b2 thing somewhere in your other issues... The more important question is whether it's possible to treat a byte the same way you currently treat a byte array and reuse all of your existing syntax like seq, types, instances etc. From my point of view this would feel pretty natural and should be easy for KS users to understand; it is comparable to using a substream, only more detailed/recursive/whatever.

seq:
  - id:   some_byte
    type: u1
  - id:   dif
    type: dif
    bit_field: true
  - id:   dife
    type: dife
    bit_field: true
types:
  dif:
    seq:
      - id:   extension
        type: b1
      - id:   lsb_strg_nr
        type: b1
      - id:   function
        type: b2
      - id:   data_field
        type: b4
  dife:
    seq:
      - id:   extension
        type: b1
      - id:   sub_unit
        type: b1
      - id:   tariff
        type: b2
      - id:   storage_nr
        type: b4

bit_field would tell KS to parse the byte into some structure which individually provides some representation of each bit; everything else is much like what you currently have with _io and reading and such, just a bit different. ;-) In such a case it would be cool if you could automatically provide the originally read byte on the type, just like you do with _raw_* in case of a substream. This way you would implement exactly the same concepts as on byte arrays, only on a lower level.

For the collection thing... How about introducing some storage concept, comparable to what you already have with value instances? Value instances have two functions: they can be used to calculate some value, and additionally those values are stored in an instance. Sometimes instances themselves are only used to read and store some bytes out of the normal parsing order of a stream. Maybe it's possible to generalize this approach by providing some syntax focused on storage itself?

For collecting bits, the storage would simply be a list, like you already have with repeat, and this way one wouldn't get any problems with the size of a datatype. For my case, it would need to be defined in the scope of the parent of the bit_field, but you already have the concept of accessing a _parent anyway, and this would fit how enums are resolved as well. In the end one could have an additional keyword per type, which takes some bits and adds them into a named storage.

Something like the following:

seq:
[...]
types:
  dife:
    seq:
      - id:   extension
        type: b1
      - id:   sub_unit
        type: b1
        storage:
          name:  _parent.dife_sub_unit
          value: _
      - id:   tariff
        type: b2
        storage:
          name:  _parent.dife_tariff
          value: _
      - id:   storage_nr
        type: b4
        storage:
          name:  _parent.dife_storage_nr
          value: _
storage:
  - dife_sub_unit
  - dife_tariff
  - dife_storage_nr

This is much like switch-on already, and value could be either the completely read value, using _, or some part of it defined using expression syntax and bit operations. The storage itself is just a property of the current type.

In my case one wouldn't even need to access that storage from within KS; it would simply help to provide it as a property to some higher level of the app. This way I wouldn't need to get all DIFs and DIFEs per block of data and collect the needed bits on my own; KS would already have done that while parsing the data, and the only thing I'd need to do is combine all the collected bits. Which, in theory, KS could support as well in the future, if it knows that parsing has finished, by automatically adding some property containing the result. Much like you already do with _raw_* in case of substreams.

GreyCat commented 7 years ago

Sorry for the long time to reply, but I really wanted to read and understand your whole proposal. Let's start to eat the elephant by cutting it in parts, as they say :)

First thing: one needs to get bits parsed. I totally agree with your proposal that reading bits should be very similar to reading all other full-byte types, i.e. it should be the same seq, types, instances all the way. I'd like to propose two minor changes to your proposal:

- no explicit bit_field: true marker; a type consisting of bit fields can be treated like any other type
- bits: N instead of type: bN to specify the number of bits to read

This brings the types you've specified to something like:

seq:
  - id:   some_byte
    type: u1
  - id:   dif
    type: dif
  - id:   dife
    type: dife
    if: dif.extension != 0
    repeat: until
    repeat-until: _.extension == 0
types:
  dif:
    seq:
      - id:   extension
        bits: 1
      - id:   lsb_strg_nr
        bits: 1
      - id:   function
        bits: 2
      - id:   data_field
        bits: 4
  dife:
    seq:
      - id:   extension
        bits: 1
      - id:   sub_unit
        bits: 1
      - id:   tariff
        bits: 2
      - id:   storage_nr
        bits: 4

ams-tschoening commented 7 years ago

b2 shouldn't have meant that b3 isn't possible; b126785 should have been a perfectly valid value to read. I only thought that it might be useful to keep the type syntax and parse the x from bx as the actual number of bits to read. While this is more work than your approach, where you can access the number of bits to read easily, your approach has one more new keyword to be documented and such.

In the end, it's about taste again: bits: x is easier to understand on reading than type: bx, while I find the latter more in line with your current concepts.

rpavlik commented 7 years ago

A fly in the soup, or at least something to keep in mind when designing this, is that in some formats (EDID is one of them, but also many protocols involving embedded systems or sensors), you'll get something like two 12-bit integers x and y packed as follows, for instance (example based on my memory of one of the EDID fields):

- byte 0: low 8 bits of x
- byte 1: low 8 bits of y
- byte 2: high 4 bits of x and high 4 bits of y

For a more pathological (but still "in production") case, see some of the Wii Remote reports (they also have nice tables showing this idea, while I just made a little markdown list): the accelerometer bits are a bit crazy: http://wiibrew.org/wiki/Wiimote#Normal_Accelerometer_Reporting while the IR camera tracker is probably a more common case: http://wiibrew.org/wiki/Wiimote#Basic_Mode (see all 3 modes)

ghost commented 7 years ago

I'm sorry for offtopic.

It makes sense to have constants like u1, u2, u4, u8 exactly because there is, for example, no u3: not a single all-purpose programming language has a weird type like that.

Recently I've reversed some huge blobs... with u3 as the length field type. I've even wanted to propose implementing it in KS, but it's really better to do it like this: size: (high_byte << 16) | low_word, than to create a type for every uN in existence.

GreyCat commented 7 years ago

Recently I've reversed some huge blobs... with u3 as the length field type. I've even wanted to propose implementing it in KS,

I'm pretty persistent on this one. The only point in having u2, u4, u8 as built-in types in KS is efficiency: 99% of the time you can use built-in language primitives to parse those (which, in turn, eventually get translated into quick platform-specific CPU code). There is no u3-like type or parsing builtin in any language that we support now, so parsing it will really boil down to manual shift + binary OR arithmetic in any language. There is no point in having such types built in: it's just as efficient to do it using an instance, and, if you need it, you can declare it as a user type:

types:
  u3:
    seq:
      - id: high_byte
        type: u1
      - id: low_word
        type: u2
    instances:
      value:
        value: '(high_byte << 16) | low_word'

GreyCat commented 7 years ago

b2 shouldn't have meant that b3 isn't possible; b126785 should have been a perfectly valid value to read.

Actually, there are two major use cases here. From the implementation point of view, we'd either want:

- type: b1 up to type: b64 to specify parsing as integer
- bits: some_expression to specify parsing as true "bit array"

Shall we have two distinct types for sub-byte parsing?

GreyCat commented 7 years ago

@rpavlik Ok, thanks for the examples. It's clearly not a trivial task, so what I propose is to look at it from the implementation side. What will we want to get in generated code? I assume that in any case generated parsing code will have 2 distinct parts:

- reading raw bits from the stream;
- combining these bits into the resulting values.

On the other hand, generated writing (i.e. stream generation) code will mirror these by having parts to:

- split the values into individual bits;
- write these bits into the stream.

Thus, I propose to separate these two tasks. Let's discuss only the reading/writing part here, and let's open a separate issue to discuss combining/splitting?

GreyCat commented 7 years ago

Started #58 for split/combine discussion.

ams-tschoening commented 7 years ago

Shall we have two distinct types for sub-byte parsing? type: b1 up to type: b64 to specify parsing as integer; bits: some_expression to specify parsing as true "bit array"

I like that idea, but just to be sure: Would the first include something like b3, b23 and such or would you only provide useful boundaries of some kind?

Even if reading bits directly into some integer, up to the size of that integer, might be more efficient, in theory it's only a sub-case of reading an array of bits and combining the result afterwards. So if it makes anything in the implementation easier, one could start with bits only, map bXY to that, and optimize towards your mentioned approach afterwards. If... :-)

GreyCat commented 7 years ago

Would the first include something like b3, b23 and such or would you only provide useful boundaries of some kind?

I believe it will be parsed using a b([1-9][0-9]*) regexp, with the resulting length integer checked to be between 1 and 64 inclusive. Parsing 65 bits and upwards into a "standard" integer is pretty much impossible in the current environment: Java does not have standard types like that, neither does .NET, the C++14 standard doesn't even mention the possibility of uint128_t, etc.
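For illustration, here is a sketch of that check in Java (illustrative only; the actual compiler is written in Scala, and the method name here is invented):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parse "bN" per the b([1-9][0-9]*) regexp and enforce 1 <= N <= 64.
static final Pattern BIT_TYPE = Pattern.compile("b([1-9][0-9]*)");

static int parseBitTypeWidth(String typeName) {
    Matcher m = BIT_TYPE.matcher(typeName);
    if (!m.matches())
        throw new IllegalArgumentException("not a bit type: " + typeName);
    int width = Integer.parseInt(m.group(1)); // throws on absurdly long digit runs
    if (width < 1 || width > 64)
        throw new IllegalArgumentException("bit width out of range 1..64: " + width);
    return width;
}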

So if it makes anything in the implementation easier, one could start with bits only, map bXY to that, and optimize towards your mentioned approach afterwards. If... :-)

bits is actually somewhat harder to implement. bXX maps clearly to integers, which already exist in every language. For bits, we'll need to:

ghost commented 7 years ago

any clues what shall we use for other languages?

In Perl, the simplest way I found to do it is to convert the bytes to a bit string and then extract the desired characters from the string:

my $val_str = "\xAA\xBB\xCC\xDD\xEE\xFF";
my $val_bits = unpack("B*", $val_str); # "101010101011101111001100110111011110111011111111"
my $result_str = substr($val_bits, 40, 8); # "11111111"
my $result_bits = pack("B*", $result_str); # "\xFF", i.e. the 0xFF byte

That can also be used to set specific bits:

my $val_str = "\xAA\xBB\xCC\xDD\xEE\xFF";
my $val_bits = unpack("B*", $val_str); # "101010101011101111001100110111011110111011111111"
substr($val_bits, 40, 8) = "10001000"; # "101010101011101111001100110111011110111010001000"
my $result_str = substr($val_bits, 40, 8); # "10001000"
my $result_bits = pack("B*", $result_str); # "\x88", i.e. the 0x88 byte

GreyCat commented 7 years ago

@sergeyzelenyuk That's probably not the best way to go. We're talking about several megabytes' worth of a bit array, and you need to extract one or two bits from it. I wonder if something like this will work?

ghost commented 7 years ago

I've tried it, it doesn't compile. Perl also has the built-in vec function to work with byte arrays but, unfortunately, it can work only with numbers up to 32 bits. Not many options.

ams-tschoening commented 7 years ago

I've tried it, doesn't compile.

And look at its source; it seems to me it's implementing an approach comparable to yours:

my $bin_str = sprintf("%lb", $obj->{Value});
[...]
substr($bin_str, $substr_offset, 1) = $bit_value;
[...]
my $dec_val = oct('0b' . $bin_str);

This approach seems quite common, though:

Bitmask::Data does not store bitmasks as integers internally, but as strings consisting of \0 and \1, hence making unlimited length bitmasks possible (32-bit perl can handle integer bitmasks only up to 40 bits).

http://search.cpan.org/~maros/Bitmask-Data-2.04/lib/Bitmask/Data.pm

Perl also has the built-in vec function to work with byte arrays but, unfortunately, it can work only with numbers up to 32 bits. Not many options.

Maybe I'm understanding the docs wrong, but isn't the input to vec of arbitrary length, and only the containing blocks and their BITS width are limited? And even that limit seems to be 64 bits if one assumes a 64-bit Perl these days.

http://perldoc.perl.org/functions/vec.html

Doesn't the following look promising?

$x->blsft($n);    # left shift $n places in base 2
$x->blsft($n,$b); # left shift $n places in base $b
                  # returns (quo,rem) or quo (scalar context)

$x->brsft($n);    # right shift $n places in base 2
$x->brsft($n,$b); # right shift $n places in base $b
                  # returns (quo,rem) or quo (scalar context)

Bitwise methods

$x->band($y); # bitwise and
$x->bior($y); # bitwise inclusive or
$x->bxor($y); # bitwise exclusive or
$x->bnot();   # bitwise not (two's complement)

Object property methods (do not modify the invocand)

$x->sign();     # the sign, either +, - or NaN
$x->digit($n);  # the nth digit, counting from the right
$x->digit(-$n); # the nth digit, counting from the left
$x->length();   # return number of digits in number

https://metacpan.org/pod/Math::BigInt https://metacpan.org/pod/bigint

GreyCat commented 7 years ago

Yeah, generally if a search for "bit array", "bit set" or words like that fails, the next best candidate is the various implementations of big integers / big numbers. But anyway, to rank them as "fitting the requirements" or "not fitting", we should probably first outline some operations we'd want to do with them. I see something like:

Anything else vital that I'm missing so far?

ghost commented 7 years ago

Maybe I'm understanding the docs wrong, but isn't the input to vec of arbitrary length, and only the containing blocks and their BITS width are limited? And even that limit seems to be 64 bits if one assumes a 64-bit Perl these days.

Just tested, it works as you say:

my $val = "\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\x80";
print vec($val, 71, 1); # 1

Don't know how I missed it. Thank you. Then probably we can use this option for Perl.

rpavlik commented 7 years ago

I think those operations handle the example cases I was thinking of (EDID, and embedded computing registers/sensor reports).

From a C++ implementation perspective, it may be useful to have a break between things that can fit in a platform max-size uint (there the operations basically reduce to making masks, complementing/inverting the masks, applying them with AND, shifting (possibly sign-extending, if someone's fractional byte is supposed to mean a 2's complement signed int), and OR for assignment) and thus might have a really fast implementation, and things that can't, which might be better suited to std::bitset, something akin to Boost's dynamic_bitset, or (horror) std::vector<bool>. (Note that I haven't looked into those recently - at least std::bitset, where the size is static, may actually do this "does it fit in an int" optimization already.)

(edit: see http://www.drdobbs.com/the-standard-librarian-bitsets-and-bit-v/184401382 - and it looks like llvm/clang libc++ does use size_t for storage with optimization for things that fit in a single one https://github.com/llvm-mirror/libcxx/blob/master/include/bitset though that doesn't help me as much as I'd like on say, AVR, where I'd really rather not use a longer integer than strictly required 😄

Also see here: http://softwareengineering.stackexchange.com/questions/284160/is-there-any-advantage-to-c-style-bit-manipulation-over-stdbitset - particularly the reference to "SWAR")

Also note that the "getting a slice as an integer" operation is obviously limited by the max integer width of a platform, and languages like C++ with both signed and unsigned types 😁 would at least want bit slices of non-power-of-two width extended to the next such width differently, depending on whether that slice is to be interpreted as signed or unsigned (sign extension, basically).
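The sign extension mentioned here reduces to two shifts once the slice sits in an integer; a minimal Java sketch (illustrative only, not from any runtime):

// Interpret an n-bit slice (1 <= n <= 64) held in the low bits of raw
// as a two's-complement signed value: shift the slice's sign bit up to
// bit 63, then arithmetic-shift back down (>> sign-extends in Java).
static long signExtend(long raw, int n) {
    int shift = 64 - n;
    return (raw << shift) >> shift;
}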

(edit: while it's a little low-level and perhaps not the easiest to read, with all the underscores indicating it's implementation code, the libc++ code for a "bit reference" might be a useful thing to look at, combined with the bitset code, as a way of seeing what operations might be desired: https://github.com/llvm-mirror/libcxx/blob/master/include/__bit_reference Now if only there were a bitset "slice" or slice view already included, in addition to the bit view - but as a runtime design that would typically be less efficient than just copying the subset, while from a codegen perspective extracting or changing slices is probably more efficient to do in place...)

Note also that there are larger ints (via vector processing stuff, usually), just no stdint.h typedefs that I'm aware of.

Would anyone want/need a popcount operation? I could only imagine it being useful in reading some obscure format - it's not uniquely invertible for writing.

I was going to bring up binary-coded decimal (like in USB descriptors), but that's typically a byte (two decimal digits from two hex digits) or more at a time, and even if not (odd number of digits?), as long as you can get some bits into an int and then multiply that int, you can build it from those elementary operations.

GreyCat commented 7 years ago

Ok, I guess we're more or less confident with the proposed interface, let's talk about implementation.

From an implementation point of view, I see two very distinct (and probably complementary) options:

(1) We add more calls to runtimes: something like read_bits(n) and read_bitset(n), which read n bits and return the result either as an integer or as a bitset. This is unaligned reading, so it obviously requires storing some additional state in the KaitaiStream object (i.e. the index of the current bit or the current bitmask). It is easy to implement from the compiler's POV, as it's the same one-to-one mapping. Stuff like this:

seq:
  - id: x
    type: b4
  - id: y
    type: b3
  - id: z
    type: b1

would compile to 3 distinct statements:

x = _io.readBits(4);
y = _io.readBits(3);
z = _io.readBits(1);
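For illustration, a minimal sketch of the state such a call would keep, assuming a plain InputStream source and MSB-first bit order (not the actual KaitaiStream code):

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class BitReader {
    private final InputStream in;
    private long bits = 0;    // not-yet-consumed bits of recently read bytes
    private int bitsLeft = 0; // how many of them are valid

    BitReader(InputStream in) { this.in = in; }

    long readBitsInt(int n) throws IOException {
        while (bitsLeft < n) {
            int b = in.read();          // top up the buffer a whole byte at a time
            if (b < 0) throw new EOFException();
            bits = (bits << 8) | b;
            bitsLeft += 8;
        }
        // Note: both the buffer and this mask overflow near n = 64 - the same
        // 63/64-bit caveat mentioned later in this thread.
        long mask = (1L << n) - 1;
        long res = (bits >>> (bitsLeft - n)) & mask;
        bitsLeft -= n;
        return res;
    }
}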

It's simple, straightforward and it works in the general case. However, it's not as optimal as the instances-based approach we have right now. Reading structures with big and complex flag fields (like, 10-20-30 flags) is probably much more efficient with the current instances approach: we just need to read several bytes, remember them, and effectively skip over them; we only get to decipher individual flag values on demand. In real-world tasks with big flag fields you rarely need all of them at once, you usually need only 1 or 2, and the instances-based approach allows you to do precisely that: read flags as bytes and then extract only a few individual flag values.

So, we come to alternative approach.

(2) We don't touch anything in the language runtimes and implement byte-aligned bit reads using existing tools. That means that we gather groups of bits at the pre-processing stage and generate flag-extracting instances automatically, i.e. the previous example becomes something like this:

private Integer _bf1, x, y, z; // boxed, so "not yet parsed" can be null

private void _read() {
    _bf1 = _io.readU1();
}

public int x() {
    if (x != null)
        return x;
    x = (_bf1 & 240) >> 4; // top 4 bits
    return x;
}

public int y() {
    if (y != null)
        return y;
    y = (_bf1 & 14) >> 1; // next 3 bits
    return y;
}

public int z() {
    if (z != null)
        return z;
    z = _bf1 & 1;
    return z;
}

So, what do you think of these two approaches? I guess we should implement both (probably preferring the second approach where possible, i.e. when we have fixed bit offsets), but which shall we implement first? Both would be kind of difficult: approach (1) requires lots of modifications in every runtime, and (2) requires some magic within the compiler to gather the bitfields.

ams-tschoening commented 7 years ago

As approach 2 is surely what most people hand-code right now, I think starting with that is preferable: in case of bugs and changes you don't need to change each and every runtime all the time, and as people are already somewhat familiar with this approach, it might make debugging easier for us users and such.

Should the name of the read byte provide a hint to the associated fields? But I guess one would easily run into problems with the name lengths of methods and such in different target languages.

Do you plan to create a getter for _bf1? You have one for raw data in case of substreams, and I see possible benefits here as well. Such a getter could automatically be enhanced with docs referring to the fields it belongs to. No problem with name lengths of vars and such at all.

LogicAndTrick commented 7 years ago

I think adding bit-level reading to the runtimes is the more difficult option (since there are a lot of runtimes), so possibly start with option (2) first? I think more formats are byte-aligned (at least the ones I have dealt with), so it's probably the more useful option too.

My experience with sub-byte formats is generally one of:

  1. Bit flags: 8 or 32 bit integer encoding a number of boolean values - always byte aligned
  2. "Lossy" integers: for example, RGB565 compressed colours - however I've always encountered these in byte-aligned blocks (e.g. 565 is 16 bits)
  3. Custom data types (e.g. half precision floats) which are probably unreasonable to support anyway as you would want to use a library class
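To make item 2 concrete, here is a minimal Java sketch of unpacking one RGB565 pixel into 8-bit channels (illustrative; scaling by shift is one common convention, not the only one):

// Unpack a 16-bit RGB565 pixel: bits 15-11 red, 10-5 green, 4-0 blue.
static int[] rgb565(int pixel) {
    int r = (pixel >> 11) & 0x1f; // 5 bits
    int g = (pixel >> 5)  & 0x3f; // 6 bits
    int b = pixel         & 0x1f; // 5 bits
    return new int[] { r << 3, g << 2, b << 3 }; // approx. scale to 0..255
}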

I would guess that bit-level reads in runtimes would generally take a bit of a performance hit, so you would want to avoid them if possible. I think adding an alternative to the b1 (etc.) types that is just syntactic sugar over instances would be nice. Maybe something like:

seq:
  - id: flags
    size: 2 # runtime just needs to read 2 bytes like normal
    # ignore the syntax... just a quick example...
    bitmap:
      - flag_1: b0 # 1 bit long, maps to boolean
      - flag_2: b1
      - something_else: b2-5 # 4 bits long, auto-maps to u1
      - another_idea: b6-11 << 2 # 6 bits long, shifted by 2
        type: s4                 # maps to s4
      # some good way to represent the bits here
      # basically generates instances but a bit more readable
      # compiler doesn't need to do as much magic

I know there are more cases, when you get to more complicated/compressed formats, that would benefit from bit-level precision, so that's still important... I think sub-byte precision in the runtimes will need something like the alignment options in #12 to be implemented around the same time as well.

GreyCat commented 7 years ago

Should the name of the read byte provide a hint to the associated fields? But I guess one would easily run into problems with the name lengths of methods and such in different target languages.

You mean generating stuff like _bf_x_y_z? Possible, but probably not very critical. And you're right about running up against name lengths quickly. Even if the JVM technically has a limit of 64K, a name joining ~30 verbose flag names in a single line probably won't be that helpful and useful.

Do you plan to create a getter for _bf1? You have one for raw data in case of substreams, and I see possible benefits here as well.

Why not, it's helpful for debugging.

Such a getter could automatically be enhanced with docs referring to the fields it belongs to. No problem with name lengths of vars and such at all.

Yeah, good idea.

I think adding bit-level reading to the runtimes is the more difficult option (since there are a lot of runtimes), so possibly start with option (2) first?

Well, the only catch is that adding "gathering fixed bitfields" mechanics into the compiler is not a piece of cake either. And technically, (1) is the general case and (2) is more like an optimization. The generated API is the same: the same getters for everything.

Maybe something like:

Um, you know what, ignoring the fact that the proposed syntax is really awful ;), I like your idea, maybe not in letter, but in spirit! Technically, we already do substreams, and this "fixed bitmap" actually looks exactly like a substream. Why don't we postulate that it must be in a separate type then, something like this:

seq:
  - id: flags
    size: 2
    type: flags
types:
  flags:
    seq:
      - id: flag_1
        type: b1
      - id: flag_2
        type: b1
      - id: something_else
        type: b4
      - id: another_idea
        type: b6

That's something we can do even without complex fixed-bitfield substructure detection. On the other hand, it's kind of cheating, eh...

koczkatamas commented 7 years ago

"unfortunately, it seems that there are no bignums or something like that in Python's stdlibs."

Python's number implementation is BigInteger by default AFAIK.

Yeah, generally if a search for "bit array", "bit set" or words like that fails, the next best candidate is the various implementations of big integers / big numbers.

I think the BigInteger implementations give comparable or better performance than bit set implementations, so they will probably do the job.

Can you give me examples where the big bit fields that were mentioned (like >= 1024 bits) are used?

public int z() {
    if (z != null)
        return z;
    z = _bf1 & 1;
    return z;
}

I think we should map b1 to bool instead of int (at least in languages which support it).

GreyCat commented 7 years ago

Python's number implementation is BigInteger by default AFAIK.

That's true, thanks! Then it will probably work.

I think the BigInteger implementations give comparable or better performance than bit set implementations, so they will probably do it.

Actually, there are quite a few differences. One of the most obvious is that when you want some small portion (n bits) of a large bitset (N bits), it's usually O(n) to extract that small portion. When you're dealing with big integers, extracting even one bit is an O(N) operation, usually with a huge constant and lots of pointless memory allocations (due to the fact that you need to construct a temporary 1-bit big integer to AND your integer with).

For example, on my laptop the following line of Ruby code:

# Initial setup
x = 1 << 100000000
bit_index = 99999998

# Code to benchmark
x & (1 << bit_index)

does ~98 iterations per second, which is ridiculously slow. If you set x = 1 and bit_index = 0, then it does ~12.3M iterations per second. A true bitset (bit array) implementation's performance shouldn't change (at least not that dramatically) based on the amount of bits in the bitset: they're stored as an array anyway, and it should be O(1) to index them. In Ruby, there is the [] operator for that. This:

x[bit_index]

yields ~15M iterations per second. However, I'm not really sure about all big integer implementations. Python seems to lack such an operation :(

Can you give me examples where the big bit fields that were mentioned (like >= 1024 bits) are used?

For example, a 1-bit image stored in Microsoft's .bmp file is essentially a huge bit array (width * height bits). Lots of games use similar 1-bit arrays to store some mapping data (i.e. cell properties like empty / occupied). Some graph formats use 1-bit arrays to store an adjacency matrix. Lots of bloom filter implementations (when they're serialized to a disk file) are based on bit arrays.

I think we should map b1 to bool instead of int (at least in languages which support it).

Hmm, that might be a good idea :) And probably it's better to introduce it early, as it breaks compatibility.

GreyCat commented 7 years ago

More thoughts and comparisons on "true" bitset implementations.

I'm writing a simple function that is expected to output 1011_0011_1000_1111.

C#

C# uses the BitArray class for arbitrary-length bit arrays.

            // var buf = new byte[] { 0b11001101, 0b11110001 };
            var buf = new byte[] { 0xcd, 0xf1 };
            var ba = new BitArray(buf);

            for (int i = 0; i < ba.Length; i++)
            {
                Console.Write(ba[i] ? '1' : '0');
            }

Note that enumeration of bits goes least significant → most significant, thus we have to reverse the order in which bits are normally written out as binary literals.

Java

Basically the same stuff as C#. Bit index 0 to n = least significant to most significant bit.

        byte[] buf = new byte[] { (byte) 0b11001101, (byte) 0b11110001 };
        BitSet bs = BitSet.valueOf(buf);
        for (int i = 0; i < bs.length(); i++) {
            System.out.print(bs.get(i) ? '1' : '0');
        }

C++

C++'s bitset, unfortunately, seems to lack public methods to construct it from an array of bytes. The standard constructor takes a C string, but expects it to be a string literal of 0s and 1s (one byte per bit). Judging by the GNU C++ STL implementation:

This means that, even if we get to push a byte array into the C++ bitset's private member, we'll have to repack it in a very different manner. SO answers suggest that people routinely initialize bitsets manually, doing bit-by-bit sets in a loop.

Ruby

Ruby's Bignum, unfortunately, also lacks a "from byte array" constructor. What's even worse than C++: it lacks a "set a bit" operation. SO answers tend to use shift + sum (and temporary arrays), which is probably O(n^2) at best.

GreyCat commented 7 years ago

I've just committed proof-of-concept code for Java and Ruby that implements readBitsInt(n) and adds support for type: bX. A very simple test is there too.

Basically, it boils down to stuff like

  - id: bits_a
    type: b1
  - id: bits_b
    type: b3
  - id: bits_c
    type: b4

compiling to

        this.bitsA = this._io.readBitsInt(1);
        this.bitsB = this._io.readBitsInt(3);
        this.bitsC = this._io.readBitsInt(4);

Nothing too fancy so far, but it already works, probably for the majority of the cases ;)

The current Java implementation is not fool-proof, i.e. it will most likely have problems reading 63- and 64-bit values, and even reading 58 bits while we already have 7 bits in the buffer would most likely result in a problem due to long overflow.
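A concrete instance of the problem (a property of Java itself, not of this runtime):

long mask63 = (1L << 63) - 1; // 0x7fffffffffffffff, as expected
long mask64 = (1L << 64) - 1; // shift counts are taken mod 64, so this is
                              // (1L << 0) - 1 == 0, not an all-ones mask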

KOLANICH commented 7 years ago

BTW, why do you use b4? I think that we need a more uniform convention where there is no u1, but u8 in its place. I think everyone will be comfortable with writing the size of numbers in bits, because in stdlibs they are written in bits. What do we get? We get signed data types like s34 for a 34-bit signed integer. The KS compiler and runtime should deal with this internally; a programmer should directly get the smallest type supported by the machine.

koczkatamas commented 7 years ago

@GreyCat could you add enum support for bit types too?

Currently there is a restriction that only types with the ReadableType trait can be converted to enums:

case numType: IntType with ReadableType => EnumType(classNameToList(enumName), numType)

If I remove the ReadableType restriction from IntType or allow BitsType as well (without ReadableType), then it seemingly works in JavaScript, but I am not sure about other languages:

case numType: IntType with ReadableType => EnumType(classNameToList(enumName), numType)
case numType: BitsType => EnumType(classNameToList(enumName), numType)

GreyCat commented 7 years ago

BTW, why do you use b4?

Because we've discussed it, and, I believe, we've got a consensus.

I think that we need a more uniform convention where there is no u1, but u8 in its place. I think everyone will be comfortable with writing the size of numbers in bits, because in stdlibs they are written in bits.

You're probably referring to something like https://github.com/kaitai-io/kaitai_struct/issues/55#issuecomment-261754743 ? Please don't mix discussions here and there.

GreyCat commented 7 years ago

@GreyCat could you add enum support for bit types too?

Yeah, sure thing ;) Looks like you've got this right, it should work.

koczkatamas commented 7 years ago

Should I remove the ReadableType constraint from IntType or add BitsType as a distinct switch case?

Won't this cause problems in other languages? (I don't understand why there is a restriction in the first place.)

GreyCat commented 7 years ago

Removing ReadableType should be perfectly OK, with one catch: we now have a special exception for BitsType(1) being mapped to boolean. So we should probably add some extra exceptions for that in the parseExpr implementations...

GreyCat commented 7 years ago

And, of course, we'll need tests for that first.

koczkatamas commented 7 years ago

Actually I think it would be better to parse b1 as an integer if the enum property is specified. It will be just a two-state enum, but it can be useful sometimes; otherwise we have to forbid this state (as you mentioned before), which gives the user fewer options.

I looked at the code that does the code generation, but I cannot decide on the best way to modify it.

My ideas so far:

GreyCat commented 7 years ago

Then I propose that we have the following:

And .ksy then gets parsed like this:

I'll try to add relevant tests, if no one objects...

GreyCat commented 7 years ago

I've just added an align_to_byte() method to the runtime API: this is a simple method that clears all bit buffers and effectively aligns the reading pointer firmly to a byte boundary. ksc now generates calls to align_to_byte() properly, and there is a new test that checks this behavior. The only language that has an align_to_byte() implementation now is Ruby, but I'm gonna add this method to all runtimes soon.
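In terms of the bit-buffer sketch shown earlier, such a method would presumably boil down to something like:

// Discard any partially consumed byte so the next read starts on a
// byte boundary (a sketch, not the actual runtime code).
public void alignToByte() {
    bits = 0;
    bitsLeft = 0;
}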

GreyCat commented 7 years ago

I believe this is mostly done. I've split the bitset discussion into separate issue #112. Given that this came out as a major feature of 0.6 and no one has complained since, I'm closing this.