Closed tyilo closed 7 years ago
Yeah, there are already plans for it :)
Subscribing to this - was hoping to make a non-spaghetti-code EDID parser using ksy, but it's bitfields galore, as well as crazy string formats (PNPID packs 3 A-Z ASCII characters into two bytes)
@rpavlik The funny thing is that I was digging through EDID format just a week ago or so. I guess I'll publish what I've got so far, so we won't be duplicating the effort at the very least. Here it is.
I would like to add the current use case I have to deal with to this discussion, because you might find it interesting when deciding some design. Not sure if it's best described here or in the discussion about alignment in #12, but my data is always aligned at byte boundaries, so I guess it's better placed here, because ultimately I need to access bits of some bytes. I hope it's not totally off-topic...
What I'm talking about are the types DIF and DIFE from the standard EN 13757-3:2012, Communication systems for meters and remote reading of meters - Part 3: Dedicated application layer, which is sadly not publicly available. Both types are defined as follows:
DIF
- Bit 7: Extension bit (E)
- Bit 6: LSB of storage number
- Bits 5-4: Function field
- Bits 3-0: Data field: length and coding of data
DIFE
- Bit 7: Extension bit (E)
- Bit 6: (Device) Subunit
- Bits 5-4: Tariff
- Bits 3-0: Storage Number
The extension bit of DIF tells us whether a DIFE is present, and the same bit of each DIFE tells whether another DIFE follows, until the bit of the last DIFE is cleared. What I'm interested in are the storage number, tariff and subunit. For all three of those parts, ALL bits of ALL DIF and DIFEs need to be collected individually to form the resulting values one can work with. Unneeded bits of those three subparts are simply 0, so one won't ever get alignment issues even if only one additional DIFE is needed to e.g. completely describe a storage number, but no subunit or tariff. How many DIF and DIFEs are available is not known beforehand; one needs to parse those fields byte by byte in order and check their extension bit.
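To make the collection process concrete, here is a plain-Python sketch (outside of KS; function and variable names are my own) of walking such a chain and accumulating the three values bit by bit:

```python
def parse_dif_chain(data):
    """Walk a DIF + DIFE chain per the layout above and collect
    storage number, tariff and subunit.

    DIF:  bit 7 = extension, bit 6 = LSB of storage number,
          bits 5-4 = function, bits 3-0 = data field.
    DIFE: bit 7 = extension, bit 6 = subunit,
          bits 5-4 = tariff, bits 3-0 = storage number.
    """
    pos = 0
    dif = data[pos]
    pos += 1
    storage_nr = (dif >> 6) & 1  # LSB of storage number from the DIF
    tariff = 0
    sub_unit = 0
    shift_storage, shift_tariff, shift_sub = 1, 0, 0
    ext = dif & 0x80
    while ext:  # each extension bit announces one more DIFE
        dife = data[pos]
        pos += 1
        storage_nr |= (dife & 0x0F) << shift_storage        # bits 3-0
        tariff |= ((dife >> 4) & 0x03) << shift_tariff      # bits 5-4
        sub_unit |= ((dife >> 6) & 0x01) << shift_sub       # bit 6
        shift_storage += 4
        shift_tariff += 2
        shift_sub += 1
        ext = dife & 0x80
    return storage_nr, tariff, sub_unit, pos
```

For example, a DIF of 0xC0 (extension set, storage LSB 1) followed by a DIFE of 0x05 (no extension, storage nibble 5) yields a storage number of 0b1011 = 11.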
I see two questions with this: First, one would need to be able to access the needed bits within the bytes at all, like with the suggested array syntax or such. Second is that the accessed bits would need to be collected somehow. This might be something very special for my use case of course and might simply be ignored in KS and be left to some higher level in a user application.
Besides the array syntax approach to access bits, I would like to suggest thinking about whether it's possible to handle bits the same way bytes are currently handled. One would simply understand a bit field as an array of bits of some length, which is iterated bit by bit in order, and for which a seq could be defined describing the purpose of each bit or group of bits. Just like one can parse 2 bytes as one type using e.g. u2, one could use b2 to describe that 2 bits logically form one type. You have already mentioned this b2 thing somewhere in your other issues... The more important question is whether it's possible to treat a byte the same way you currently treat a byte array and reuse all of your existing syntax like seq, types, instances etc. From my point of view this would feel pretty natural and should be easy for KS users to understand; it is comparable to using a substream, only more detailed/recursive/whatever.
seq:
  - id: some_byte
    type: u1
  - id: dif
    type: dif
    bit_field: true
  - id: dife
    type: dife
    bit_field: true
types:
  dif:
    seq:
      - id: extension
        type: b1
      - id: lsb_strg_nr
        type: b1
      - id: function
        type: b2
      - id: data_field
        type: b4
  dife:
    seq:
      - id: extension
        type: b1
      - id: sub_unit
        type: b1
      - id: tariff
        type: b2
      - id: storage_nr
        type: b4
bit_field would tell KS to parse the byte into some structure which individually provides some representation of each bit; everything else is much like what you currently have with _io and reading and such, just a bit different. ;-) In such a case it would be cool if you could automatically provide the originally read byte on the type, just like you do with _raw_* in case of a substream. This way you would implement exactly the same concepts as on byte arrays, only at some lower level.
For the collection thing... How about introducing some storage concept, comparable to what you already have with value instances? Value instances have two functions: they can be used to calculate some value, and additionally those values are stored in an instance. Sometimes instances themselves are only used to read and store some byte out of the normal parsing order of a stream. Maybe it's possible to generalize this approach by providing some syntax focused on storage itself?
For collecting bits, the storage would simply be a list, like you already do with repeat, and this way one wouldn't get any problems with the size of a datatype. For my case, it would need to be defined in the scope of the parent of the bit_field, but you already have the concept of accessing a _parent anyway, and this would fit how enums are resolved as well. In the end one could have an additional keyword per type, which takes some bits and adds them into a named storage.
Something like the following:
seq:
  [...]
types:
  dife:
    seq:
      - id: extension
        type: b1
      - id: sub_unit
        type: b1
        storage:
          name: _parent.dife_sub_unit
          value: _
      - id: tariff
        type: b2
        storage:
          name: _parent.dife_tariff
          value: _
      - id: storage_nr
        type: b4
        storage:
          name: _parent.dife_storage_nr
          value: _
storage:
  - dife_sub_unit
  - dife_tariff
  - dife_storage_nr
This is much like switch-on already, and value could be either the complete read value using _ or some part of it defined using expression syntax and bit operations. The storage itself is just a property of the current type.
In my case one wouldn't even need to access that storage from within KS; it would simply help to provide it as a property to some higher level of the app. This way I wouldn't need to get all DIF and DIFEs per block of data and collect the needed blocks of data on my own; KS would already have done that while parsing the data, and the only thing I need to do is combine all the collected bits and such. Which, in theory, KS could support as well in the future, if it knows that parsing has finished, by automatically adding some property containing the result. Much like you already do with _raw_* in case of substreams.
Sorry for taking so long to reply, but I really wanted to read and understand your whole proposal. Let's start to eat the elephant by cutting it into parts, as they say :)
First thing: one needs to get bits parsed. I totally agree with your proposal that reading bits should be very similar to reading all other full-byte types, i.e. it should be the same seq, types, instances all the way. I'd like to propose two minor changes to your proposal:
- Drop bit_field: true. In the future, we'll probably have some unaligned bit parse mode, so we'll need to specify some alignment mode, but: (1) it's probably better activated on type spec level, not attribute reference level, (2) right now we have neither, so it's all the same.
- It makes sense to have u1, u2, u4, u8 exactly because there is, for example, no u3: not a single all-purpose programming language has a weird type like that. It's a bit different with bit integers: it makes sense to read 3-bit integers, 6-bit integers, etc. Sometimes you'd like to read an arbitrary number of bits. So, my proposal is to have another key, like size: here; for example, bits: 3, bits: 7, bits: some_field + 3.
This brings the types you've specified to something like:
seq:
  - id: some_byte
    type: u1
  - id: dif
    type: dif
  - id: dife
    type: dife
    if: dif.extension != 0
    repeat: until
    repeat-until: _.extension == 0
types:
  dif:
    seq:
      - id: extension
        bits: 1
      - id: lsb_strg_nr
        bits: 1
      - id: function
        bits: 2
      - id: data_field
        bits: 4
  dife:
    seq:
      - id: extension
        bits: 1
      - id: sub_unit
        bits: 1
      - id: tariff
        bits: 2
      - id: storage_nr
        bits: 4
b2 shouldn't have meant that b3 isn't possible; b126785 should have been a perfectly valid value to read. I only thought that it might be useful to keep the type syntax and parse the x from the bx as the actual number of bits to read. While this is more work than your approach, where you can easily access the number of bits to read, your approach adds one more new keyword to be documented and such.
In the end, it's about taste again: bits: x is easier to understand on reading than type: bx, while I find the latter more in line with your current concepts.
A fly in the soup, or at least something to keep in mind when designing this, is that in some formats (EDID is one of them, but also many protocols involving embedded systems or sensors), you'll get something like two 12-bit integers x and y packed as follows, for instance (example based on my memory of one of the EDID fields):
- x[0:7]
- y[0:7]
- x[8:11] and y[8:11]
So one nibble (could be the high or low one, etc.) of each value is held till later and packed with the others in a single byte. For a more pathological (but still "in production") case, see some of the Wii Remote reports (they also have nice tables showing this idea, while I just made a little markdown list). Accelerometer bits being a bit crazy: http://wiibrew.org/wiki/Wiimote#Normal_Accelerometer_Reporting while the IR camera tracker is probably a more common case: http://wiibrew.org/wiki/Wiimote#Basic_Mode (see all 3 modes)
I'm sorry for offtopic.
It makes sense to have constants like u1, u2, u4, u8 exactly because there is, for example, no u3, because not a single all-purpose programming language has a weird type like that.
Recently I've reversed some huge blobs... with u3 as the length field type. I've even wanted to propose implementing it in KS, but it's really better to do it like this: size: (high_byte << 16) | low_word, than to create a type for each uN in existence.
Recently I've reversed some huge blobs... with u3 as length field type. I've even wanted to propose to implement it into KS,
I'm pretty persistent on this one. The only point in having u2, u4, u8 as built-in types in KS is efficiency: 99% of the time you can use built-in language primitives to parse those (which, in turn, eventually get translated into quick platform-specific CPU code). There is no u3-like type or parsing builtin in any language that we support now, so parsing it will really boil down to manual shift + binary OR arithmetic in any language. There is no point in having such types built in: it's just as efficient to do it with an instance, and, if you need it, you can declare it as a user type:
types:
  u3:
    seq:
      - id: high_byte
        type: u1
      - id: low_word
        type: u2
    instances:
      value:
        value: '(high_byte << 16) | low_word'
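Outside of KS, the value instance above boils down to this trivial computation (plain-Python sketch; the function name is mine):

```python
def read_u3be(data):
    """Compose a big-endian unsigned 24-bit integer the way the u3
    user type above does: one high byte followed by a big-endian
    16-bit low word."""
    high_byte = data[0]
    low_word = (data[1] << 8) | data[2]  # the u2 part
    return (high_byte << 16) | low_word
```

So b"\x01\x02\x03" yields 0x010203, with no runtime support beyond ordinary byte reads.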
b2 shouldn't have meant that b3 isn't possible, b126785 should have been a perfectly valid value to read.
Actually, there are two major use cases here. From the implementation point of view, we'd either want:
Shall we have two distinct types for sub-byte parsing?
- type: b1 up to type: b64 to specify parsing as an integer
- bits: some_expression to specify parsing as a true "bit array"
@rpavlik Ok, thanks for the examples. It's clearly not a trivial task, so what I propose is to look at it from the implementation side. What will we want to get in generated code? I assume that in any case generated parsing code will have 2 distinct parts:
On the other hand, generated writing (i.e. stream generation) code will mirror these by having parts to:
Thus, I propose to separate these two tasks. Let's discuss only the reading/writing part here, and let's open a separate issue to discuss combining/splitting?
Started #58 for split/combine discussion.
Good day Mikhail Yakshin, on Thursday, 1 December 2016 at 10:30 you wrote:
Shall we have two distinct types for sub-byte parsing? type: b1 up to type: b64 to specify parsing as integer; bits: some_expression to specify parsing as true "bit array"
I like that idea, but just to be sure: would the first include something like b3, b23 and such, or would you only provide useful boundaries of some kind?
Even if reading bits up to some upper limit directly into such an integer might be more efficient, in theory it's only a sub-case of reading an array of bits and combining the result afterwards. So if it makes anything in the implementation easier, one could start with bits only, map bXY to that, and optimize to your mentioned approach afterwards. If... :-)
Kind regards,
Thorsten Schöning
-- Thorsten Schöning E-Mail: Thorsten.Schoening@AM-SoFT.de AM-SoFT IT-Systeme http://www.AM-SoFT.de/
Telefon...........05151- 9468- 55 Fax...............05151- 9468- 88 Mobil..............0178-8 9468- 04
AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln AG Hannover HRB 207 694 - Geschäftsführer: Andreas Muchow
Would the first include something like b3, b23 and such, or would you only provide useful boundaries of some kind?
I believe it will be parsed using a b([1-9][0-9]*) regexp, and then the resulting length integer will be checked to be between 1 and 64 inclusive. Parsing 65 bits and upwards into a "standard" integer is pretty much impossible in the current environment: Java does not have standard types like that, .NET as well, the C++14 standard doesn't even mention the possibility of uint128_t, etc.
So if it makes anything in the implementation easier, one could start with bits only, map bXY to that, and optimize to your mentioned approach afterwards. If... :-)
bits is actually somewhat harder to implement. bXX always maps clearly to integers, which already exist in every language. For bits, we'll need to:
Any clues on what we shall use for other languages?
In Perl, the most simplest way that I found to do it is convert bytes to a bit string and then extract desired characters from the string:
my $val_str = "\xAA\xBB\xCC\xDD\xEE\xFF";
my $val_bits = unpack("B*", $val_str);     # "101010101011101111001100110111011110111011111111"
my $result_str = substr($val_bits, 40, 8); # "11111111"
my $result_bits = pack("B*", $result_str); # "\xFF" (0xFF)
That can also be used to set specific bits:
my $val_str = "\xAA\xBB\xCC\xDD\xEE\xFF";
my $val_bits = unpack("B*", $val_str);     # "101010101011101111001100110111011110111011111111"
substr($val_bits, 40, 8) = "10001000";     # "101010101011101111001100110111011110111010001000"
my $result_str = substr($val_bits, 40, 8); # "10001000"
my $result_bits = pack("B*", $result_str); # "\x88" (0x88)
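For comparison, a plain-Python sketch of the same extraction (function name is mine), using Python's arbitrary-precision int in place of a bit string:

```python
def extract_bits(data, bit_offset, n):
    """Extract n bits starting at bit_offset, counting MSB-first across
    the whole byte string, like the substr() on the unpacked bit
    string above."""
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")  # whole buffer as one big int
    shift = total_bits - bit_offset - n  # distance of the slice from the LSB
    return (value >> shift) & ((1 << n) - 1)
```

Extracting 8 bits at bit offset 40 from "\xAA\xBB\xCC\xDD\xEE\xFF" yields 0xFF, matching the Perl example.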
@sergeyzelenyuk That's probably not the best way to go. We're talking about several megabytes worth of a bit array, and you need to extract one or two bits from it. I wonder if something like this will work?
I've tried it, it doesn't compile. Perl also has the built-in vec function to work with byte arrays but, unfortunately, it can work only with numbers of less than 32 bits. Not many options.
Good day Sergey Zelenyuk, on Thursday, 1 December 2016 at 16:51 you wrote:
I've tried it, doesn't compile.
And look at its source; it seems to me it implements an approach comparable to yours:
my $bin_str = sprintf("%lb", $obj->{Value});
[...]
substr($bin_str, $substr_offset, 1) = $bit_value;
[...]
my $dec_val = oct('0b' . $bin_str);
This approach seems quite common, though:
Bitmask::Data does not store bitmasks as integers internally, but as strings consisting of \0 and \1, hence making unlimited length bitmasks possible (32-bit perl can handle integer bitmasks only up to 40 bits).
http://search.cpan.org/~maros/Bitmask-Data-2.04/lib/Bitmask/Data.pm
Perl also has the built-in vec function to work with byte arrays but, unfortunately, it can work only with numbers of less than 32 bits. Not many options.
Maybe I'm understanding the docs wrong, but isn't the input to vec of arbitrary length, and only the containing blocks and their BITS width are limited? And even that limit seems to be 64 bits if one assumes a 64-bit Perl these days.
http://perldoc.perl.org/functions/vec.html
Doesn't the following look promising?
$x->blsft($n);    # left shift $n places in base 2
$x->blsft($n,$b); # left shift $n places in base $b
                  # returns (quo,rem) or quo (scalar context)
$x->brsft($n);    # right shift $n places in base 2
$x->brsft($n,$b); # right shift $n places in base $b
                  # returns (quo,rem) or quo (scalar context)
Bitwise methods:
$x->band($y);     # bitwise and
$x->bior($y);     # bitwise inclusive or
$x->bxor($y);     # bitwise exclusive or
$x->bnot();       # bitwise not (two's complement)
Object property methods (do not modify the invocand):
$x->sign();       # the sign, either +, - or NaN
$x->digit($n);    # the nth digit, counting from the right
$x->digit(-$n);   # the nth digit, counting from the left
$x->length();     # return number of digits in number
https://metacpan.org/pod/Math::BigInt https://metacpan.org/pod/bigint
Yeah, generally if a search for "bit array", "bit set" or words like that fails, the next best candidate is the various implementations of big integers / big numbers. But anyway, to rank them as "fitting the requirements" or not, we should probably first outline some operations we'd want to do with them. I see now something like:
Anything else vital that I'm missing so far?
Maybe I'm understanding the docs wrong, but isn't the input to vec of arbitrary length, and only the containing blocks and their BITS width are limited? And even that limit seems to be 64 bits if one assumes a 64-bit Perl these days.
Just tested, it works as you say:
my $val = "\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\x80";
print vec($val, 71, 1); # 1
Don't know how I missed it. Thank you. Then probably we can use this option for Perl.
I think those operations handle the example cases I was thinking of (EDID, and embedded computing registers/sensor reports).
From a C++ implementation perspective, it may be useful to distinguish between things that can fit in a platform's max-size uint (where the operations basically reduce to making masks, complementing/inverting them, applying them with AND, shifting (possibly sign extending if someone's fractional byte is supposed to mean a two's complement signed int), and OR for assignment) and thus might have a really fast implementation, and things that can't, which might be better suited to std::bitset, something akin to Boost's dynamic_bitset, or (horror) std::vector<bool>. (Note that I haven't looked into those recently; at least std::bitset, where the size is static, may actually do this "does it fit in an int" optimization already.)
(edit: see http://www.drdobbs.com/the-standard-librarian-bitsets-and-bit-v/184401382 - and it looks like llvm/clang libc++ does use size_t for storage, with an optimization for things that fit in a single one: https://github.com/llvm-mirror/libcxx/blob/master/include/bitset - though that doesn't help me as much as I'd like on, say, AVR, where I'd really rather not use a longer integer than strictly required 😄
Also see here: http://softwareengineering.stackexchange.com/questions/284160/is-there-any-advantage-to-c-style-bit-manipulation-over-stdbitset - particularly the reference to "SWAR")
Also note that the "getting a slice as an integer" operation is obviously limited by the max integer width of a platform, and languages like C++ with both signed and unsigned types 😁 at least would want bit slices of non-power-of-two width extended to the next smallest such width differently depending on whether that slice is to be interpreted as signed or unsigned (sign extension, basically).
(edit: while it's a little low-level and perhaps not the easiest to read with all the underscores indicating it's implementation, the libc++ code for a "bit reference" might be a useful thing to look at combined with the bitset code, as a way of looking at what operations might be desired. https://github.com/llvm-mirror/libcxx/blob/master/include/__bit_reference Now if only there was a bitset "slice" or slice view already included, in addition to the bit view - but as a runtime design that would be less efficient typically than just copying the subset, while from a codegen perspective extracting slices or changing slices is probably more efficient to do in place... )
Note also that there are larger ints (via vector processing stuff, usually), just not stdint.h typedefs that I'm aware of.
Would anyone want/need a popcount operation? I could only imagine it being useful in reading some obscure format - not uniquely invertible for writing.
Was going to bring up binary-coded decimal (like in USB descriptors), but that's typically a byte (two decimal digits from two hex digits) or more at a time, and even if not (odd number of digits?), as long as you can get some bits to an int, then multiply that int, you can build it from those elementary operations.
Ok, I guess we're more or less confident with the proposed interface; let's talk about implementation.
From the implementation point of view, I see two very distinct (and probably complementary) options:
(1) We add more calls to runtimes: something like read_bits(n) and read_bitset(n), which read n bits and return the result either as an integer or as a bitset. This is unaligned reading, so it obviously requires storing some additional state in the KaitaiStream object (i.e. the index of the current bit or the current bitmask). It is easy to implement from the compiler's POV, as it's the same one-to-one mapping. Stuff like that:
seq:
  - id: x
    type: b4
  - id: y
    type: b3
  - id: z
    type: b1
would compile to 3 distinct statements:
x = _io.readBits(4);
y = _io.readBits(3);
z = _io.readBits(1);
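For illustration, a minimal plain-Python sketch of how such an unaligned, MSB-first readBits could keep its state in the stream object (this is not the actual KaitaiStream runtime code; names are mine):

```python
class BitReader:
    """Minimal MSB-first unaligned bit reader over a byte buffer."""

    def __init__(self, data):
        self.data = data
        self.byte_pos = 0
        self.bits = 0       # buffered bits, right-aligned in an int
        self.bits_left = 0  # number of valid bits in the buffer

    def read_bits(self, n):
        # Pull in whole bytes until the buffer holds at least n bits.
        while self.bits_left < n:
            self.bits = (self.bits << 8) | self.data[self.byte_pos]
            self.byte_pos += 1
            self.bits_left += 8
        # Take the n most significant buffered bits.
        self.bits_left -= n
        result = self.bits >> self.bits_left
        self.bits &= (1 << self.bits_left) - 1  # drop consumed bits
        return result
```

Reading b4, b3, b1 from a single byte 0xB7 (= 1011 0111) yields 11, 3 and 1 respectively, i.e. the state (bit buffer + counter) is exactly what lets the three generated statements above share one byte.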
It's simple, straightforward and it works in the general case. However, it's not as optimal as the instances-based approach we have right now. Reading structures with big and complex flag fields (like 10-20-30 flags) is probably much more efficient with the current instances approach: we just need to read several bytes, remember them and effectively skip over them, and we'll get to decipher individual flag values only on demand. In real-world tasks with big flag fields you rarely need all of them at once, you usually need only 1 or 2, and the instances-based approach allows you to do precisely that: read flags as bytes and then extract only a few individual flag values.
So, we come to alternative approach.
(2) We don't touch anything with language runtimes and implement byte-aligned bit reads using existing tools. That means that we're gathering groups of bits on pre-processing stage, and generate flag-extracting instances automatically, i.e. previous example becomes something like that:
private void _read() {
    _bf1 = _io.readU1();
}

public Integer x() {
    if (x != null)
        return x;
    x = (_bf1 & 240) >> 4;
    return x;
}

public Integer y() {
    if (y != null)
        return y;
    y = (_bf1 & 14) >> 1;
    return y;
}

public Integer z() {
    if (z != null)
        return z;
    z = _bf1 & 1;
    return z;
}
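The masks (240, 14, 1) and shifts in these generated getters follow mechanically from each field's width and offset; a generic helper (my own sketch, not compiler output) shows the rule:

```python
def extract_field(byte_value, shift, width):
    """Extract `width` bits starting `shift` bits from the LSB,
    mirroring the (_bf1 & mask) >> shift pattern in the generated
    getters above."""
    mask = ((1 << width) - 1) << shift
    return (byte_value & mask) >> shift
```

For the b4/b3/b1 layout above, the compiler would emit extract_field(_bf1, 4, 4), extract_field(_bf1, 1, 3) and extract_field(_bf1, 0, 1), which is exactly the masks 240, 14 and 1.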
So, what do you think of these two approaches? I guess we should implement both (probably preferring second approach where it's possible — i.e. when we have fixed bit offsets), but which shall we implement first? Both would be kind of difficult, as approach (1) requires lots of modifications in every runtime, and (2) requires some magic within the compiler to gather the bitfields.
As approach 2 is surely what most people do right now, I think starting with that is preferable: in case of bugs and changes you don't need to change each and every runtime all the time, and as people are already somewhat familiar with this approach, it might make debugging easier for us users and such.
Should the name of the read byte provide a hint about the associated fields? But I guess one would easily run into problems with name lengths of methods and such in different target languages.
Do you plan to create a getter for _bf1? You have such for raw data in case of substreams and I see possible benefits here as well. Such a getter could be automatically enhanced with docs referring to the fields it belongs to. No problem with name lengths of vars and such at all.
I think adding bit-level reading to the runtimes is the more difficult option (since there are a lot of runtimes), so possibly start with option (2) first? I think that more formats are aligned to byte-level (at least ones that I have dealt with) so it's probably the more useful option too.
My experience with sub-byte formats is generally one of:
I would guess that bit-level reads in runtimes would generally take a bit of a performance hit, so you would want to avoid them if possible. I think adding an alternative to the b1 (etc.) types that is just syntactic sugar over instances would be nice. Maybe something like:
seq:
  - id: flags
    size: 2  # runtime just needs to read 2 bytes like normal
    # ignore the syntax... just a quick example...
    bitmap:
      - flag_1: b0                # 1 bit long, maps to boolean
      - flag_2: b1
      - something_else: b2-5      # 4 bits long, auto-maps to u1
      - another_idea: b6-11 << 2  # 6 bits long, shifted by 2
        type: s4                  # maps to s4
    # some good way to represent the bits here
    # basically generates instances but a bit more readable
    # compiler doesn't need to do as much magic
I know there are more cases when you get to more complicated/compressed formats that would benefit from bit-level precision, so that's still important... I think sub-byte precision in the runtimes will need something like the alignment options in #12 to be implemented around the same time as well.
Should the name of the read byte provide a hint to the associated fields? But I guess one would easily run into problems with name length of methods and such in different target languages.
You mean, generating stuff like _bf_x_y_z? Possible, but probably not very critical. And you're right about running up against name lengths quickly. Even though the JVM technically has a limit of 64K, dealing with a name made of ~30 verbose flag names joined in a single line probably won't be that helpful and useful.
Do you plan to create a getter for _bf1? You have such for raw data in case of substreams and I see possible benefits here as well.
Why not, it's helpful for debugging.
Such a getter could be automatically enhanced by docs referring to the fields it belongs to. No problem with name lengths of vars and such at all.
Yeah, good idea.
I think adding bit-level reading to the runtimes is the more difficult option (since there are a lot of runtimes), so possibly start with option (2) first?
Well, the only catch is that adding "gathering fixed bitfield" mechanics into the compiler is not a piece of cake either. And technically, (1) is a general case and (2) is more like optimization. The generated API is the same — same getters for everything.
Maybe something like:
Um, you know what, ignoring the fact that the proposed syntax is really awful ;), I like your idea, maybe not in letter, but in spirit! Technically, we already do substreams, and this "fixed bitmap" actually looks exactly like a substream. Why don't we postulate that it must be in a separate type then, something like this:
seq:
  - id: flags
    size: 2
    type: flags
types:
  flags:
    seq:
      - id: flag_1
        type: b1
      - id: flag_2
        type: b1
      - id: something_else
        type: b4
      - id: another_idea
        type: b6
That's something that we can do even without complex fixed bitfield substructure detection. On the other hand, it's kind of cheating, eh...
"unfortunately, it seems that there are no bignums or something like that in Python's stdlibs."
Python's number implementation is BigInteger by default AFAIK.
Yeah, generally if search by "bit array", "bit set" or words like that fails, the next best candidate is various implementation of big integer / big numbers.
I think the BigInteger implementations give comparable or better performance than bit set implementations, so they will probably do it.
Can you give me examples of where such big bit fields (i.e. >= 1024 bits) as were mentioned are actually used?
public Integer z() {
    if (z != null)
        return z;
    z = _bf1 & 1;
    return z;
}
I think we should map b1 to bool instead of int (at least in languages which support it).
Python's number implementation is BigInteger by default AFAIK.
That's true, thanks! Then it will probably work.
I think the BigInteger implementations give comparable or better performance than bit set implementations, so they will probably do it.
Actually, there are quite a few differences. One of the most obvious is that when you want some small portion (n bits) of a large bitset (N bits), it's usually O(n) to extract that small portion. When you're dealing with big integers, extracting even one bit is an O(N) operation, usually with a huge constant and lots of pointless memory allocations (due to the fact that you need to construct a temporary 1-bit big integer to AND your integer with).
For example, on my laptop the following line of Ruby code:
# Initial setup
x = 1 << 100000000
bit_index = 99999998
# Code to benchmark
x & (1 << bit_index)
does ~98 iterations per second, which is ridiculously slow. If you set x = 1 and bit_index = 0, then it does ~12.3M iterations per second. A true bitset (bit array) implementation's performance shouldn't change (at least, not that dramatically) based on the number of bits in the bitset — they're stored as an array anyway, so it should be O(1) to index them. In Ruby, there is a [] operator for that. This:
x[bit_index]
yields ~15M iterations per second. However, I'm not really sure about all big integer implementations. Python seems to lack such an operation :(
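Python can at least emulate that indexing with a shift, though with the big-integer cost profile described above rather than O(1) array access (trivial sketch):

```python
def bit_at(x, i):
    """Return bit i of an arbitrary-size Python int
    (i = 0 is the least significant bit)."""
    return (x >> i) & 1
```

This works for ints of any size, but the shift itself still has to walk the big integer's internal digits, so it is not the O(1) bitset indexing being asked for.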
Can you give me examples where that big bit fields are used which were mentioned (like >= 1024 bits big)?
For example, 1-bit image stored in a Microsoft's .bmp file is essentially a huge bit array (width * height bits). Lots of games use similar 1-bit arrays to store some mapping data (i.e. cell properties like empty / occupied). Some graph formats use 1-bit arrays to store adjacency matrix. Lots of bloom filter implementations (when they're serialized to the disk file) are based on bit arrays.
I think we should map b1 to bool instead of int (at least in languages which support it).
Hmm, that might be a good idea :) And probably it's better to introduce it early, as it breaks the compatibility.
More thoughts and comparisons on "true" bitset implementations.
I'm writing a simple function that is expected to output 1011_0011_1000_1111.
C# uses BitArray class for arbitrary length bit arrays.
// var buf = new byte[] { 0b11001101, 0b11110001 };
var buf = new byte[] { 0xcd, 0xf1 };
var ba = new BitArray(buf);
for (int i = 0; i < ba.Length; i++)
{
    Console.Write(ba[i] ? '1' : '0');
}
Note that enumeration of bits goes least significant → most significant, thus we have to reverse the order that bits are normally written out as binary literals.
Basically the same stuff as C#. Bit index 0 to n = least significant to most significant bit.
byte[] buf = new byte[] { (byte) 0b11001101, (byte) 0b11110001 };
BitSet bs = BitSet.valueOf(buf);
for (int i = 0; i < bs.length(); i++) {
    System.out.print(bs.get(i) ? '1' : '0');
}
C++ bitset, unfortunately, seems to lack public methods to construct it from an array of bytes. The standard constructor takes a C string, but expects it to be a string literal of 0s and 1s (one byte per bit). Judging by the GNU C++ STL implementation:
- bits are stored in a _M_w array of words
- the _M_getword helper internally uses _S_whichword to find a word in the _M_w array: __pos / _GLIBCXX_BITSET_BITS_PER_WORD
This means that, even if we get to push a byte array into the C++ bitset private member, we'll have to repack it in a very different manner. SO answers suggest that people routinely initialize bitsets manually, doing bit-by-bit sets in a loop.
Ruby's Bignum, unfortunately, also lacks a "from byte array" constructor. What's even worse than C++: it lacks a "set a bit" operation. SO answers tend to use shift + sum (and temporary arrays), which is probably O(n^2) at best.
I've just committed proof-of-concept code for Java and Ruby that implements readBitsInt(n) and adds support for type: bX. A very simple test is there too.
Basically, it boils down to stuff like

```yaml
- id: bits_a
  type: b1
- id: bits_b
  type: b3
- id: bits_c
  type: b4
```
compiling to

```java
this.bitsA = this._io.readBitsInt(1);
this.bitsB = this._io.readBitsInt(3);
this.bitsC = this._io.readBitsInt(4);
```
Nothing too fancy so far, but it already works, probably for the majority of the cases ;)

The current Java implementation is not fool-proof: it will most likely have problems reading 63- and 64-bit values, and even reading 58 bits while we already have 7 bits in the buffer would most likely overflow a `long`.
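To illustrate how such a `readBitsInt(n)` can work, here is a minimal big-endian bit-reader sketch in Ruby (not the actual Kaitai Struct runtime code; the class name and details are simplified for illustration):

```ruby
# Minimal sketch of a big-endian bit reader. A real implementation
# must also handle EOF and (in Java) 64-bit overflow; Ruby's
# arbitrary-precision integers sidestep the latter.
class BitReader
  def initialize(bytes)
    @bytes = bytes.dup
    @bits = 0       # accumulated bits, right-aligned
    @bits_left = 0  # how many valid bits are in @bits
  end

  def read_bits_int(n)
    # pull in whole bytes until at least n bits are buffered
    while @bits_left < n
      @bits = (@bits << 8) | @bytes.shift
      @bits_left += 8
    end
    # take the top n bits, keep the rest buffered
    @bits_left -= n
    result = @bits >> @bits_left
    @bits &= (1 << @bits_left) - 1
    result
  end
end

r = BitReader.new([0b1011_0011])
r.read_bits_int(1)  # => 1     (bits_a)
r.read_bits_int(3)  # => 0b011 (bits_b)
r.read_bits_int(4)  # => 0b0011 (bits_c)
```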
BTW, why do you use `b4`? I think we need a more uniform convention where there is no `u1`, but `u8` stands for it. I think everyone will be comfortable with writing sizes of numbers in bits, because in stdlibs they are written in bits. What do we get? We get signed data types like `s34` for a 34-bit signed integer. The KS compiler and runtime should deal with this internally; a programmer should directly get the smallest type supported by the machine.
@GreyCat could you add enum support for bit types too?

Currently there is a restriction that only types with the `ReadableType` trait can be converted to enums:

```scala
case numType: IntType with ReadableType => EnumType(classNameToList(enumName), numType)
```

If I remove the `ReadableType` restriction from `IntType` or allow `BitsType` as well (without `ReadableType`), then it seemingly works in JavaScript, but I am not sure about other languages:

```scala
case numType: IntType with ReadableType => EnumType(classNameToList(enumName), numType)
case numType: BitsType => EnumType(classNameToList(enumName), numType)
```
> BTW, why do you use `b4`?
Because we've discussed it, and, I believe, we've got a consensus.
> I think that we need a more uniform convention where there is no u1, but u8 is for it. I think everyone will be comfortable with writing size of numbers in bits because in stdlibs they are written in bits.
You're probably referring to something like https://github.com/kaitai-io/kaitai_struct/issues/55#issuecomment-261754743 ? Please don't mix discussions here and there.
> @GreyCat could you add enum support for bit types too?
Yeah, sure thing ;) Looks like you've got this right, it should work.
Should I remove the `ReadableType` constraint from `IntType`, or add `BitsType` as a distinct switch case? Won't this cause problems in other languages? (I don't understand why there is the restriction in the first place.)
Removing `ReadableType` should be perfectly ok, with one catch: we now have a special exception for `BitsType(1)` to be mapped to boolean. So, probably we should add some extra exceptions for that in `parseExpr` implementations...
And, of course, we'll need tests for that first.
Actually, I think it would be better to parse `b1` as an integer if the `enum` property is specified. It will be just a two-state enum, but it can be useful sometimes; otherwise we have to forbid this state (as you mentioned before), which gives the user fewer options.
I looked at the code which does the code generation, but I cannot decide which is the best way to modify it. My ideas so far:

- add an `isEnum: boolean` argument to `parseExpr`
- set `dataType` to `EnumType(BitsType(1))` or to a custom type (e.g. `OneBitAsEnum`)

Then I propose that we have the following:
And .ksy then gets parsed like this:

- `{"type": "b1"}` => `BitsTypeBool`
- `{"type": "b1", "enum": "foo"}` => `EnumType(..., BitsType(1))`
- `{"type": "b3"}` => `BitsType(3)`
- `{"type": "b3", "enum": "foo"}` => `EnumType(..., BitsType(3))`

I'll try to add relevant tests, if no one objects...
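For concreteness, a hypothetical .ksy fragment exercising these cases might look like this (the field and enum names are invented for the example):

```yaml
seq:
  - id: flag
    type: b1          # no enum => mapped to boolean
  - id: mode
    type: b3
    enum: mode_kind   # bit-sized integer mapped to an enum
enums:
  mode_kind:
    0: standard
    1: extended
```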
I've just added an `align_to_byte()` method to the runtime API: this is a simple method that clears all bit buffers and effectively firmly aligns the reading pointer to a byte boundary. ksc now generates calls to `align_to_byte()` properly, and there is a new test that checks this behavior. The only language that has an `align_to_byte()` implementation now is Ruby, but I'm gonna add this method to all runtimes soon.
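A minimal self-contained sketch of what such a method can look like in Ruby (hypothetical, not the actual runtime source):

```ruby
# Hypothetical bit reader showing the effect of align_to_byte:
# discard buffered bits so the next read starts on a byte boundary.
class AligningBitReader
  def initialize(bytes)
    @bytes = bytes.dup
    @bits = 0
    @bits_left = 0
  end

  def read_bits_int(n)
    while @bits_left < n
      @bits = (@bits << 8) | @bytes.shift
      @bits_left += 8
    end
    @bits_left -= n
    result = @bits >> @bits_left
    @bits &= (1 << @bits_left) - 1
    result
  end

  def align_to_byte
    @bits = 0
    @bits_left = 0
  end
end

r = AligningBitReader.new([0b1010_0000, 0xff])
r.read_bits_int(3)  # => 0b101 = 5
r.align_to_byte     # drop the remaining 5 bits of the first byte
r.read_bits_int(8)  # => 0xff = 255
```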
I believe this is mostly done. I've split the bitset discussion into a separate issue, #112. Given that this came out as a major feature of 0.6 and no one has complained since, I'm closing this.
Sometimes a file format doesn't use whole bytes for every field, so it would be nice to support bit fields. Something like `b[1-64]{le,be}`.