Open webbnh opened 6 years ago
ksc currently provides no guarantees like that, i.e. data in seq
is supposed to be read/written in sequence, thus there is no automatic "deferred processing" here by design. I agree that adding a warning and/or error message on precompile step would be nice, but it might be not very trivial to do so, for example, this is easy:
seq:
- id: a
size: 16
process: xor(b)
- id: b
type: u1
a
uses b
, b
is defined after a
=> problem. This is trickier:
seq:
- id: a
size: 16
process: xor(c)
- id: b
type: u1
instances:
c:
value: b + 1
a
uses c
, which requires b
, which is undefined at time then it will be used. I would assume that there are much trickier ones involving conditions (i.e. a ? b : c
), involving recursion and boiling down to a complexity of halting problem.
That said, I personally never found it to be a big problem, as typically a target language compiler is good in issuing warning about such behavior in generated code (i.e. usage of uninitialized variables).
I think the fact that KSC wraps all accesses to the fields inside nice accessor functions masks the fact that the accesses might be to uninitialized memory. (There's no way, inside the accessor function, to tell that the field is uninitialized, and there's probably no way in the calling function (_read()
) to tell what the accessor function does, without in-lining at the source code level.)
So, I'm willing to buy into the fact that my .ksy
is bad, and I'm willing to grant that it might be hard to detect the use-before-read case, but it might not be hard to defer the processing step, which would solve at least my problem.
The code generated by KSC for me look similar to this:
void records_t::fileheader_t::_read() {
...
m__raw_serial_number = m__io->read_bytes(16);
m_serial_number = kaitai::kstream::process_xor_one(m__raw_serial_number, (xor_key() & 255));
m__raw_hash = m__io->read_bytes(32);
m_hash = kaitai::kstream::process_xor_one(m__raw_hash, (xor_key() & 255));
m_xor_key = m__io->read_u2le();
...
}
At least conceptually, it seems like it would be straightforward for the compiler to emit the calls to process_xor_one()
after emitting the last read call.
In the absence of that compiler feature, how can I specify to KSC that I want the field processed by a value that I haven't read yet? Can I create a type and apply it lazily? (Or, is there some way to do the processing in a value instance?)
Ok, so there are 2 distinct questions:
Second one is easier to answer to in some respects. Existing solutions are:
seq:
- id: placeholder
size: 16 + 32
- id: xor_key
type: u2
instances:
serial_number:
pos: 0
size: 16
process: xor(xor_key & 0xff)
hash:
pos: 16
size: 32
process: xor(xor_key & 0xff)
seq:
- id: serial_number_placeholder
size: 16
if: serial_number_pos >= 0 # always true, but stores pos
- id: hash_placeholder
size: 32
if: hash_pos >= 0 # always true, but stores pos
- id: xor_key
type: u2
instances:
serial_number_pos:
value: _io.pos
serial_number:
pos: serial_number_pos
size: 16
process: xor(xor_key & 0xff)
hash_pos:
value: _io.pos
hash:
pos: hash_pos
size: 32
process: xor(xor_key & 0xff)
And one not-yet-implemented yet lazy seq solution:
seq:
- id: serial_number
process: xor(xor_key & 0xff)
size: 16
lazy: true
- id: hash
process: xor(xor_key & 0xff)
size: 32
lazy: true
- id: xor_key
type: u2
This is obviously not a panacea, but typically wrapping everything possible in lazy: true
will help untangling stuff like that. Obviously it couldn't help with stuff where positions/sizes are not fixed, infinitely recursive stuff, etc.
Commenting some of the questions / proposals:
I think the fact that KSC wraps all accesses to the fields inside nice accessor functions masks the fact that the accesses might be to uninitialized memory.
First of all, KSC does not do it everywhere. For example, in JavaScript and Python, one just accesses raw object's variables using foo.bar
syntax, there are no getter functions involved. In Go, there is actually a painful distinction between seq fields, which are available as foo.Bar
(i.e. just variable access in a structure) and instances, which require foo.Bar()
call, which is not only returning the value of the instance, but also an error code (as any instance might hit an error during parsing).
Second, I don't think that detection of uninitialized memory access is particularly hard in runtime even when it's not done automatically by the language, i.e. we can in theory prepare an extra boolean field for every attribute that will start as false
, and will be set to true
after successful definition being done. Then, a getter for foo
would actually look like:
if self._i_foo:
return self.foo
else:
raise Exception("foo is accessed, but it is yet undefined")
May be it's a good idea to have some sort of flag to generate that kind of "safety" machinery. It will be obviously slower (and less readable), but definitely safer and won't allow that silent corruption you've mentioned.
At least conceptually, it seems like it would be straightforward for the compiler to emit the calls to
process_xor_one()
after emitting the last read call.
No, it's not:
seq:
- id: a
type: foo
size: 16
process: xor(0xaa)
- id: b
size: a.some_attribute
In this case, b
is dependent on a
being fully parsed and available to determine b's size.
I don't think it's completely infeasible to support this in "sane" cases where the graph of "field cross-references" forms a DAG: for each field, you presumably know which other fields its process
function will reference, statically. You can then perform a topological ordering, and then compute the process
functions in that order.
(That said, it's definitely a new feature, I guess)
@FireyFly It is definitely possible, but I just wanted to point you that it's trickier than it looks like. process
is actually the easiest of them all, as it does not really affect processing of the stream beyond the attribute in any way. size
, if
, type
switching, repeat-expr
, repeat-until
— all these guys do affect stream state. This one is unsolvable for reading, but technically fully solvable for writing, for example:
seq:
- id: a
size: b
- id: b
type: u4
And this one looks very similar it, but it is actually totally ok, as sum of size(a)+size(b) is const.
seq:
- id: a
size: 20 + c
- id: b
size: 20 - c
- id: c # actually always starts at constant offset = 20 + c + 20 - c = 40
type: u4
@FireyFly, I think constructing and traversing the DAG would allow the compiler to flag cases which aren't sane (which would have been a boon to me...). However, as @GreyCat has demonstrated, the constraint of having to read the fields in order can conflict with a requirement to process them out of order.
@GreyCat, thanks as always for your prompt attention and cogent responses. I plan to pursue the "Hacky but automatic" solution -- since the fields occur in the middle of my record, have to calculate and maintain the offsets manually is unappealing. And, as for referencing the fields out of order, that is the sort of thing up with which I shall not put. ;-)
(I tried to search for similar issue reports, but failed to find any -- sorry if this is a duplicate.)
I have a
.ksy
file containing the following:The generated code reads the fields in order (which is not surprising), but it processes
serial_number
when it is read, whenxor_key
is still uninitialized.I would have expected either that KSC would have issued a warning/error at compile time or that it would have deferred the processing of
serial_number
. (Replacing the direct reference to thexor_key
expression with an value instance yields the same result.) Having it "silently fail" (i.e., produce bad results at runtime) is...disappointing.Any suggestions on how I should implement this? (I would expect that I should be able to define an value instance for the serial number, but I don't know how to specify a
process
attribute for an instance, and I don't know how to apply an XOR operator to each byte of a byte array....)Thanks!