Open KOLANICH opened 6 years ago
To be frank, I don't understand the majority of this proposal.
KSC generates incorrect code for them, assuming a certain structure of their modules.
With respect to custom processing calls, ksc generates code that assumes you will provide an implementation of that custom processor. If you're re-using an existing implementation of LZ4, it's only natural that ksc would have no idea about the internals of that particular implementation, and you would need to provide a wrapper that translates ksc's encode(...) and decode(...) calls into the relevant calls of your existing implementation.
For the JS target the situation is much worse. KSC fails to compile the ksy for this target.
What do you mean by "much worse"? Any examples?
What do you mean by "much worse"?
It generates no code and throws an error. At least in the WebIDE.
Any examples?
https://github.com/kaitai-io/kaitai_struct_formats/pull/97
With respect to custom processing calls, ksc generates code that assumes you will provide an implementation of that custom processor.
For example, for Python that code imports packages. It imports packages having the same names as stdlib packages have. I guess this should be done another way.
1. the runtime should provide an abstract base class / interface
2. we should create our own class based on it and register it in the runtime
This approach can be used in interpreted languages. The modifications needed to use this approach in languages like C++ and Rust are described in the top post.
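A minimal Python sketch of how such an interface and registration could look; KaitaiProcessor and register_processor are hypothetical names, not part of any existing runtime:

from abc import ABC, abstractmethod

class KaitaiProcessor(ABC):
    # Hypothetical abstract base class the runtime could provide.
    @abstractmethod
    def decode(self, data: bytes) -> bytes:
        # Invert the processing (e.g. decompress).
        ...

    def encode(self, data: bytes) -> bytes:
        # Apply the processing (e.g. compress); optional for read-only use.
        raise NotImplementedError

# Hypothetical registry: the .ksy file refers to processors by these names.
_PROCESSORS = {}

def register_processor(name, cls):
    _PROCESSORS[name] = cls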
kaitai-io/kaitai_struct_formats#97
Ok, so you've introduced several custom processing formats. Then you're supposed to create a class for every custom processing format, each implementing the same kind of interface (i.e. CustomDecoder and/or CustomEncoder). In case of formats like lz4, you won't want to reimplement everything from scratch, so your implementation would be just a wrapper around some existing library (e.g. for Python, for Java, etc).
@KOLANICH I actually went ahead and thought that it would be a good moment to start an "official" collection of processing routines for compression, so here is proof of concept:
https://github.com/kaitai-io/kaitai_compress
This is how you invoke an LZ4 algorithm from this library:
https://github.com/kaitai-io/kaitai_compress/blob/master/_test/ksy/test_lz4.ksy#L6
and here's the actual wrapper "implementation" in Python:
https://github.com/kaitai-io/kaitai_compress/blob/master/python/kaitai/compress/lz4.py
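For reference, such a wrapper can be just a few lines long. A rough sketch of what it boils down to, using the third-party python-lz4 package's lz4.frame module (the actual file linked above may differ in details):

import lz4.frame  # third-party "lz4" package

class Lz4:
    def decode(self, data):
        # KSC hands in the raw bytes it has just read and expects
        # the decompressed bytes back.
        return lz4.frame.decompress(data)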
1. https://github.com/KOLANICH/kaitai_compress/blob/fixes/python/kaitai/compress/__init__.py
a) I guess it should be called "processor", because processing is not limited to compression.
b) 3 methods: one processes, another one inverts the processing if that is possible, and another one gets the arguments from a binary stream. The latter two are needed for serialization. Also it'd be nice to have inverse processing operations.
2. https://github.com/KOLANICH/kaitai_compress/blob/fixes/python/kaitai/compress/lz4.py
As I have said, we shouldn't use that approach. Initialization may be costly, so we create an object first and then reuse it.
Also note that the imports are inside functions, so they are not performed if a function is never called. Unfortunately this gives a small performance overhead: on every import Python checks if the module is already imported.
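A sketch of the stateful variant suggested here, with the import moved into the constructor so the cost is paid once per processor object instead of being re-checked on every call (illustrative only, not the actual kaitai_compress code):

class Lz4:
    def __init__(self):
        # Imported once, when the processor object is created; the lz4
        # package is only required if this processor is actually used.
        import lz4.frame
        self._lz4 = lz4.frame

    def decode(self, data):
        # Reuses the module reference, so Python's "already imported?"
        # check is not repeated on every decode() call.
        return self._lz4.decompress(data)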
@KOLANICH You just keep banging on an open door. These things are already implemented in the publicly released v0.8. If you want to change something, please at least take a look at what's already done.
"KaitaiProcessor" that you propose is actually already implemented as 2 interfaces: "CustomDecoder" (which brings decode
method) and "CustomEncoder" (which brings encode
method). Stateful initialization exists and one can define arbitrary set of arguments as well — however, I'm not sure if it makes sense to do that for LZ4. If we're going to do that, of course, this set of arguments need to be available for implementations in all languages.
already implemented as 2 interfaces: "CustomDecoder" (which brings decode method) and "CustomEncoder" (which brings encode method).
Thank you for the info.
1. I only found CustomDecoder in the Java and C# runtimes. CustomEncoder is not present anywhere in the org.
2. I guess we need to redesign the interface. I'm going to do some prototyping first, but I think we need the following features:
Before I begin, I guess I should point out the fact that we actually have documentation for that.
1. I only found CustomDecoder in Java and C# runtimes.
There's also custom_decoder for C++, but generally you're right, because they actually only make sense for statically typed languages.
CustomEncoder is not present anywhere in the org.
Correct, because we don't have #27 which could have used them :(
2. I guess we need to redesign the interface.
The short answer is "No, we do not, at least not at this point".
The longer answer is that in order to do a better interface, one needs to start with how it's being used and answer the question, "what would be a better interface". Here's how it is used now:
// Simple version: we want just byte array
this._raw_buf = this._io.readBytes(50);
MyCustomProcessor _process__raw_buf = new MyCustomProcessor(key());
this.buf = _process__raw_buf.decode(this._raw_buf);
// More complex version: we want user data type in its own IO stream
this._raw__raw_buf = this._io.readBytes(50);
MyCustom _process__raw__raw_buf = new MyCustom(5);
this._raw_buf = _process__raw__raw_buf.decode(this._raw__raw_buf);
KaitaiStream _io__raw_buf = new ByteBufferKaitaiStream(_raw_buf);
this.buf = new Bar(_io__raw_buf, this, _root);
So, how can we make it better? We just need an interface that takes a bounded IO stream as input and returns another (decrypted, decompressed) stream as output, i.e.:
class MyCustomProcessorStream extends KaitaiStream {
// ...
}
// Gets byte array
BoundKaitaiStream ioSrc = this._io.substream(50);
MyCustomProcessorStream ioDest = new MyCustomProcessorStream(ioSrc, key());
this.buf = ioDest.readBytesFull(); // KaitaiStream method!
// Gets user data type
BoundKaitaiStream ioSrc = this._io.substream(50);
MyCustomProcessorStream ioDest = new MyCustomProcessorStream(ioSrc, key());
this.buf = new Bar(ioDest, this, _root);
This allows on-the-fly decoding, avoids the problem of gulping the whole stream into memory at once, etc. However, this is generally much harder to implement: instead of one decode() method that takes a byte array and returns a byte array, one needs to implement several dozen methods like readU1(), readS1(), etc. On the other hand, the simple "bytes in - bytes out" interface was much easier to implement, and what's more important, if and when we'll introduce a better interface, it would be easy to maintain backwards compatibility by wrapping CustomDecoder / CustomEncoder into a stream as it's done now.
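For illustration, in the Python runtime such a compatibility wrapper could be a few lines around real APIs (KaitaiStream, BytesIO); the helper name decoder_to_stream is hypothetical:

from io import BytesIO
from kaitaistruct import KaitaiStream

def decoder_to_stream(decoder, src_io, length):
    # Read the raw region, run the old-style bytes-in/bytes-out decoder,
    # and expose the result as an ordinary KaitaiStream.
    raw = src_io.read_bytes(length)
    return KaitaiStream(BytesIO(decoder.decode(raw)))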
On the other hand, simple "bytes in - bytes out" interface was much easier to implement, and what's more important, if and when we'll introduce a better interface, it would be easy to maintain backwards compatibility by wrapping CustomDecoder / CustomEncoder into a stream as it's done now.
That's what I suggest to do.
1. KSC decides if the object can/should be static. It can be static if the params are known at compile time.
2. It generates the code: fac = Processor(params)
3. When the data should be decoded, it spawns a context: ctx = fac(data)
4. Using this context, the decoded data can be accessed: ctx[start:stop]
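Roughly, in Python terms, the proposed factory/context split could look like this (all names hypothetical; decoding is shown as a single lazy step for brevity):

class Processor:
    # Hypothetical factory object; construction may be costly and happens once.
    def __init__(self, *params):
        self.params = params

    def __call__(self, data):
        return Context(self, data)       # step 3: ctx = fac(data)

    def _decode(self, data):
        raise NotImplementedError        # supplied by a concrete processor

class Context:
    # Step 4: gives sliced access to the decoded data, ctx[start:stop].
    def __init__(self, processor, data):
        self._processor = processor
        self._data = data
        self._decoded = None             # decoded lazily, then cached

    def __getitem__(self, key):
        if self._decoded is None:
            self._decoded = self._processor._decode(self._data)
        return self._decoded[key]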
One needs to be able to seek through the resulting stream, which would be pretty hard to do for most compression algorithms (i.e. to seek to the N-th uncompressed byte, you would still need to decode everything before the N-th).
That's why we need a context. We can do different things there.
ctx[0:block_size] = H(data), ctx[i*block_size : (i+1)*block_size] = H(ctx[(i-1)*block_size : i*block_size])
is pretty unseekable, but we can cache some previous pairs (i, ctx[i*block_size : (i+1)*block_size]) and assume some locality.
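A rough sketch of that caching idea, with H and block_size as placeholders and an LRU cache standing in for the cached (i, block) pairs:

from functools import lru_cache

class ChainedContext:
    # Block i of the output depends on block i-1, so random access means
    # decoding every preceding block; caching recent blocks exploits locality.
    def __init__(self, data, block_size, H):
        self._data = data
        self._bs = block_size
        self._H = H
        self._block = lru_cache(maxsize=64)(self._compute_block)

    def _compute_block(self, i):
        if i == 0:
            return self._H(self._data)
        return self._H(self._block(i - 1))

    def __getitem__(self, key):
        # key is a slice with explicit start and stop
        first, last = key.start // self._bs, (key.stop - 1) // self._bs
        buf = b"".join(self._block(i) for i in range(first, last + 1))
        base = first * self._bs
        return buf[key.start - base : key.stop - base]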
I have implemented a part of the cso file format.
It uses 2 different compressions that are currently unsupported: deflate and lz4. KSC generates incorrect code for them, assuming a certain structure of their modules. These assumptions are simply wrong. Fortunately, the manual fixes are rather small.
For the JS target the situation is much worse. KSC fails to compile the ksy for this target.
The proposal is to shift the knowledge about processors into the runtime entirely. Each processor is identified by a string. This string is mapped to a class which is a part of the runtime and which takes an arbitrary number of arguments. KSC makes no assumptions about the class's internal structure, but knows about its interface.
This construct works nicely for interpreted languages like JS, Python and Lua, since they can import dependencies at runtime. It can also work for Java and C# via reflection.
All the needed imports can be done within the ctor, so if a processor is never used, its dependency is not required.
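To make the lookup side concrete as well, a hypothetical sketch of what KSC could emit instead of a hard-coded import; get_processor stands for the runtime-side counterpart of the registration sketched earlier in this thread:

# Runtime side (hypothetical): a plain string-to-class mapping.
def get_processor(name, *args):
    return _PROCESSORS[name](*args)

# Generated-code side (hypothetical): no hard-coded import of a concrete module.
#   self._process_buf = get_processor("lz4")   # the ctor performs any imports it needs
#   self.buf = self._process_buf.decode(self._raw_buf)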
It can be extended to C++ too in the following way:
1. no string-to-ctor mapping, everything is done at compile time
2. if a processor function with a given name is used, KSC generates a macro definition with a name derived from the processor function and uses the classes with the generated names:
3. the runtime has the following code for every processor function supported: