kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io
3.95k stars 192 forks source link

Go language support #146

Open 5k3105 opened 7 years ago

5k3105 commented 7 years ago

I believe the appropriate lower level libraries to use will be: https://golang.org/pkg/bytes/ https://golang.org/pkg/os/

general outline of a go program. Taken from a sprite editor I was making for it's 'low levelness'. github.com/virtao/GoTypeBytes has a lot of functions on bytes that seem applicable for this library.

package main // name of ksc here I think

import (
        "github.com/virtao/GoTypeBytes"
    "golang.org/pkg/bytes"
    "golang.org/pkg/os"
)

type pallete struct {
    r, g, b []int
}

func newpallete() {

    file, err := os.Open(filename) // For read access.
    if err != nil {
        e = "1"
    }

    fi, err := os.Stat(filename)
    if err != nil {
        e = "2"
    }

    data := make([]byte, fi.Size())
    fmt.Println("fi.Size:", fi.Size())

    count, err := file.Read(data)
    file.Close()
    if err != nil {
        e = "3"
    }

    fmt.Printf("read %d bytes: %q\n", count, data[:count])

    var bt bytes.Buffer

    bt.Write(data[:count])

    var r, g, b []int

    for ; count > 0; count-- {

        r = append(r, typeBytes.BytesToInt(bt.Next(1)))

        g = append(g, typeBytes.BytesToInt(bt.Next(1)))

        b = append(b, typeBytes.BytesToInt(bt.Next(1)))

        bt.Next(3)

    }

    p := &pallete{
        r: r,
        g: g,
        b: b,
    }

    fmt.Println(e)

    return p
}
GreyCat commented 7 years ago

Thanks for the suggestion! Let's go with our checklist. I presume you're acquainted with KS, so going to section 2. What would be Go equivalents of the following:

What are the basic Go coding standards / universally accepted practices that we'll follow (also as per checklist, section 2.4)?

5k3105 commented 7 years ago

just taking a stab at this --

KS type: https://golang.org/ref/spec#Numeric_types

uint8       the set of all unsigned  8-bit integers (0 to 255)
uint16      the set of all unsigned 16-bit integers (0 to 65535)
uint32      the set of all unsigned 32-bit integers (0 to 4294967295)
uint64      the set of all unsigned 64-bit integers (0 to 18446744073709551615)

int8        the set of all signed  8-bit integers (-128 to 127)
int16       the set of all signed 16-bit integers (-32768 to 32767)
int32       the set of all signed 32-bit integers (-2147483648 to 2147483647)
int64       the set of all signed 64-bit integers (-9223372036854775808 to 9223372036854775807)

float32     the set of all IEEE-754 32-bit floating-point numbers
float64     the set of all IEEE-754 64-bit floating-point numbers

complex64   the set of all complex numbers with float32 real and imaginary parts
complex128  the set of all complex numbers with float64 real and imaginary parts

byte        alias for uint8
rune        alias for int32

also take a look at https://github.com/virtao/GoTypeBytes/blob/master/typeBytes.go for conversion from and to these types directly from bytes.

KS 'stream': https://golang.org/pkg/bytes/ https://golang.org/pkg/io/ https://golang.org/pkg/os/

I believe these are the packages required but I have not experimented with this yet :)

a collection of fields parsed sequentially (seq)

A slice is a dynamically sized array of type. https://golang.org/ref/spec#Slice_types

fields := []int{1,2,3}

a number of fields parsed randomly or calculated (instances)

Not sure what you want here. There is a key value hash map in Go. So if key is 'field' then:

fields := make(map[string]interface{}) 

this would create a kv map with a string as the key and interface ('any' type that can be cast to correct type later). https://golang.org/ref/spec#Interface_types

illustration of slice and map: https://play.golang.org/p/K1gaQ1zhHg

definition of nested types (types)

is this structs? https://golang.org/ref/spec#Struct_types

lists of integer constants (enums) http://stackoverflow.com/questions/14426366/what-is-an-idiomatic-way-of-representing-enums-in-go#14426447

https://github.com/virtao/GoTypeBytes/blob/master/typeBytes.go That link also has these enums defined:

const (
    INT_SIZE       int = int(unsafe.Sizeof(0))
    UINT_SIZE      int = int(unsafe.Sizeof(uint(0)))
    FLOAT32_SIZE   int = 4
    FLOAT64_SIZE   int = 8
    DEFAULT_ENDIAN int = 0
    BIG_ENDIAN     int = 1
    LITTLE_ENDIAN  int = 2
)
GreyCat commented 7 years ago

KS type:

https://golang.org/ref/spec#Numeric_types

This list is also useful, but the primary question was how to present KS concept of "type". In the majority of target languages, it corresponds to a class. seq attributes usually correspond to private class members + getter / setter methods or properties. instances are more or less the same, but they are lazy (and thus have some sort of "not-yet-parsed" state by default) and they usually require more than one line of code in a getter, given that they usually have some sort of position, sizes, etc.

Reading a seq is usually implemented in a _read method, writing of seq is implemented in _write. In normal, read-only parser mode, class constructor gets KaitaiStream and automatically calls _read to do sequence parsing from that given stream.

How all that maps to Go concepts and practices?

a number of fields parsed randomly or calculated (instances)

Not sure what you want here.

Um, please don't take offense, but may be you'd want to get slightly more familiar with KS concepts, such as seq, instances, enums, nested types, etc?

5k3105 commented 7 years ago

KS type seems to follow: https://golang.org/ref/spec#Types

There are usually no getter setter methods. They are treated as a variable but can also be given methods:

https://gobyexample.com/methods

gives an example of this but with a more complex type, a struct. Yet it is still a 'type' you can see in it's declaration.

Give me a bit longer, I am new to KS and I need to read more. Thanks.

GreyCat commented 7 years ago

They are treated as a variable but can also be given methods:

Yeah, that should be useful as well. We could use that for "parse instances" (which define some out-of-sequence data located at some arbitrary offset in the stream) and "value instances" (which are basically values, calculated using other available attributes).

Give me a bit longer, I am new to KS and I need to read more. Thanks.

No rush, take your time :) Also, you might be interested in announcing this work on support of Go in KS in some Go communities. I've heard that they are very friendly and open to new things — so, chances are, we'll get more people interested in this support contributing ideas. N heads is better than one, anyway :)

5k3105 commented 7 years ago

I'll post something up on the reddit :) Thanks!

5k3105 commented 7 years ago

https://www.reddit.com/r/golang/comments/63oa75/kaitai_struct_a_new_way_to_develop_parsers_for/

tarm commented 7 years ago

I saw this on Reddit and am suddenly interested in Kaitai.

I am familiar with Go, but not with Kaitai or scala. I have read through the documentation about adding support for a new language and believe it should be pretty straightforward to get started with a prototype for Go.

Here are my thoughts on how the representation should be in Go (also a basic introduction to some Go concepts):

const ( Icmp IpProtocol = 1 Tcp IpProtocol = 6 Udp IpProtocol = 17 )


- Go is namespaced by package but otherwise the names for Kaitai types, enumerations, and enumeration values could all collide.  If that is allowed in Kaitai (likely), and collisions are a concern then the generated names might have to be Prefixed_ or something.  Protobuf handles that by prefixing enum values with the enum type name: `IpProtocol_Tcp IpProtocol = 6`.

I tried prototyping the interface with an `Unmarshal()` method, but I think that should return the number of bytes consumed from the stream.  Here is how the standard library does in with the asn1 encoding but that is also a bit awkward: https://golang.org/pkg/encoding/asn1/#Unmarshal

The API for the protobuf encoding and decoding might be interesting (it's also a little bit confusing when you first look at it because of the way the decoding references autogenerated metadata fields): https://godoc.org/github.com/golang/protobuf/proto

Here is an example of how I might generate code:
https://play.golang.org/p/fNbIPlOcpQ
The top level struct (`ExampleFile`) is still a little bit awkward and could use some more thinking.

What are the next steps?  Explicitly following the "new language" document?  @GreyCat Are you available to write the scala code generation if a reasonable template can be provided?  I can work more on the Streaming API.
tarm commented 7 years ago

Here is what the streaming API might look like in Go: https://play.golang.org/p/O1vKDZroZ1

The io.ReadSeeker could be either a an os.File or a bytes.Reader

GreyCat commented 7 years ago

@tarm Thanks for your interest!

names should be CamelCase

Judging by the examples, probably you mean that type names, sequence attribute names and enums-as-consts should be UpperCamelCase. What about instances?

The protobuf compiler handles this by prefixing the nested type with the parent type name and I think that's what the Kaitai compiler should do as well.

Ok, not a problem.

The typical way to handle binary data for a particular type is to add methods to that type and they should be called Marshal() and Unmarshal(). Those methods should return the Go error type and operate on byte slices: []byte.

I believe that goes well with seq, but what about instances? Is there any way to provide lazy parsing support of arbitrary data in current and/or other streams?

Here is an example of how I might generate code: https://play.golang.org/p/fNbIPlOcpQ

Thanks, it makes much more sense now! I believe that stuff like that:

    nn, err := e.TrackTitle.Unmarshal(b[n:])
    n += nn
    if err != nil {
        return n, err
    }

can be incapsulated into some sort of KaitaiStream-like library?

What are the next steps? Explicitly following the "new language" document?

We need:

  1. A minimal working example of what exactly we should generate with KS — and that generally means minimal working implementation of KaitaiStream / KaitaiStruct (if it would be of any use).
  2. Some testing code for that, ideally, one that we can incorporate into our tests repo.

Then I add new target -t go to the compiler, and we try to get it to generate exactly the HelloWorld.go code that you've provided based on hello_world.ksy. As soon as we'll succeed there, we'll start CI testing. Then just keep on porting more test specs to Go and fixing the compiler gradually more and more to make it work.

@GreyCat Are you available to write the scala code generation if a reasonable template can be provided? I can work more on the Streaming API.

Sure, I'm totally up to it ;)

GreyCat commented 7 years ago

Here is what the streaming API might look like in Go: https://play.golang.org/p/O1vKDZroZ1

Cool! I'd like to start a Go runtime repository, but we need to choose name that Go will be referred to everywhere in KS. I guess there are two alternatives - shall we use go or golang - which one is better?

The io.ReadSeeker could be either a an os.File or a bytes.Reader

By the way, what about reading performance when using os.File - would it buffer read calls, so you don't end up with 1000s of one-byte read syscalls?

tarm commented 7 years ago

Instances could also be represented as methods on the top level type. (I'm not familiar with how common they are and what their uses are).

For naming, I think go is preferred.

Is the generated code typically standalone (like the Unmarshal example I gave) or does it typically use metadata/reflection and runtime assistance from the Stream object? Obviously either is possible, but one is probably easier from the existing code generation structure. If reflection is typical, then it might be possible to make something pretty clean looking by putting all the metadata into field attributes that Go calls struct tags. That looks like this https://play.golang.org/p/mtGr8J_szY

os.File is not buffered. It is normally to wrap the io.File in a bufio.Reader though if you want to buffer. Unfortunately bufio.Reader does not have a Seek() method, so there may be a little bit more fiddling to do.

OK let me work more on hello world.

tarm commented 7 years ago

BTW, could you point me to what example generated code looks like? Hello world python for example. I see the unit test in the spec directory, but haven't seen generated code.

GreyCat commented 7 years ago

I've created https://github.com/kaitai-io/kaitai_struct_go_runtime — feel free to start runtime project there. Would you be ok with MIT license, as most our other runtimes use?

Is the generated code typically standalone (like the Unmarshal example I gave) or does it typically use metadata/reflection and runtime assistance from the Stream object?

If I understand properly, generally, there is no metadata/reflection usage inside generated code to maximize performance / efficiency. There is no "parsing state" (except for the state that Stream object provides), we try to minimize various conditions / checks.

If reflection is typical, then it might be possible to make something pretty clean looking by putting all the metadata into field attributes that Go calls struct tags.

That's an interesting feature, but that looks more like "an interpreter of provided metadata, encoded in tags", not "compiler". Generally, "compiler" approach is more efficient.

BTW, could you point me to what example generated code looks like? Hello world python for example. I see the unit test in the spec directory, but haven't seen generated code.

You can use ksc to compile it yourself, or even can use http://kaitai.io/repl/ to check out generation results. For Python, for example, it generates:

# This is a generated file! Please edit source .ksy file and use kaitai-struct-compiler to rebuild

import array
import struct
import zlib
from enum import Enum
from pkg_resources import parse_version

from kaitaistruct import __version__ as ks_version, KaitaiStruct, KaitaiStream, BytesIO

if parse_version(ks_version) < parse_version('0.7'):
    raise Exception("Incompatible Kaitai Struct Python API: 0.7 or later is required, but you have %s" % (ks_version))

class HelloWorld(KaitaiStruct):
    def __init__(self, _io, _parent=None, _root=None):
        self._io = _io
        self._parent = _parent
        self._root = _root if _root else self
        self.one = self._io.read_u1()
ghost commented 7 years ago

@tarm The following issues contain answers on some questions:

tarm commented 7 years ago

Thanks for the pointers! I've made a little progress and would like to share and get feedback:

Runtime stream library: https://github.com/tarm/kaitai_struct_go_runtime/blob/master/kaitai/stream.go

Hello world manual compile and a unit test for it: https://github.com/tarm/kaitai_struct_tests/tree/master/spec/go It looks like the other unit tests are derived from the ruby tests? I am planning to continue with a couple more manual conversions and unit tests, but let me know if those should be autogenerated.

I do not know what the "Kaitai Struct" base type/interface would need to be and would like to explore some examples to get a feeling.

I think next steps for me would be to manually convert a few more .ksy files that have more features. Any suggested progression from Hello World? Perhaps something with a parent reference and something with instances?

Are there plans to have an interface to write files from structures? Naively it seems like all the operations could just be done in reverse.

GreyCat commented 7 years ago

Ok, first I'm gonna answer questions on runtime API:

func (k *Stream) EOF() bool { // Not sure about this one. In Go, an io.EOF is returned as // an error from a Read() when the EOF is reached. EOF // handling can then be done like this: ...

Generally, there are two approaches to EOF handling in this world:

Given that we inevitably deal with languages that have both schemes, we had to choose a single common pattern here, and we've chosen the first one. Languages like C++ and Python, that rely on the second pattern, will thus have slightly more complex implementations - like this one in C++ or this one in Python. Probably we'll have to implement this with Go as well.

// Go's string type can contain any bytes. The Go range operator // assumes that the encoding is UTF-8 and some standard Go libraries // also would like UTF-8. For now we'll leave any advanced // conversions up to the user.

What's the difference between a string a []byte then? If there's some standard to use UTF-8, probably we'll just need to convert everything to UTF-8. It's not like we have many things that depend on exact string behavior right now, but generally it's expected that length method called on a string will return a number of characters, not bytes, reverse will reverse characters correctly, etc.

// FIXME what does this method do? func (k *Stream) ReadBitsArray(n uint) error {

Generally, it is supposed to read a bitset / bit array data type with given number of bits. See #112 for more details, it's right now in planning stage. Probably we'll use BitSet for Java, BitArray for .NET, std::bitset for C++, etc.

func (k Stream) ReadStrEOS(encoding string) (string, error) { func (k Stream) ReadStrByteLimit(limit int, encoding string) (string, error) {

I see that you've implemented older v0.6 API here — probably because I've referred you to some outdated documentation, and I'm truly sorry for that. Modern 0.7 API does not have distinct string reading methods, instead it uses byte array reading methods (read_bytes, read_bytes_full, read_bytes_term) + convert byte array to string method (bytes_to_str). Please take a look at any existing runtimes to see what's going on with byte arrays and strings there in meantime, and I'll try to bring all API specs up to date soon.

// FIXME what is group_size ? func ProcessRotateLeft(data []byte, amount int, group_size int) {

Generally, it is supposed to split given byte array into groups of group_size bytes, and rotate each of these groups left by a given amount of bits, wrapping on an edge of the group. Thus, for example, rotating 12 34 ab cd with amount of 4 and group_size of 2, should bring us 23 41 bc da:

12       34      |ab       cd
00010010 00110100|10101011 11001101
   /         /
  /         /
 /         /
00100011 01000001|10111100 11011010
23       41      |bc       da

Using default group_size of 1 will rotate everything within a single byte, thus yielding:

12       34      |ab       cd
00010010 00110100|10101011 11001101
   /         /
  /         /
 /         /
00100001 01000011|10111010 11011100
21       43      |ba       dc
GreyCat commented 7 years ago

-- strz what is the intention of all the arguments?

Generally, strz is deprecated in favor of read_bytes_term. Here's more or less typical implementation for languages that lack native "read until terminator" functionality. Arguments match .ksy keys closely, that is:

-- do the processing functions operate on the buffers in-place? I assumed so.

Normally, they operate at copies and return them, but it might be customized for particular target language.

In the abstract, I don't fully understand a seekable stream API. If it's seekable, then you must be able to rewind which means it's not a stream unless you've buffered it. If you're going to have to buffer it, then you might as well buffer the whole stream before processing. Go has good interfaces for streaming and good interfaces for seeking on buffered data, but not both at the same time.

It's clearly a terminology question ;) I see that Go's understand of "stream" term implies that it's something that is not seekable and you may be able only to rewind it with some extra pain (doing that buffering). In other languages (and more or less in KS), "stream" is used as a term that just describes an abstraction over generic reading / writing API that deals with files and in-memory byte arrays. Thus it's perfectly fine to have a "seekable stream": for example, Ruby's IO or Python's io work in that paradigm.

Is Seek() only needed for instances with positions?

Yes, it's more or less so.

Can you point to a .ksy that needs a Seek() method?

Sure, here's probably the minimal one: https://github.com/kaitai-io/kaitai_struct_tests/blob/master/formats/instance_std.ksy

For example, in Python it generates:

    @property
    def header(self):
        if hasattr(self, '_m_header'):
            return self._m_header if hasattr(self, '_m_header') else None

        _pos = self._io.pos()
        self._io.seek(2)
        self._m_header = (self._io.read_bytes(5)).decode(u"ASCII")
        self._io.seek(_pos)
        return self._m_header if hasattr(self, '_m_header') else None

I'm tempted to defer the Seek() implementation for a little while until I can explore how the generated code for those should look. Maybe there could be 2 APIs, streaming and seeking, and only .ksy's the need seeking would consume the seeking interface?

That's up to you to decide as Go expert ;) The general requirement is that parsing many real-life file formats (and even some network packets, like DNS) requires seeking. If you feel like providing several implementations (one that to be used when seeking is required and one that can be used for simpler types with no seeking) — sure, why not :)

The only language that might have benefited from such approach so far was probably Java (it has unseekable stuff like FileInputStream). I've discussed that with like a dozen of Java experts, and we've more or less came to a conclusion that it's generally unpractical to support it.

5k3105 commented 7 years ago

Happy to read along :) :+1:

GreyCat commented 7 years ago

@tarm I've started GoCompiler, and went over your rendition of hello_world.go.

I have a few questions:

  1. As far as I can see, Unmarshall needs mandatory stream argument s *kaitai.Stream. Normally, default stream is expected to be stored in a class, so it can be reused by instances. Instance attributes are normally supposed to be called exactly the same as sequence attributes, i.e. whether you have:
seq:
  - id: foo1
    type: u1
instances:
  foo2:
    pos: 1234
    type: u1

You can access both foo1 and foo2 the same way. In Java, for example, it's r.foo1() and r.foo2(), in Python, it's r.foo1 and r.foo2, etc.

So, shall we add some field (traditionally we call it _io, but Go would probably have slightly different idea on identifier naming) into a type, and store s *kaitai.Stream argument into that field in Unmarshall?

  1. This HelloWorld type completely ignores references to root and parent objects. Of course, they are not really needed for HelloWorld, but probably we don't want to drop that support anyway — who knows, maybe someone would want to include HelloWorld type in some other type?

Normally, we allow to supply root and parent arguments in a class constructor. Given that Unmarshall is effectively a constructor, may be we should add root and parent arguments there as well, and store them in a type?

matttproud commented 7 years ago

Is this anything you would want some help with? I stumbled across this project on a whim after having implemented decoders for a bunch of legacy game file formats in Go. I would be very interested in avoiding future tedium by defining future data spec in Kaitai, and letting a code generated codex do the rest of my work.

Is it possible to author a per-language emitter as a plugin from IDL in a native language à la Protocol Buffers?

GreyCat commented 7 years ago

@matttproud Please follow the discussion in this issue. I've asked some questions and I'm still waiting for Go expert opinions on them. It's not like I can move this whole thing alone. Unfortunately, it looks like that we've lost original proposers and experts @5k3105 and @tarm :(

Is it possible to author a per-language emitter as a plugin from IDL in a native language à la Protocol Buffers?

If I understood you correctly, this is already like that. All we need to write to support a certain target language is ThatLanguageNameCompiler class (which is mostly a template that gets filled with values to generate KS output) and ThatLanguageNameTranslator class (which is also more or less the template, but used for expression language).

cugu commented 7 years ago

Usually Unmarshal should be something like:

func Unmarshal(data []byte, ks *KaitaiStruct) error {
    …
}

See protobuf, json or asn

The streaming api could be something like:

byteStream := []byte{0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff}
reader := kaitai.NewReader(byteStream)
dec := kaitai.NewDecoder(reader)

kaitaiRootStruct := FooBar{}

err := dec.Decode(&kaitaiRootStruct)
if err != nil {
    log.Fatal(err)
}

Similar in json

That way the decoder handles the stream and the relations (parent, root) etc. This Go way pretty much contradicts the Kaitai way of storing the stream and relations in every object.

cugu commented 7 years ago

To just add relations (IO, parent, root) to @tarm 's hello world see https://github.com/cugu/kaitai_struct_go/blob/master/hello_world.go and https://github.com/cugu/kaitai_struct_go/blob/master/struct.go

GreyCat commented 7 years ago

@cugu My main question is whether Unmarshall is strictly required, or it is just a convention, and how heavily it is enforced / frowned upon if we'll do it slightly differently? As far as I understand, it's really just a static method that fills in a pre-allocated structure.

The problem is that KS generally is not that simple, and you don't have a luxury of parsing anything in one function and just going in with approach "I'll call this one can and after that I'm guaranteed to have all the fields of structure X filles in".

There is lazy parsing, and it is especially important for "parse instances": it is used to parse multi-terabyte file systems without loading everything into memory at once. Normally, you just generate something like that in languages like Java:

    public IndexEntry index() {
        if (this.index != null)
            return this.index;
        long _pos = this._io.pos();
        this._io.seek(indexOffset());
        this.index = new IndexEntry(this._io, this, _root);
        this._io.seek(_pos);
        return this.index;
    }

Note that it makes heavy use of this._io, i.e. object remembers a reference to a stream from which data should be acquired.

I see that you've proposed to add Metadata member and type to store that information. It will probably be ok with IO (although probably the correct naming is _Metadata in this case, so it won't clash with any existing member named Metadata). How would it handle parent relationships, as usually parent data type is not just "Struct", but inferred in ksy compile time to be something more specific, i.e., for example, how would you compile nav_parent.ksy.

tarm commented 7 years ago

@GreyCat Thanks for all the background and suggestions. It's very helpful. Sorry for being MIA, $DAYJOB has kept me busy. I have been thinking about it though!

Sorry, I have not yet updated the stream api with your comments, but the comments and changes to the stream API seem straightforward.

Go does not have python's __attr__ so the way to make seq consistent with instance is to either have getter accessor methods everywhere or have all fields be flat.

As @cugu mentioned, there is definitely a preference in Go to have data types be flat and add functionality using methods. That is different than the standard KS way. Obviously that does not work for terabyte filesystem parsing.

However, if you have to parse a terabyte filesystem, you could still use the auto generated code but only call the "inner" unmarshal functions and manually write code to jump or iterate through as needed for your business logic.

As a Go user, my preference would be to have the Unmarshal() method and not embed the KS struct into each object.

I've been thinking more about the parent and root references as well. (This suggestions presumes unmarshal). You could pass in the needed leaf fields into Unmarshal() as extra arguments. That means more bookkeeping for the code generator because all the intermediate Unmarshal functions would also need those arguments, but would be easy to understand and satisfy the type checker. I just pushed an example for nav_parent: https://github.com/tarm/kaitai_struct_tests/blob/master/spec/go/nav_parent.go

Obviously that is also different from the standard KS way as well. Go is statically typed enough to make the parent references a little painful otherwise. Doing it the more typically KS way would be easy with root, but trickier with parent references if a type could have parents of different types. How is that handled in the generated code for other statically typed languages? Is the .ksy format statically typed enough for that to not be a problem?

Finally, in nav_parent, I had to add an int() conversion to go from a uint32 field to a length in ReadBytes(). Should ReadBytes take uint64 as an argument? Does the KS compiler add the appropriate conversions for references like that? Does the KS compiler add overflow checking before conversions?

GreyCat commented 7 years ago

Go does not have python's __attr__ so the way to make seq consistent with instance is to either have getter accessor methods everywhere or have all fields be flat.

Ok, I start to understand that Go is a pretty low-level language, as I go ;) It's not terribly hard to do these distinctions automatically for generated code (i.e. foo.bar.baz being translated to Foo.Bar.Baz or Foo.Bar().Baz or Foo().Bar.Baz() or whatever's needed), but then again, it's not super user-friendly, I guess. Given that getter methods add no overhead, I guess we could probably generate them just in case (or at least start with that approach and tune it afterwards with CLI switches).

As a Go user, my preference would be to have the Unmarshal() method and not embed the KS struct into each object.

It is possible, but it would basically require end-user to keep references to root, parent and io somewhere, and pass it along to every instance attribute call, i.e. Foo.Bar(io, parent, root).Baz(io, parent, root), etc. Also, technically it's possible to implement both styles of code generation and use some CLI switch to choose between them. If are not 100% opposed to it, I'd start with more "traditional" object-oriented approach, i.e. storing these references in generated structs.

How is that handled in the generated code for other statically typed languages? Is the .ksy format statically typed enough for that to not be a problem?

Parent type is inferred in pre-compile-time in ksc, and is known since then. For example, in Java, nav_parent.ksy is compiled to:

public class NavParent extends KaitaiStruct {
    public NavParent(KaitaiStream _io, KaitaiStruct _parent, NavParent _root) { /* ... */  }
    public static class HeaderObj extends KaitaiStruct {
        public HeaderObj(KaitaiStream _io, NavParent _parent, NavParent _root) { /* ... */ }
    }
    public static class IndexObj extends KaitaiStruct {
        public IndexObj(KaitaiStream _io, NavParent _parent, NavParent _root) { /* ... */ }
    }
    public static class Entry extends KaitaiStruct {
        public Entry(KaitaiStream _io, IndexObj _parent, NavParent _root) { /* ... */ }
    }
}

Note that both _parent and _root types vary in all these classes, and in most cases it is not KaitaiStruct, but some more specific class that inherits from KaitaiStruct.

Finally, in nav_parent, I had to add an int() conversion to go from a uint32 field to a length in ReadBytes(). Should ReadBytes take uint64 as an argument? Does the KS compiler add the appropriate conversions for references like that? Does the KS compiler add overflow checking before conversions?

By design decision, KS expression language assumes that there's only a single one-size-fits-all integer type in a language (which is not that far from being true in many cases) and doesn't track complex integer type implicit casts, which are very different in various languages.

Implementation of stuff like ReadBytes() thus can be one of:

There's no overflow checking everywhere, and this is almost mostly by design. Current approachin vast majority of existing situations and we don't really aim to implement 100% compliant exact emulator of some particular language in all other languages. We just say that it's "undefined behavior" (which is pretty much the standard excuse in C/C++ standard, so we're not alone here either).

LogicAndTrick commented 7 years ago

It is possible, but it would basically require end-user to keep references to root, parent and io somewhere, and pass it along to every instance attribute call, i.e. Foo.Bar(io, parent, root).Baz(io, parent, root), etc.

Off topic but if I recall correctly, this is the major choke point I had with Rust support in #22 (possibly for a slightly different reason than with Go). Storing the parent/root references was very hard because of Rust's "you can only have one mutable reference, or multiple immutable references, but never both at once" philosophy. If we could go with this end-user-controlled approach, I believe that I could continue working on Rust support. There'd still be some pretty hard limitations when dealing with instances but it'd be better than nothing at all. Do you think we could make this method an alternate option for languages where the "traditional" KS way doesn't fit into the target language very well?

GreyCat commented 7 years ago

@LogicAndTrick Probably it's possible, but note that this could quickly get quite messy. Even if you have only a single stream, and straightforward chain of parenting, in real life with foo.bar.baz you'll end up with something like

this.foo(io, this, this).bar(
    io, this.foo(io, this, this), this
).baz(
    io, this.foo(io, this, this).bar(
        io, this.foo(io, this, this), this
    ), this
)

Pardon my Java-esque pseudo-code, hope it's understandable. If you're creating substreams, or have something less trivial for parenting chain, it would quickly get much uglier.

LogicAndTrick commented 7 years ago

Hmm, yeah. It would kinda only work for very simple cases. The compiler would probably get much more complex if it had to manually track parent/root references everywhere...

GreyCat commented 7 years ago

Compiler probably won't be much more complex (it already does that tracking), but generated code probably would be, and, if I'm not mistaken, recalculating all that stuff every time is not exactly performance-savvy as well.

LogicAndTrick commented 7 years ago

Hm, it probably won't work anyway, because you can't really do stuff like _parent._parent when each object doesn't have a reference to its own parent... You can't really expect the caller (or compiler?) to pass in every parent of an object that might be 10 levels deep in the hierarchy :/

cugu commented 7 years ago

My example of the nav_parent:

Some notes:

  1. I guess members need to be functions to enable lazy parsing. This results in longer source code, but should be fine otherwise.
  2. ._Parent() calls are casted to the correct object.
GreyCat commented 7 years ago

Thanks for everyone's input. Looks like I've managed to hack up a very early compiler that compiles hello_world.ksy and run-go script in the tests repo that runs the tests:

ksc is supposed to be invoked with --go-package test_formats — note that compiler creates src/$GO_PACKAGE_NAME/$TOP_LEVEL_TYPE.go file, i.e. path with the directory structure.

I'll proceed with getting nav_parent to work as well.

GreyCat commented 7 years ago

Ok, I've done some experiments with Go, trying to get either nav_parent or floating_points to work, and here's what I've came up with so far.

Go is not "everything is expression" language

This is actually pretty bad, as everything we've done previously so far fit "everything is expression" paradigm. In Go, given the lack of exceptions, basically every step of expression calculation needs to be checked and extra if-return statements added. For example, something very crude like foo.bar(123).baz(456), becomes big in Go:

if tmp1, err := foo.bar(123); err != nil {
        return err
}
if tmp2, err := tmp1.baz(456); err != nil {
        return err
}
// and now we must proceed with tmp2

This is not only about expressions language, this also affects regular parsing statements. While is all other languages we clearly separate "parsing as expression" and "assignment" steps, i.e. for simple assignments we do:

this.foo = $X;

where $X can be as simple as:

_io.readU1()

or complex, such as

new String(this._io.readBytes(16), Charset.forName("UTF-8"))

And we can reuse exactly that expression for other situations like adding an element to array:

this.foo.add($X);

It turns out that all the parsing statements would be much more complex in Go. For example, in Java:

// Parsing + assignment of a single field
this.foo = this._io.readU1();
// Parsing + assignment of an array element
this.foo.add(this._io.readU1());
// Parsing of a string
this.foo = new String(this._io.readBytes(16), Charset.forName("cp437"));
// Parsing of a string array element
this.foo.add(new String(this._io.readBytes(16), Charset.forName("cp437")));

In Go, it turns out :

// Parsing + assignment of a single field
if this.foo, err = this._io.ReadU1(); err != nil {
    return err
}

// Parsing + assignment of an array element
if this.foo[i], err = this._io.ReadU1(); err != nil {
    return err
}

// Parsing of a string
tmp1, err := this._io.ReadBytes(16)
if err != nil {
    return err
}
this.foo, err := some_way_to_convert_byte_array_to_utf8_string(buf, "cp437")
if err != nil {
    return err
}

// Parsing of a string array element
tmp1, err := this._io.ReadBytes(16)
if err != nil {
    return err
}
this.foo[i], err := some_way_to_convert_byte_array_to_utf8_string(buf, "cp437")
if err != nil {
    return err
}

Difficulties with user structures management

As far as I see, user structure can be declared either as foo UserType or foo *UserType. In C/C++, this roughly equivalents to allocation on stack and on heap, in Go things seem to be somewhat more tricky. In KS-generated code in C++, we use heap-allocated structures exclusively, due to the fact that:

As far as I understand, doing stuff like

type MyStruct struct {
    foo Substruct
}

effectively preallocates structure foo inside MyStruct, which is bad due to reasons stated above (or, at the very least, it does not fit all the use cases). "Reading" of such substructure is then a matter of calling Read method on it (note, however, that this is yet another variation of parsing syntax that doesn't fit "everything is just expression" paradigm):

if err = this.foo._Read(this._io, this._parent, this._root); err != nil {
    return err
}

However, if we do:

type MyStruct struct {
    foo *Substruct
}

then we need to construct foo as well, this is yet another separate statement, and yet another syntax:

this.foo = new(MyStruct)
if err = this.foo._Read(this._io, this._parent, this._root); err != nil {
    return err
}

I don't quite get it yet how construction of arrays work in Go as well. @cugu provided this in his example:

type IndexObj struct {
    entries []Entry
}

// ...
o.entries = make([]Entry, header.QtyEntries())

I guess it preallocates not only space for pointers, but also allocates space for elements itselves as well. What about repeat-until or other repetitions with no known limit beforehand (so they need to grow)? Shouldn't we use []*Entry (is that even possible?), and allocate space for structures using new() manually?

GreyCat commented 7 years ago

More thoughts: as instances for non-trivial calculations that might throw an exception this means that methods that implement instance must return 2 values: actual result and err. This again makes it pretty hard to unify with sequence attributes getters, which are actually known to return no error. The choices are:

  1. Make everything return (result, err). This would make sequence attributes access a real pain, i.e.:
func (this *FloatingPoints) SingleValueBe() (r float32, err error) {
    return this.singleValueBe, nil
}
  1. Remove sequence attributes accessors, access them directly:
type FloatingPoints struct {
    SingleValueBe float32
}

This leads to distinction between accessing seq and instance attributes, i.e. foo.Bar vs foo.Bar() + do error checks.

  1. Do two types of getters (seq attributes return value, instance attributes return value + err) — pretty much pointless, if we'll get different methods to access stuff, then we might as well just go with (2).

Ideas, comments?

ghost commented 7 years ago

@GreyCat Can we just exit in case of error? Something like this:

package main

import (
    "errors"
    "fmt"
    "os"
)

func (foo *Foo) Bar(v int64) *Foo {
    fmt.Printf("Bar() called with %d\n", v)

    // Do something if error, exit with error code
    /*
    err := errors.New("Something was wrong")

    if nil != err {
        fmt.Fprintf(os.Stderr, "error: %v\n", err)
        os.Exit(1)
    }
    */

    return foo
}

func (foo *Foo) Baz(v int64) {
    fmt.Printf("Baz() called with %d\n", v)
}

type Foo struct {
}

func main() {
    foo := Foo{}
    foo.Bar(123).Baz(456)
}
GreyCat commented 7 years ago

I believe it's considered a bad practice in Go. And, even if we'd opted to exit / panic on every error, this won't help a lot anyway: we have to process lots of existing library calls that do return error as second return value anyway, so it doesn't really matter if we do if check + do os.Exit or we do if check + do return. It is still very different from all other languages, and looks pretty close to proper C.

koczkatamas commented 7 years ago

IMHO os.Exit in a 3rd-party library is a very bad idea generally.

koczkatamas commented 7 years ago

Meanwhile: sorry I did not read through the whole thread, but would that be a good solution if KaitaiStream use panic instead of errors and only the outermost class would recover it?

It's a bit hacky, but it's something similar to exceptions in other languages and could make the code much simpler (although I am not sure I understand fully how it works).

agnivade commented 7 years ago

Apologies for butting in to the thread. I have been following it for a long time. Seems like you guys are doing some great progress. Just wanted to comment a bit on this -

In Go, given the lack of exceptions, basically every step of expression calculation needs to be checked and extra if-return statements added

and

It turns out that all the parsing statements would be much more complex in Go.

Well, not necessarily. Go author Rob Pike has wrote an article which handles this case quite elegantly. Not sure, if you have read it or not, but I think his article fits wonderfully to what we are trying to achieve here.

ghost commented 7 years ago

Here are some links which I found useful related to panic and error handling:

GreyCat commented 7 years ago

Actually, let's talk about expressions in Go in general. For example, consider we're compiling a value instance like that:

meta:
  id: example
seq:
  - id: buf
    type: u4le
    repeat: eos
instances:
  foo:
    value: (buf[1].to_s + "123").to_i

This ksy normally reads everything it can into array of 4-byte unsigned integers named buf. foo instance is calculated as following:

For reference, this expression is compiled as following in Java:

Long.parseLong(Long.toString(buf().get((int) 1), 10) + "123", 10)

in Python:

int(str(self.buf[1]) + u"123")

in Ruby:

(buf[1].to_s(10) + "123").to_i

or in C++:

std::stoi(kaitai::kstream::to_string(buf()->at(1)) + std::string("123"))

As far as I understand, proper by-statement Go rendition of this expression would be:

tmp1 := buf[1]
// no check needed, because accessing non-existent slice members is panic
tmp2 := strconv.Itoa(tmp1)
tmp3 := tmp2 + "123"
tmp4, err := strconv.Atoi(tmp3)
if (err != nil) {
    return 0, err
    // note that we need to understand resulting type to generate
    // valid "no answer" first return argument; just returning "nil, err"
    // is not valid :(
}
foo = tmp4
return foo, nil

Of course, when no checks are needed (i.e. when step is know to be non-failing or it uses panic to fail), these parts can be merged into something resembling at expression, i.e.:

tmp4, err := strconv.Atoi(strconv.Itoa(buf[1]) + "123")
if (err != nil) {
    return 0, err
}
foo = tmp4
return foo, nil

Probably even tmp4 and foo can be merged, but that's another optimization. If we'll be doing this, we'll start from "full" statement-by-statement form, I guess, and try to reduce it later.

All this stuff really reminds me of C, which is technically good, as it brings us closer to C implementation as well...

GreyCat commented 7 years ago

Meanwhile: sorry I did not read through the whole thread, but would that be a good solution if KaitaiStream use panic instead of errors and only the outermost class would recover it?

It will only help in one area, i.e. it will make instance attributes vs seq attributes getters' signatures more alike, but it won't help at all with all the existing library functions that work with returning error codes instead of using panic. We'll still have to break expressions as I've illustrated step-by-step and insert these if-return (or if-panic) on every occasion.

GreyCat commented 7 years ago

@agnivade

Well, not necessarily. Go author Rob Pike has wrote an article which handles this case quite elegantly.

As far as I understand from reading the article, it basically promotes "store error state somewhere in object and check in the very end of process" semantics. I don't really get how it would help with existing stdlib functions, which are already implemented in "return value-error pair" paradigm.

agnivade commented 7 years ago

@GreyCat - Existing stdlib functions will remain the same. Its only useful when you call those functions repeatedly. Then you can wrap the error handling and state in a struct and call your functions. Maybe its not applicable here, but something to keep in mind, just in case.

If we'll be doing this, we'll start from "full" statement-by-statement form, I guess, and try to reduce it later.

:+1:

GreyCat commented 7 years ago

Ok, 3 days and several dozens of kilobytes code later, I've done a major revamping of translator system and GoCompiler. Now hello_world and floating_points compile and test successfully. I guess everyone's interest in Go target faded away (as nobody still answered most of my questions), but I'll try to highliught some of the newly introduced things anyway ;)

func (this *FloatingPoints) DoubleValuePlusFloat() (v float64, err error) {
    if (this._f_doubleValuePlusFloat) {
        return this.doubleValuePlusFloat, nil
    }
    this.doubleValuePlusFloat = float64((this.DoubleValue + 0.05))
    this._f_doubleValuePlusFloat = true
    return this.doubleValuePlusFloat, nil
}

This is trivial value calculating instance, without any early bailouts and error checks. For compatibility, all instances follow the same pattern and return (value, err), even those who have no way to return error even in theory.

    singleValuePlusFloat, err := h.SingleValuePlusFloat()
    if err != nil {
        t.Fatal(err)
    }
    assert.InDelta(t, 1.0, singleValuePlusFloat, delta)

I'll try to take on some of other tests as well now.

GreyCat commented 7 years ago

Done str_encodings_test as well, instances support is on the way.

This one requires BytesToStr(...) runtime library function, so I've added it:

func BytesToStr(in []byte, decoder *encoding.Decoder) (out string, err error) {
    i := bytes.NewReader(in)
    o := transform.NewReader(i, decoder)
    d, e := ioutil.ReadAll(o)
    if e != nil {
        return "", e
    }
    return string(d), nil
}

@tarm Could you please confirm that you're ok to publish your runtime under MIT license, so I can put it into "proper" kaitai-io location, add READMEs and LICENSE, and continue working on that in public?

GreyCat commented 7 years ago

Ok, last one for today: I've completed position_abs_test as well.

It required ReadBytesTerm(...) runtime library function, a very crude one is like that:

func (k *Stream) ReadBytesTerm(term byte, includeTerm, constumeTerm, eosError bool) ([]byte, error) {
    r := bufio.NewReader(k)
    return r.ReadSlice(term)
}

Right now it fails with:

=== RUN   TestPositionAbs
--- FAIL: TestPositionAbs (0.00s)
        Error Trace:    position_abs_test.go:32
        Error:          Not equal: 
                        expected: "foo"
                        received: "foo\x00"

which is totally understandable.

This one actually is a pretty complex one, as it demonstrates parse instances fully working and bytes-to-string encoding process.

Ok, now we've got 4 tests, so I'd propose to stop further development of ksc GoCompiler for now, review everything by those who are interested in Go target and solve some infrastructure / system issues, i.e.:

5k3105 commented 7 years ago

I think he means wrapping the stdlib functions to handle errors like this. How many functions from go libs will we be using? Say there are 20, these would all be wrapped to handle errors in an 'error log' way. Then you would call only the wrapped functions insted of stdlib directly.

Error handling is a sore point with many in the go community.

Sorry I have not been around because I have been overtime @ work for the past 2 months. Still would like to see this completed - but I think those who've responded here are probably more experienced with go than me :)