kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io
3.96k stars 192 forks source link

CalcIntType "int" might lead to loss of data (in Java) #510

Open ams-tschoening opened 7 years ago

ams-tschoening commented 7 years ago

Consider the following user type to represent a 6 byte unsigned integer:

u6le:
  seq:
    - id:   lsb
      type: u1
    - id:   middle
      type: u4
    - id:   msb
      type: u1

And the generated Java:

    private Integer value;
    public Integer value() {
        if (this.value != null)
            return this.value;
        int _tmp = (int) ((((msb() << 40) | (middle() << 8)) | lsb()));
        this.value = _tmp;
        return this.value;
    }
    private int lsb;
    private long middle;
    private int msb;

The cast is wrong in this case, isn't it? As far as I can see, the compiler maps the calculation to the type CalcIntType and that is mapped to int by convention and that not only in Java, but in other languages like C++ as well. In the most of my cases this is a reasonable choice, but the example above is one exception to the rule. :-)

So shouldn't CalcIntType be changed to long just to be sure, at least in Java?

Best would be of course that the compiler could check all parts of the calculation and derive the type from the largest, but currently it doesn't seem able to do so.

On the other hand, long might waste memory in some cases. But because of AutoBoxing used pretty often in KS, this ins't necessarily too bad. Callers could simply work with the already existing objects instead of primitive types. Forwarding the already available objects from KS shouldn't have any bad influence on memory footprint at all.

GreyCat commented 7 years ago

As far as I remember, the issue with CalcIntType being mapped to int in the first place was because you can't do stuff like arrayList.get(x) and byteArray[x] where x is long. Need to double-check that.

ams-tschoening commented 7 years ago

And you are correct with that, as can be seen in my pull request where I needed a workaround for the CTOR of ArrayList as well. A bit surprising that I don't have such a breaking case in my current KSY, so most of the times recognizing int seems to work and might make that a non-problem.

The following could easily be changed:

override def doSubscript(container: expr, idx: expr): String = s"${translate(container)}.get((int) ${translate(idx)})"

to

override def doSubscript(container: expr, idx: expr): String = s"${translate(container)}.get(Long.valueOf(${translate(idx)}).intValue())"

For small indices this should even be optimized by Java to not create a new Long object.

Don't find the byte[...] part currently...

ams-tschoening commented 7 years ago

I wonder if accessing byte[x] is even implemented? Can't find an example and if I do the following:

instances:
  dummy:
    value: "[1, 2, 3, 4, 5]"
  dummy_0:
    value: dummy[0]

I get an exception in BaseTranslator already:

Exception in thread "main" io.kaitai.struct.translators.TypeMismatchError: unable to apply operation [] to CalcBytesType at io.kaitai.struct.translators.BaseTranslator.detectType(BaseTranslator.scala:311) at io.kaitai.struct.TypeProcessor$$anonfun$deriveValueType$1.apply(TypeProcessor.scala:76) at io.kaitai.struct.TypeProcessor$$anonfun$deriveValueType$1.apply(TypeProcessor.scala:69) at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428) [...]

Changing one of the values to a value larger than 255 works instead, e.g. in Java ArrayList is used and the index is accessed using get.

GreyCat commented 7 years ago

Yeah, looks like I forgot to implement that one :( We need more tests for that (even simple [1, 2, 3, 4, 5][2] should work) and probably implementations in all the languages.