kaitai-io / kaitai_struct_visualizer

Kaitai Struct: visualizer and hex viewer tool
https://rubygems.org/gems/kaitai-struct-visualizer
GNU General Public License v3.0
284 stars 25 forks source link

Invalid byte sequence crashes ksdump 0.7 #61

Open ross-spencer opened 1 year ago

ross-spencer commented 1 year ago

I an seeing ksdump crash with the following:

ksdump prescription-sample-1.0-le.eygl kaitai/eyeglass_little_endian.ksy 
Compilation OK
... processing kaitai/eyeglass_little_endian.ksy 0
...... loading eyeglass_format.rb
Classes loaded OK, main class = EyeglassFormat
/usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:268:in `visit_String': invalid byte sequence in UTF-8 (ArgumentError)
    from /usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:136:in `accept'
    from /usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:330:in `block in visit_Hash'
    from /usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:328:in `each'
    from /usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:328:in `visit_Hash'
    from /usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:136:in `accept'
    from /usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:118:in `push'
    from /usr/lib/ruby/3.0.0/psych.rb:513:in `dump'
    from /usr/lib/ruby/3.0.0/psych/core_ext.rb:13:in `to_yaml'
    from /var/lib/gems/3.0.0/gems/kaitai-struct-visualizer-0.7/bin/ksdump:122:in `<top (required)>'
    from /usr/local/bin/ksdump:25:in `load'
    from /usr/local/bin/ksdump:25:in `<main>'

Example file.

Example ksy.

Via: https://github.com/ross-spencer/eyeglass/tree/main

I'm not clear on if this is a user error, or if there's a newer version of ksdump I should be looking at?

GreyCat commented 1 year ago

@ross-spencer, thanks for reporting this!

It looks like in this specific example, the caveat is with the fact that Ruby UTF-8 parsing is quite "forgiving" in terms of getting byte sequences which are not valid UTF-8 characters as strings, e.g.

[.] magic = "\xBB\r\neyeglass\u001A\n\xAB" [.] eof = "\xBBeof"

This backfires when trying to use standard YAML or JSON libraries trying to dump these, though, as they give up.

To be honest, I'm not sure what behavior we'd want in this case. Probably reporting a more precise error should be a good idea (e.g. down to a specific attribute path). Showing that string is not a valid UTF-8 in ksv might be a good idea too, but then ksv is supposed to be "all forgiving", and, it actually is right now.

The simplest fix on .ksy side would be removing type: str and encoding: utf-8. This yields raw bytes rendition:

axis_right_left:
- 130
- 80
base_right_left:
- 0.0
- 0.0
cylinder_right_left:
- -0.25
- -1.0
datetime: '2012-11-08T12:37:50'
distance_acuity_right_left:
- 0.6600000262260437
- 0.5
endianness: 1
eof: BB 65 6F 66
format_expansion_room: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00
magic: BB 0D 0A 65 79 65 67 6C 61 73 73 1A 0A AB
near_acuity_right_left:
- 12
- 12
next_checkup_years: 1.0
observation9: "Patient's eyesight needs correction. History of diabetes in family
  but indicators found. Standard checkup interval recommended.\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
prism_right_left:
- 0.0
- 0.0
purpose: "Distance and Close Work.\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
sphere_right_left:
- -3.3499999046325684
- 0.5
version: '01'
GreyCat commented 1 year ago

From ksdump's side, I can propose several solutions. While doing recursive traversal, check for string validity (e.g. value.valid_encoding?). Then, if we detect the problem:

Thoughts?

ross-spencer commented 1 year ago

@GreyCat thanks for that. I see where my assumptions were wrong now. I took your approach to remove the offending str and encoding, but then decided I could get a lot of value out of turning the magic into a type which gives me some more granular output. I'll need to see what that does for me if using kaitai to auto-generate parsing code.

ksdump has been useful for debugging here, I'll reflect on that in my writeup. From my perspective, out of the potential changes, then:

Note where this is in the tree and report the error, pointing to that location

Would be a useful starting place, as for string substitution, or dumping raw bytes, then they seem nice options, but providing they didn't obscure the user error in any way, i.e. it was still clear that the ksy file contains a potentially invalid instruction/the input file contains some unexpected data.