RPM: not all strings are UTF-8

kaitai-io / kaitai_struct_formats

Kaitai Struct: library of binary file formats (.ksy)

http://formats.kaitai.io

RPM: not all strings are UTF-8 #672

Open armijnhemel opened 1 year ago

armijnhemel commented 1 year ago

In the current rpm.ksy, the encoding for strings is set to UTF-8. Some RPM files fail to parse because, as it turns out, not everyone has been playing nice with encodings.

An example is this file from Fedora Core 3:

https://archives.fedoraproject.org/pub/archive/fedora/linux/core/3/x86_64/os/Fedora/RPMS/bash-3.0-17.x86_64.rpm

One of the tags is a record_type_string_array holding the changelog entries, and some of those entries seem to be encoded in Latin-1 rather than UTF-8, for example:

Trond Eivind Glomsr\xf8d <teg@redhat.com> 2.0.5a-10

Currently record_type_string_array is defined as follows:

  record_type_string_array:
    params:
      - id: num_values
        type: u4
    seq:
      - id: values
        type: strz
        repeat: expr
        repeat-expr: num_values

and the default encoding is UTF-8, so this will obviously not work. I don't know how I could fix this.
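To make the failure concrete, here is a minimal Python sketch. It assumes that `\xf8` in the changelog line above stands for the single raw byte 0xF8; the variable names are only for illustration and nothing here comes from rpm.ksy or the Kaitai runtime.

```python
# The changelog entry quoted above, as raw bytes (assumption: \xf8 is one byte).
raw = b"Trond Eivind Glomsr\xf8d <teg@redhat.com> 2.0.5a-10"

# Strict UTF-8 decoding rejects the lone 0xF8 byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to a code point, so this always succeeds;
# 0xF8 becomes U+00F8 (LATIN SMALL LETTER O WITH STROKE).
as_latin1 = raw.decode("latin-1")

print(utf8_ok)
print(as_latin1)
```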

generalmimon commented 1 year ago

@armijnhemel:

and the default encoding is UTF-8, so this will obviously not work. I don't know how I could fix this.

Well, if there isn't a single character encoding we could specify in the .ksy, the "next best thing" is to downgrade to a byte array:

   record_type_string_array:
     params:
       - id: num_values
         type: u4
     seq:
       - id: values
-        type: strz
+        terminator: 0
         repeat: expr
         repeat-expr: num_values

A byte array is the implicit type in .ksy specs when no type is given but the field size is delimited by size, size-eos: true or terminator.
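As a rough illustration of what the byte-array variant yields, the following Python sketch reads `num_values` NUL-terminated byte strings from a stream. It is hand-written for this thread, not the code Kaitai Struct generates, and the function name `read_terminated_values` is hypothetical. It assumes the terminator is consumed and not included in each value, which matches the default `terminator` behaviour as far as I know.

```python
import io


def read_terminated_values(stream, num_values, terminator=b"\x00"):
    """Read num_values byte strings, each ended by a terminator byte.

    The terminator is consumed but not included in the returned values.
    No decoding happens here, so non-UTF-8 bytes pass through untouched.
    """
    values = []
    for _ in range(num_values):
        chunk = bytearray()
        while True:
            b = stream.read(1)
            if not b:
                raise EOFError("stream ended before terminator was found")
            if b == terminator:
                break
            chunk += b
        values.append(bytes(chunk))
    return values


# Two entries, the second one containing a Latin-1 byte.
data = io.BytesIO(b"first entry\x00Glomsr\xf8d\x00")
entries = read_terminated_values(data, 2)
```

The caller then gets plain `bytes` objects and can decide per value which encoding to try.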

armijnhemel commented 1 year ago

I had actually been thinking about that and looked at the docs, but they seem to indicate that terminator is only for strings. Using a byte array and then processing the strings in an external script would work for me.

armijnhemel commented 1 year ago

Thinking a bit more about this: probably this isn't a good idea, as \x00 can be part of a valid UTF-8 string.

armijnhemel commented 1 year ago

I found it easier to just work around it like this:

1. Parse regularly (this handles the vast majority of RPM files out there).
2. If that fails, reparse with a copy of the RPM specification that has the above change (byte array instead of strz).
3. Decode all the strings to valid UTF-8, trying some common encodings.

This is cleaner than trying to fix it here.
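The workaround above could be sketched in Python roughly as follows. Everything here is an assumption for illustration: `parse_strict` and `parse_bytes` are hypothetical callables standing in for parsers built from the regular rpm.ksy and from the byte-array copy, and the list of encodings is a guess at "some common encodings". Which exception a failed string decode raises, and when, depends on the target language and on whether the field is evaluated lazily, so the callables are expected to force parsing of the string fields.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return the first successful decode.

    latin-1 accepts any byte sequence, so with the default list this
    never falls through to the final replacement decode.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


def parse_with_fallback(path, parse_strict, parse_bytes):
    """Parse with the strict spec first, then with the byte-array spec.

    Returns (result, used_fallback). Both parsers are hypothetical
    callables supplied by the caller; they take a path and return
    whatever the corresponding parser produces.
    """
    try:
        return parse_strict(path), False
    except UnicodeDecodeError:
        return parse_bytes(path), True
```

With the byte-array result, each changelog entry can then be passed through `decode_with_fallback` to obtain a regular string.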