vatplanner / dataformats-vatsim-public

library for parsing and processing publicly accessible VATSIM data formats
MIT License
automatically detect character sets #6

Open dneuge opened 4 years ago

dneuge commented 4 years ago

Traditionally, data files contain a mixture of character sets. The parser expects input decoded as ISO-8859-1 because it is the most commonly used character set in data files and can be reinterpreted without loss: it is a plain single-byte 8-bit encoding in which every byte value is defined. Some client lines have already been encoded differently in the past, for example in UTF-8 or KOI-8.
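To illustrate why ISO-8859-1 input can be reinterpreted without loss, here is a minimal Python sketch. It is not code from this library; the function name `reinterpret` is hypothetical, and KOI8-R is only assumed to be the variant meant by "KOI-8".

```python
def reinterpret(text_latin1: str, target_charset: str) -> str:
    """Re-decode text that was read as ISO-8859-1 using another charset.

    Hypothetical helper for illustration only.
    """
    # ISO-8859-1 maps every byte 0x00-0xFF to the code point of the same
    # value, so encoding the text again restores the original bytes exactly.
    raw = text_latin1.encode("iso-8859-1")
    return raw.decode(target_charset)


# A line whose bytes were actually UTF-8 but were read as ISO-8859-1:
misread_utf8 = "Zürich".encode("utf-8").decode("iso-8859-1")
print(reinterpret(misread_utf8, "utf-8"))

# The same idea for a line whose bytes were KOI8-R (assumed variant):
misread_koi8 = "Радар".encode("koi8_r").decode("iso-8859-1")
print(reinterpret(misread_koi8, "koi8_r"))
```

The sketch only covers the reinterpretation step; deciding which target character set applies to a given line is the actual detection problem.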

As of April 2020, the mixture appears to have gotten worse: the whole data file is now also encoded in UTF-8, leaving multiple encodings layered on top of each other. This seems like a good time to take another look at automatic character set detection and reinterpretation, both at file level (restoring original encodings) and for individual lines (introducing compatibility with UTF-8, KOI-8, etc.).
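One possible way to undo such layering is to peel off UTF-8 layers as long as the result still decodes cleanly. The following Python sketch is a heuristic under stated assumptions, not the approach implemented in this library: the name `decode_layers` and the `max_layers` limit are hypothetical, and it assumes that each extra layer was produced by reading bytes as ISO-8859-1 and writing them out again as UTF-8. It does not detect KOI-8 or other single-byte encodings.

```python
def decode_layers(raw: bytes, max_layers: int = 3) -> str:
    """Peel off stacked UTF-8 layers; hypothetical helper for illustration.

    Falls back to ISO-8859-1 if the input is not valid UTF-8 at all.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: keep the traditional ISO-8859-1 interpretation.
        return raw.decode("iso-8859-1")

    for _ in range(max_layers - 1):
        try:
            # If every character fits into one byte and those bytes form
            # valid UTF-8 again, assume another layer was stacked on top.
            inner = text.encode("iso-8859-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # no further layer can be removed
        if inner == text:
            break  # pure ASCII, nothing left to peel
        text = inner
    return text


single = "Zürich".encode("utf-8")
double = single.decode("iso-8859-1").encode("utf-8")
legacy = "Zürich".encode("iso-8859-1")

print(decode_layers(single), decode_layers(double), decode_layers(legacy))
```

Because a genuine text could by coincidence consist of characters that also form valid UTF-8 when taken as bytes, this heuristic can over-decode in rare cases, so a real implementation would probably need additional plausibility checks.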

dneuge commented 4 years ago

Continue on branch feature_charsets.

dneuge commented 3 years ago

This might have been solved for JSON v3 but remains an issue when problematic fields are batch-processed in the legacy format.