jpd236 / kotwords

Collection of crossword puzzle file format converters and other utilities, written in Kotlin.
Apache License 2.0

Generalize mapping of non-Cp1252 characters to Cp1252 #2

Closed: jpd236 closed this issue 3 years ago

jpd236 commented 5 years ago

Formats other than Across Lite (e.g. JPZ) use different encodings. Many characters that these encodings can represent have no equivalent in the more limited Cp1252 charset.

So far there have been ad-hoc workarounds, such as mapping the Unicode star to an asterisk for JPZ puzzles (see e22f2aba85bb83e24b135c58c05b605a268ab65c), or, in an upcoming commit, replacing ł with a plain l for PuzzleMe puzzles.

Instead, we should centralize this logic as part of the Charset encoding process. When either validating (in Crossword#requireEncodableString) or actually encoding (in AcrossLite#writeNullTerminatedString), we should handle unmappable-character errors by first attempting a substitution. Accented characters could be decomposed into base character + combining accent using Java's Normalizer class, with the accent then stripped out. We just need to take care not to strip all accented characters, as many are still encodable in Cp1252; we only want to strip those which cannot be represented.
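
A rough sketch of what that substitution step might look like, not the actual kotwords implementation. The function name substituteUnencodable and the MANUAL_SUBSTITUTIONS table are hypothetical. Only java.text.Normalizer and CharsetEncoder#canEncode are real JDK APIs. One caveat: ł has no canonical decomposition, so NFD normalization alone would not turn it into l; the sketch assumes a small hand-written table for such cases.

```kotlin
import java.nio.charset.Charset
import java.text.Normalizer

private val CP1252: Charset = Charset.forName("windows-1252")

// Hypothetical hand-picked fallbacks for characters that NFD cannot decompose.
private val MANUAL_SUBSTITUTIONS = mapOf(
    "★" to "*",
    "ł" to "l",
    "Ł" to "L",
)

// Matches the combining marks (accents) left behind by NFD decomposition.
private val COMBINING_MARKS = Regex("\\p{Mn}+")

// Hypothetical helper: replaces only those characters the charset cannot encode.
fun substituteUnencodable(text: String, charset: Charset = CP1252): String {
    val encoder = charset.newEncoder()
    val result = StringBuilder()
    var index = 0
    while (index < text.length) {
        // Walk by code point so surrogate pairs are handled as one unit.
        val codePoint = text.codePointAt(index)
        val original = Character.toChars(codePoint).concatToString()
        index += Character.charCount(codePoint)
        result.append(
            when {
                // Already encodable (e.g. é in Cp1252): keep the accent.
                encoder.canEncode(original) -> original
                original in MANUAL_SUBSTITUTIONS -> MANUAL_SUBSTITUTIONS.getValue(original)
                else -> {
                    // Decompose into base + combining marks, then drop the marks.
                    val stripped = Normalizer.normalize(original, Normalizer.Form.NFD)
                        .replace(COMBINING_MARKS, "")
                    // If that still does not help, leave the character untouched
                    // so that the existing validation can reject it.
                    if (stripped.isNotEmpty() && encoder.canEncode(stripped)) stripped else original
                }
            }
        )
    }
    return result.toString()
}
```

With this sketch, "café" would pass through unchanged because é exists in Cp1252, while "Gdańsk" would become "Gdansk" because ń does not. Checking encodability per character before normalizing is what keeps the encodable accents intact.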

jpd236 commented 3 years ago

This may not be worth it with the introduction of UTF-8 support for PUZ files in https://github.com/jpd236/kotwords/commit/cf9b9c7a543c587026fb7210c5a1ee1c78f7b1b2.

The only reason we might want this is if we want to explicitly write v1.4 Across Lite instead of v2.0, e.g. because the .puz file is meant to be consumed by an app which doesn't support 2.0 yet. In this case, special characters are still substituted.