transpect / docx2hub

Converts Microsoft docx to flat hub XML
BSD 2-Clause "Simplified" License

possibly problematic mapping changes, for ex. Wingdings F0E0→U+1F86A #9

Closed (gimsieke closed 7 years ago)

gimsieke commented 7 years ago

Wingdings F0E0 used to be mapped to a plain right arrow, U+2192. This was done because in most cases authors use these similar glyphs inconsistently: they don’t care whether they select a right arrow from Symbol or from Wingdings. In some cases, such as differently sized box letters, the differences matter. But in most cases, the more or less fancy arrows, boxes, and circles of Wingdings should be converted to the most common Unicode symbol. In the case of F0E0, this is → rather than 🡪.

The newly introduced mapping is highly problematic for our doc→docx conversions, which use Cambria for the mapped glyphs by default (unless declared otherwise; in the linked case, Segoe UI Symbol is used instead of Cambria). In the case of 🡪 U+1F86A, the glyph doesn’t seem to exist in the default fonts that all users of recent MS Office versions have installed. Most of these mappings were not made for purity; they were made for legacy doc file migration. All the replacement font instructions have been eliminated.

We cannot use the new mappings in production. You need to introduce a mapping representation that allows us to map either to modern MS Office fonts or to the exact Unicode match (if available).

This is a very sensitive area. We only have poor and accidental test coverage for the mappings that are used in doc→docx conversions. Therefore the new mappings will be used in conversions because they appear to be compatible with the test system. We need to fix this very quickly, or roll back to the old mappings, create a branch for the new mappings, and add an option to select MS Office font compatibility mappings.

mkraetke commented 7 years ago

This issue was resolved by commit 62999260c2f8055087e32aef5de40ca86c1efe5a. The main pipeline docx2hub.xpl allows the parameter charmap-policy to be set. For example, if the value of the parameter is mycharmap, the pipeline looks for an attribute named @char-mycharmap in the fontmap. This attribute takes precedence over the value of the @char attribute. If no @char-mycharmap attribute exists, the @char attribute is used as a fallback. In the previous mappings this affected only some Wingdings barb arrows and the diamond operator in Symbol.
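
The lookup order described above can be illustrated with a small sketch. This is not the pipeline's actual code (docx2hub is implemented in XProc/XSLT); it only mimics the fallback rule in Python. The fontmap fragment, the element names symbols and symbol, the number attribute, the policy name msoffice, and the helper functions are all assumptions made for illustration; only the char and char-<policy> attribute naming comes from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fontmap fragment. Element names, the "number" attribute and the
# policy name "msoffice" are assumptions; the mapped values are illustrative.
FONTMAP = """
<symbols>
  <symbol number="F0E0" char="&#x1F86A;" char-msoffice="&#x2192;"/>
  <symbol number="F0FC" char="&#x2714;"/>
</symbols>
"""

def resolve_char(symbol, policy=None):
    """Prefer the char-<policy> attribute; fall back to char if it is absent."""
    if policy:
        preferred = symbol.get(f"char-{policy}")
        if preferred is not None:
            return preferred
    return symbol.get("char")

def build_map(fontmap_xml, policy=None):
    """Map each symbol number to the character chosen under the given policy."""
    root = ET.fromstring(fontmap_xml)
    return {s.get("number"): resolve_char(s, policy) for s in root.iter("symbol")}

default_map = build_map(FONTMAP)                     # no policy: always uses char
office_map = build_map(FONTMAP, policy="msoffice")   # prefers char-msoffice

for number in sorted(default_map):
    print(number,
          f"U+{ord(default_map[number]):04X}",
          f"U+{ord(office_map[number]):04X}")
```

In this sketch, an entry without a policy-specific attribute resolves to the same character under every policy, so a policy only needs to list the code points where it deviates from the default.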