Markdown reader: allow for more complex transliterations in ascii_identifiers or another extension

Wolf-SO commented 7 years ago

As mentioned in https://github.com/jgm/pandoc/issues/807#issuecomment-310831480 and https://github.com/jgm/pandoc/issues/807#issuecomment-310831794,

I'd like to have the option to transliterate non-ASCII characters into multiple ASCII characters. This would be especially helpful for German umlauts, since a (classical) convention already exists. I was looking into and working on src/Text/Pandoc/Asciify.hs, but I'm not sure whether it would be better to provide a new extension instead of modifying the existing one, since two-letter replacements would lengthen the HTML ids, which could break something.

I tried to expand the map and change it to Map Char String (and added the transliteration of the letter ß as ,('\223',"ss")). This is how it might look (just a fragment containing the 7 German transliterations).

toAsciiStr :: Char -> String
toAsciiStr c | isAscii c = [c]
             | otherwise = M.findWithDefault "" c asciiMap'

asciiMap' :: M.Map Char String
asciiMap' = M.fromList
  [('\192',"A")
  ,('\193',"A")
  ,('\194',"A")
  ,('\195',"A")
  ,('\196',"Ae")
  ,('\197',"A")
  ,('\199',"C")
  ,('\200',"E")
  ,('\201',"E")
  ,('\202',"E")
  ,('\203',"E")
  ,('\204',"I")
  ,('\205',"I")
  ,('\206',"I")
  ,('\207',"I")
  ,('\209',"N")
  ,('\210',"O")
  ,('\211',"O")
  ,('\212',"O")
  ,('\213',"O")
  ,('\214',"Oe")
  ,('\217',"U")
  ,('\218',"U")
  ,('\219',"U")
  ,('\220',"Ue")
  ,('\221',"Y")
  ,('\223',"ss")
  ,('\224',"a")
  ,('\225',"a")
  ,('\226',"a")
  ,('\227',"a")
  ,('\228',"ae")
  ,('\229',"a")
  ,('\231',"c")
  ,('\232',"e")
  ,('\233',"e")
  ,('\234',"e")
  ,('\235',"e")
  ,('\236',"i")
  ,('\237',"i")
  ,('\238',"i")
  ,('\239',"i")
  ,('\241',"n")
  ,('\242',"o")
  ,('\243',"o")
  ,('\244',"o")
  ,('\245',"o")
  ,('\246',"oe")
  ,('\249',"u")
  ,('\250',"u")
  ,('\251',"u")
  ,('\252',"ue")
  ,('\253',"y")
...

Wolf-SO commented 7 years ago

The toAsciiStr function above is obviously incompatible with Parsing.hs, line 1205, because the id is built via

catMaybes $ map toAsciiChar id'

...but I'm very new to Haskell and maybe this is easy to change...
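
One possible way around the incompatibility, as a minimal sketch: since each character now maps to a String (possibly empty) rather than a Maybe Char, the catMaybes/map pair could be replaced by concatMap. The names translitMap and asciifyId are hypothetical, and the map is only a small excerpt of the fragment above; this is not how Parsing.hs is actually written.

```haskell
import Data.Char (isAscii)
import qualified Data.Map as M

-- Hypothetical excerpt of the proposed Char -> String map
-- (the 7 German transliterations plus one plain accent).
translitMap :: M.Map Char String
translitMap = M.fromList
  [ ('\196', "Ae"), ('\214', "Oe"), ('\220', "Ue")
  , ('\223', "ss")
  , ('\228', "ae"), ('\246', "oe"), ('\252', "ue")
  , ('\233', "e")
  ]

-- ASCII characters pass through; unmapped non-ASCII characters
-- become the empty string, i.e. they are dropped.
toAsciiStr :: Char -> String
toAsciiStr c
  | isAscii c = [c]
  | otherwise = M.findWithDefault "" c translitMap

-- Stand-in for `catMaybes $ map toAsciiChar id'`:
-- concatMap flattens the per-character strings into one id.
asciifyId :: String -> String
asciifyId = concatMap toAsciiStr
```

With this shape, dropping a character and expanding it to two letters are handled by the same code path, so no Maybe is needed.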

jgm commented 7 years ago

See also #2821 for a request touching a related part of the code. Note that GitHub generates %C3%9F in the identifier for the ß (so they UTF-8-encode the character, then URL-encode the octets).
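
To make the two steps concrete, here is a rough sketch using only base: encode each character as UTF-8 octets, then percent-encode every octet. The function names are hypothetical, and the set of characters left unescaped ("-_.~" plus ASCII letters and digits) is an assumption for illustration; it is not meant to reproduce GitHub's exact rules.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (isAlphaNum, isAscii, ord, toUpper)
import Numeric (showHex)

-- UTF-8 octets of a single code point.
utf8Octets :: Char -> [Int]
utf8Octets c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [ 0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12)
                  , cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    -- Continuation byte: 10xxxxxx with the low six bits of x.
    cont x = 0x80 .|. (x .&. 0x3F)

-- One octet as %XX with upper-case hex digits.
percentOctet :: Int -> String
percentOctet o = '%' : pad (map toUpper (showHex o ""))
  where pad s = replicate (2 - length s) '0' ++ s

-- Assumption: only ASCII letters, digits and "-_.~" stay unescaped.
percentEncode :: String -> String
percentEncode = concatMap encodeChar
  where
    encodeChar c
      | isAscii c && (isAlphaNum c || c `elem` "-_.~") = [c]
      | otherwise = concatMap percentOctet (utf8Octets c)
```

For ß ('\223', U+00DF) the octets are 0xC3 and 0x9F, which gives the %C3%9F mentioned above.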

Wolf-SO commented 7 years ago

@jgm What about switching to a two-stage approach? #2821 seems to be related to the output format, and this issue is also not only about input. HTML5 supports a superset of HTML4 ids. Wouldn't it be better to first read a "unified Pandoc id" (without leading numbering but including -) and later output an id that is compatible with the required HTML version?

It seems that the labels should also include [writer] and [format:HTML].
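
A rough sketch of what I mean by two stages; all names here (UnifiedId, HtmlVersion, renderId) are hypothetical and none of this exists in pandoc. The reader would keep the id as read, and the writer would narrow it only as far as the target requires: HTML4 ids must start with an ASCII letter and may continue with letters, digits, "-", "_", ":" and ".", while HTML5 only requires a non-empty id without whitespace. Using isSpace as the whitespace test is an approximation.

```haskell
import Data.Char (isAlpha, isAlphaNum, isAscii, isSpace)

data HtmlVersion = Html4 | Html5 deriving (Eq, Show)

-- Stage 1 result: the id as produced by the reader,
-- not yet restricted to any output format.
newtype UnifiedId = UnifiedId String deriving (Eq, Show)

-- Stage 2: narrow the id for the requested HTML version.
renderId :: HtmlVersion -> UnifiedId -> String
renderId Html5 (UnifiedId s) = filter (not . isSpace) s
renderId Html4 (UnifiedId s) =
  -- Drop disallowed characters, then any prefix before the
  -- first ASCII letter.
  dropWhile (not . isAsciiLetter) (filter allowed s)
  where
    isAsciiLetter c = isAscii c && isAlpha c
    allowed c = isAscii c && (isAlphaNum c || c `elem` "-_:.")
```

A transliteration step could run before the HTML4 filter so that characters are replaced rather than dropped; the sketch leaves that out to keep the two stages visible.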

Gullumluvl commented 4 years ago

Sorry to revive this thread. If there is to be any update on this, as a French speaker I would add to the list:

œ: ('\339',"oe")
Œ: ('\338',"OE")
æ: ('\230',"ae")
Æ: ('\198',"AE")
(+ all accented versions of them...)

Or maybe those are a direct fit for the Asciify module (they are currently dropped when using +ascii_identifiers in the input format)?

Icelandic has þ (thorn) '\254', transliterated into "th"...
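
As a sketch only, these entries could live in a separate map that is merged into a Map Char String like the one proposed at the top of the thread. The names extraTranslit and withExtras are hypothetical.

```haskell
import qualified Data.Map as M

-- The ligatures and thorn listed above.
extraTranslit :: M.Map Char String
extraTranslit = M.fromList
  [ ('\339', "oe")  -- U+0153, oe ligature
  , ('\338', "OE")  -- U+0152
  , ('\230', "ae")  -- U+00E6
  , ('\198', "AE")  -- U+00C6
  , ('\254', "th")  -- U+00FE, thorn
  ]

-- M.union is left-biased, so entries already present in the
-- base map win over the extras on conflict.
withExtras :: M.Map Char String -> M.Map Char String
withExtras base = M.union base extraTranslit
```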

Maybe there is a resource for finding Unicode characters that actually represent several characters? Such as

½ : 189 (U+00BD)
¼ : 188 (U+00BC)
¾ : 190 (U+00BE)
™ : 8482 (U+2122)

jgm / pandoc

Markdown reader: allow for more complex transliterations in ascii_identifiers or another extension #3757