mmechtley / pandoc-filter-test

Playing with pandoc filters to translate unicode literals

Generalizing and speeding it up #1

Open sergiocorreia opened 8 years ago

sergiocorreia commented 8 years ago

This filter seems potentially very useful but will probably require adding more symbols, at which point it could become quite slow. However, I think there might be a way to have both more symbols and speed.

Symbols

One idea is to add the symbols discussed here: http://stackoverflow.com/a/2356160/3977107

It's an XML file with the entire Unicode-HTML-LaTeX map, which would even allow you to generalize the filter to deal with HTML. The entries are quite verbose but can be parsed into a Python dict (and maybe even pickled); a rough parsing sketch follows the example below.

Example: (notice the dec attribute, which is equal to ord("Θ"), and the latex subentry "\Theta")

<character id="U00398" dec="920" mode="math" type="alphabetic">
<afii>264B</afii>
<latex>\Theta</latex>
<Elsevier grid="ceq" ent="Theta">
<desc>theta (capital) -- Greek --</desc>
</Elsevier>
<AMS>\Theta</AMS>
<APS>Theta</APS>
<AIP>THgr</AIP>
<IEEE>\Theta</IEEE>
<Wolfram>CapitalTheta</Wolfram>
<entity id="THgr" set="ISOGRK1">
<desc>capital Theta, Greek</desc>
</entity>
<entity id="THgr" set="8879-isogrk1">
<desc>=capital Theta, Greek</desc>
</entity>
<entity id="Theta" set="html4-symbol">
<desc>greek capital letter theta</desc>
</entity>
<entity id="Theta" set="8879-isogrk3">
<desc>=capital Theta, Greek</desc>
</entity>
<entity id="Theta" set="9573-13-isogrk3">
<desc>/Theta capital Theta, Greek</desc>
</entity>
<font name="hlcrm" pos="2"/>
<description>GREEK CAPITAL LETTER THETA</description>
</character>
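
For concreteness, here is a rough sketch of how that XML could be parsed into a dict and pickled. The file name unicode.xml and the decision to skip entries without a latex macro or with multi-codepoint dec values are assumptions, not something prescribed by the W3C file:

import pickle
import xml.etree.ElementTree as ET

def build_map(xml_path="unicode.xml"):
    unicode2latex = {}
    root = ET.parse(xml_path).getroot()
    for char in root.iter("character"):
        dec = char.get("dec", "")
        latex = char.findtext("latex")
        if latex is None or not dec.isdigit():
            continue  # skip entries with no latex macro or multi-codepoint ids
        unicode2latex[chr(int(dec))] = latex
    return unicode2latex

if __name__ == "__main__":
    mapping = build_map()
    with open("unicode2latex.pkl", "wb") as f:
        pickle.dump(mapping, f)  # pickle once, load quickly inside the filter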

Speed

In a large document there are probably thousands of elements with text attributes, and there are thousands of unicode chars that we want to replace, so the code in unicode_replace() would be too slow.

The example here could give an alternative: http://stackoverflow.com/a/196392/3977107

First, detect if there are unicode characters at all (if not, just pass):

def has_unicode(text):
    return any(ord(c) > 127 for c in text)

Even faster but less legible:

def has_unicode(text):
    return any(c > "\x7F" for c in text)

Now, if there are unicode characters, you can do something like:

def translate(text):
    return ''.join(unicode2latex.get(c, c) for c in text)
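
As a sketch of how the two pieces might plug into the filter itself (assuming the pandocfilters package is used and unicode2latex is the map built from the W3C xml):

from pandocfilters import toJSONFilter, RawInline

def replace_unicode(key, value, fmt, meta):
    # Only touch plain text nodes that actually contain non-ASCII characters,
    # and only when producing latex output
    if key == 'Str' and fmt == 'latex' and has_unicode(value):
        return RawInline('latex', translate(value))

if __name__ == '__main__':
    toJSONFilter(replace_unicode)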

Alternatively, you could just import pylatexenc and call its utf8tolatex function. That function is not very efficient, but the dict defined at the beginning of that module might be useful.
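A hedged example of that route (pylatexenc 1.x exposes utf8tolatex in pylatexenc.latexencode; the exact output formatting may differ):

from pylatexenc.latexencode import utf8tolatex

print(utf8tolatex("z ≃ 6"))  # something along the lines of "z {\simeq} 6"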

Cheers, S

mmechtley commented 8 years ago

Awesome suggestions, thanks Sergio! Should be pretty easy to implement and the W3C unicode mapping xml is particularly useful for generalizing to support more characters.

mmechtley commented 8 years ago

An update on this: it's of course more complicated than originally expected. In particular, one of my primary goals for personal use is to allow implicit conversion to math mode when needed. I use constructs like $z\simeq{}6$ frequently and find them hard to parse visually; I'd much prefer to write z≃6. The W3C xml provides a helpful mode attribute (math|text|mixed) per character:

<character id="U000A1" dec="161" mode="text" type="punctuation">
<afii>00A1</afii>
<latex>\textexclamdown</latex>
...
<description>INVERTED EXCLAMATION MARK</description>
</character>
<character id="U02243" dec="8771" mode="math" type="relation">
<afii>EF77</afii>
<latex>\simeq</latex>
...
<description>ASYMPTOTICALLY EQUAL TO</description>
</character>

However, this doesn't seem to map cleanly onto LaTeX's math mode vs. text mode distinction. In particular, the non-breaking space (U000A0) is flagged as math mode, which is definitely not true in LaTeX.

Second, there are some characters that require separate Latex escapes depending on whether they are used in math mode:

<latex>\^{A}</latex>
<mathlatex>\hat{A}</mathlatex>

Fixing this is a little simpler when the node type is known (Math vs. Str): keep a separate mathlatex dict, look characters up there first when in math mode, and fall back to the standard text latex dict otherwise (see the sketch below).
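A minimal sketch of that lookup; text2latex and math2latex are assumed to be built from the <latex> and <mathlatex> entries, and in_math would come from the node type:

def latex_for(char, in_math):
    if in_math:
        # Prefer the math-mode escape, fall back to the text one
        return math2latex.get(char) or text2latex.get(char, char)
    # Prefer the text escape outside math mode
    return text2latex.get(char) or math2latex.get(char, char)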

sergiocorreia commented 8 years ago

Makes a lot of sense.

Let me know if you need help debugging or tuning it up; I can test how it performs on reasonably-sized documents.