Open sergiocorreia opened 8 years ago
Awesome suggestions, thanks Sergio! Should be pretty easy to implement and the W3C unicode mapping xml is particularly useful for generalizing to support more characters.
An update on this: it's of course more complicated than originally expected.
In particular, one of my primary goals for my personal use is to allow implicit conversion to math mode when needed. I use constructs like $z\simeq{}6$
frequently and I find them hard to parse visually. I'd much prefer to write z≃6. The W3C xml provides a helpful mode attribute (math|text|mixed) per-character:
```xml
<character id="U000A1" dec="161" mode="text" type="punctuation">
  <afii>00A1</afii>
  <latex>\textexclamdown</latex>
  ...
  <description>INVERTED EXCLAMATION MARK</description>
</character>
<character id="U02243" dec="8771" mode="math" type="relation">
  <afii>EF77</afii>
  <latex>\simeq</latex>
  ...
  <description>ASYMPTOTICALLY EQUAL TO</description>
</character>
```
However, this doesn't map cleanly onto Latex's math mode vs. text mode. In particular, non-breaking space (U000A0) is flagged as math mode, which is definitely not true in Latex.
Second, there are some characters that require separate Latex escapes depending on whether they are used in math mode:
```xml
<latex>\^{A}</latex>
<mathlatex>\hat{A}</mathlatex>
```
Fixing this is a little simpler if the node type is correct (math vs. str). We just need a separate mathlatex dict: if in math mode, look there first; otherwise, look in the standard text latex dict first.
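Roughly what I have in mind (just a sketch; the dict names and sample entries are illustrative):

```python
LATEX = {"\u00c2": r"\^{A}"}        # text-mode escapes, from <latex>
MATHLATEX = {"\u00c2": r"\hat{A}"}  # math-mode escapes, from <mathlatex>

def latex_escape(char, in_math):
    # Prefer the escape matching the current mode, fall back to the other
    # dict, and leave the character untouched if neither has an entry.
    if in_math:
        return MATHLATEX.get(char) or LATEX.get(char) or char
    return LATEX.get(char) or MATHLATEX.get(char) or char
```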
Makes a lot of sense.
Let me know if you need help debugging or tuning it up; I can test how it performs in reasonably-sized documents.
This filter seems potentially very useful but will probably require adding more symbols, at which point it could become quite slow. However, I think there might be a way to have both more symbols and speed.
Symbols
One idea is to add the symbols discussed here: http://stackoverflow.com/a/2356160/3977107
It's an XML file with the entire unicode-html-latex map, which would allow you to even generalize it to deal with html. The entries are quite verbose but can be parsed to a python map (maybe even pickled).
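For what it's worth, here is a rough sketch of building that map with the standard library (the filename, the tuple layout, and the guard for multi-codepoint entries are just assumptions):

```python
import xml.etree.ElementTree as ET

def build_latex_map(path="unicode.xml"):
    """Map codepoint -> (latex, mathlatex, mode) for each character entry."""
    table = {}
    for char in ET.parse(path).getroot().iter("character"):
        dec = char.get("dec")
        if not (dec and dec.isdigit()):
            continue  # skip entries whose dec attribute is not a plain integer
        latex = char.findtext("latex")
        mathlatex = char.findtext("mathlatex")
        if latex or mathlatex:
            table[int(dec)] = (latex, mathlatex, char.get("mode"))
    return table
```

Pickling the result, as suggested above, would keep startup fast once the table grows.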
For example, the Θ entry has a dec attribute equal to `ord("Θ")` (920) and a latex subentry `\Theta`.

Speed
In a large document there are probably thousands of elements with text attributes, and there are thousands of unicode chars that we want to replace, so the code in `unicode_replace()` would be too slow. The example here could give an alternative: http://stackoverflow.com/a/196392/3977107
First, detect if there are unicode characters at all (if not, just pass):
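For instance, along the lines of the linked answer (plain Python, no dependencies):

```python
def is_ascii(text):
    # True when the string contains nothing worth replacing.
    return all(ord(c) < 128 for c in text)
```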
Even faster but less legible:
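One option (assuming Python 3 strings) is the try/except variant, which does the scan in C:

```python
def is_ascii(text):
    try:
        text.encode("ascii")   # fails on the first non-ASCII character
        return True
    except UnicodeEncodeError:
        return False
```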
Now, if there are unicode characters, you can do something like:
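For instance, one possible shape for `unicode_replace()` using `str.translate` (the table entries below are placeholders for the full unicode.xml map):

```python
# Placeholder entries; in practice this would be the map built from unicode.xml.
LATEX_TABLE = {0x2243: r"\simeq{}", 0x00A1: r"\textexclamdown{}"}

def unicode_replace(text):
    if is_ascii(text):                  # cheap early exit from the check above
        return text
    return text.translate(LATEX_TABLE)
```

`str.translate` does its lookups in C and leaves unmapped characters alone, so it should stay fast even with thousands of entries in the table.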
Alternatively, you can just import pylatexenc and call its `utf8tolatex` function. That function is not very efficient, but the dict defined at the beginning might be useful.

Cheers,
S