Unicode is the standard encoding for text. It defines well over a hundred thousand characters, representing letters, digits, symbols and marks from a huge variety of scripts and contexts.
There are many repeated, variant or combined characters. These can be normalized to a subset of characters, using standard normalization algorithms.
These algorithms are generic: applied in a mathematical context, they can miss an equivalence between two characters that a mathematician would consider interchangeable, or discard information that would be useful.
So this project aims to compile a dictionary of mappings from less-common Unicode characters to the symbols conventionally used in linear mathematical expressions.
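For instance, Python's standard `unicodedata` module implements the generic normalization forms; a quick demonstration of where they help and where they fall short in a mathematical context:

```python
import unicodedata

# NFKC maps the styled letter 𝒚 to a plain y, discarding the bold/italic
# styling, and flattens x² to x2, silently losing the exponent.
# Symbols such as √, × and α are left untouched.
for s in ['𝒚', 'x²', '√ × α']:
    print(s, '->', unicodedata.normalize('NFKC', s))
```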
The motivation is to support more characters in the JME language used by Numbas. The JME grammar has several kinds of token, including names, numbers, operators, brackets and string literals.
Any string of letter characters is acceptable for name tokens, but there are many equivalences that should be applied: `α` and `alpha`, or `∞` and `infinity`. Some characters are equivalent to sub-expressions, e.g. `x²` is equivalent to `x^2` and `√` is equivalent to `sqrt`, or to operations that are conventionally typed in ASCII, e.g. `×` is equivalent to `*`.

Digit symbols should be normalized to the ASCII digits 0-9, where possible; there is a sketch of this below. Non-European scripts for representing numbers would need to be dealt with individually.

There are lots of varieties of brackets, which should normalize to the ASCII parentheses `()`, square brackets `[]` and curly brackets `{}`.
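Much of the digit normalization can be done mechanically. A minimal sketch using Python's `unicodedata` module, which knows the value of decimal digit characters in many scripts (though it can't help with non-positional numeral systems):

```python
import unicodedata

def normalize_digit(ch):
    # unicodedata.decimal gives the value of a decimal digit character in
    # any script, or the supplied default when the character isn't one.
    value = unicodedata.decimal(ch, None)
    return str(value) if value is not None else ch

print(normalize_digit('٣'))   # '3' - ARABIC-INDIC DIGIT THREE
print(normalize_digit('３'))  # '3' - FULLWIDTH DIGIT THREE
print(normalize_digit('x'))   # 'x' - not a digit, passed through
```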
There is a Jupyter notebook, `unicode-math-mapping.ipynb`, which contains code for working through subsets of Unicode and producing mapping dictionaries.
There are some mappings that can be produced automatically, and some that had to be written out manually - these are defined by the `.tsv` files in the root of this repository.
The mapping information is stored in `.json` files in the `final_data` directory.
Each of these files contains a single dictionary, mapping each Unicode character to a pair of an equivalent string and an array of annotations, which are themselves ASCII strings.
For example, this is the entry in `final_data/symbols.json` for `𝒚`, "MATHEMATICAL BOLD ITALIC SMALL Y":

```json
"\ud835\udc9a": [
    "y",
    [
        "BOLD",
        "ITALIC"
    ]
],
```
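A minimal sketch of reading an entry back in Python (assuming it is run from the root of this repository):

```python
import json

with open('final_data/symbols.json', encoding='utf-8') as f:
    symbols = json.load(f)

# JSON's surrogate-pair escape "\ud835\udc9a" decodes to U+1D49A, 𝒚.
replacement, annotations = symbols['\U0001D49A']
print(replacement)   # y
print(annotations)   # ['BOLD', 'ITALIC']
```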
There are seven files:

* `greek.json` - mapping Greek letters to their English names, e.g. `α` to `alpha`.
* `letters.json` - mapping mathematical letters to their standard equivalents, with annotations.
* `subscripts.json` - mapping subscript characters to their standard equivalents.
* `superscripts.json` - mapping superscript characters to their standard equivalents.
* `symbols.json` - mapping all sorts of mathematical symbols to common symbols, names, or sub-expressions. Some symbols are mapped to a string of the form `not NAME` - you might have to do some processing to interpret these correctly, instead of just substituting the mapped string into the expression being parsed.
* `punctuation.json` - mapping punctuation characters to symbols. This mapping could be combined with the one in `symbols.json`.
* `brackets.json` - mapping grouping characters (parentheses, square brackets and curly braces) to their standard equivalents.

The mappings must be applied as part of the tokenisation step when parsing a mathematical expression.
It is not correct to do a global substitution of characters before parsing: for example, in the expression `α = "α"`, the second occurrence of `α` should be preserved because it's inside a string literal.
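A minimal sketch of how the substitution might be wired into tokenisation, assuming a simplified scanner that only distinguishes `"…"` string literals from everything else (a real JME tokeniser is more involved, and entries of the form `not NAME` in `symbols.json` would need extra handling):

```python
import json

def load_mappings(*filenames):
    """Combine mapping files from final_data into one char -> string dict."""
    mapping = {}
    for filename in filenames:
        with open(f'final_data/{filename}', encoding='utf-8') as f:
            for char, entry in json.load(f).items():
                # Each entry is [replacement, annotations], as in the
                # example above; tolerate a bare string just in case.
                mapping[char] = entry[0] if isinstance(entry, list) else entry
    return mapping

def normalize_expression(expr, mapping):
    """Substitute mapped characters, leaving string literals untouched."""
    out = []
    in_string = False
    for ch in expr:
        if ch == '"':
            in_string = not in_string
            out.append(ch)
        elif in_string:
            out.append(ch)
        else:
            out.append(mapping.get(ch, ch))
    return ''.join(out)

mapping = load_mappings('greek.json', 'symbols.json')
print(normalize_expression('α = "α"', mapping))  # alpha = "α"
```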
You will have to come up with a way of applying the produced mappings to a particular computer algebra system.
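For example, to target SymPy one might feed the normalized string to its parser (purely illustrative, reusing `normalize_expression` and `mapping` from the sketch above, and assuming `symbols.json` maps `×` to `*`):

```python
from sympy.parsing.sympy_parser import parse_expr

expr = parse_expr(normalize_expression('α × 2', mapping))
print(expr)  # 2*alpha
```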
There were many decisions to make in producing the mapping of characters. I omitted most symbols relating to operations that are very unlikely to be used in an undergraduate maths course.
The function names for mappings were sometimes chosen arbitrarily - there might be standard names for these in some computer algebra systems.
If you have a character that is not dealt with by any of the mapping files or by the advice in the notebook, then you can:
If you use these mappings, please tell me!