josdejong / mathjs

An extensive math library for JavaScript and Node.js
https://mathjs.org
Apache License 2.0

Support for unicode characters #265

Closed by josdejong 8 years ago

josdejong commented 9 years ago

Would be nice to have support for unicode characters in the expression parser, so you can define variables containing special characters.

balagge commented 9 years ago

:) I saw this comment just now. Man, this would be VERY useful for me, as I use rendered formulas (MathML) as input, which may contain many different characters. So now I have to create a 1-to-1 mapping between names containing (for example) greek letters and latin variable names. This leads to having two names for the same variable, which is always a headache.

However, this feature is only useful (for me at least) if it also supports the Unicode "Mathematical Alphanumeric Symbols" block. Unfortunately this block is not in the BMP but in the SMP part of Unicode, and cannot be represented in UTF-16 without using surrogate pairs. JavaScript strings are UTF-16, not UTF-32, so you end up manipulating pairs of 16-bit code units instead of a single 32-bit character.

Maybe that is not a big issue in mathjs, as you don't have to inspect variable names too much. However, it is a problem in the parser: when you encounter such a pair, you have to advance the parser position by 2 instead of 1 and treat the pair as a single character.
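The position bookkeeping described above can be sketched roughly as follows. This is only an illustration under my own assumptions, not the mathjs parser: `nextChar` and `splitChars` are hypothetical helper names, and a real parser would also have to decide what to do with an unpaired surrogate.

```javascript
// Hypothetical helpers for illustration; not part of mathjs.
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Return the character starting at `pos` and the position of the next one.
// A high surrogate followed by a low surrogate is treated as ONE character,
// so the position advances by 2 instead of 1.
function nextChar(expr, pos) {
  const first = expr.charCodeAt(pos);
  if (
    isHighSurrogate(first) &&
    pos + 1 < expr.length &&
    isLowSurrogate(expr.charCodeAt(pos + 1))
  ) {
    return { char: expr.slice(pos, pos + 2), next: pos + 2 };
  }
  return { char: expr.charAt(pos), next: pos + 1 };
}

// Walk an expression character by character, keeping pairs together.
function splitChars(expr) {
  const chars = [];
  let pos = 0;
  while (pos < expr.length) {
    const { char, next } = nextChar(expr, pos);
    chars.push(char);
    pos = next;
  }
  return chars;
}

// '\uD835\uDC00' is U+1D400 (mathematical bold capital A).
const chars = splitChars('\uD835\uDC00+b');
console.log(chars.length); // 3, although the string has 4 UTF-16 code units
```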

josdejong commented 9 years ago

Thanks for your feedback. I haven't yet looked at what would be needed to support Unicode, but I know that there are tricky cases where a single Unicode character is represented by two characters.

pedroteixeira commented 9 years ago

:+1: I too would really find it useful to allow Unicode in symbol names.

josdejong commented 9 years ago

I've added support for unicode characters. I've been a bit conservative here, allowing latin letters with accents and greek letters now. What do you think, would that be enough for practical usage?

pedroteixeira commented 9 years ago

That's great! My case was exactly latin and greek letters :)

balagge commented 9 years ago

Great! But as I commented above, I would need the Mathematical Alphanumeric Symbols block as well, if possible. These are special characters designed specifically for use in math identifier names. The rationale is that in everyday math text the typesetting of a character is semantically important (e.g. a bold letter may mean a vector, etc.). No matter what the intended purpose is, a bold variable name is considered a different variable than a normal one (which is, by the way, italic by default). Encoding this information in the actual name string saves a lot of work that would otherwise have to be done externally. It also ensures that no matter where you "take" that name, it will still contain this additional information.

I don't think there is a big problem there. Surrogate pairs (those that encode non-BMP characters like the ones I'd like to have) seem to be supported in variable names, property names, strings, etc. in JavaScript, so you won't even notice the difference.

The only place where care is needed is when you manipulate a string by character position or take the length of a string. There these pairs will show up as "two characters". In every other way the two code units behave like ordinary characters. They just must not be separated, because the result of separating them is unpredictable.
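A small sketch of where the pair shows through and where it does not. The variable names are mine and chosen only for illustration; the string methods used are standard ES2015.

```javascript
// U+1D400, mathematical bold capital A, written with a code point escape.
const name = '\u{1D400}';

// Length and position work on UTF-16 code units, so the pair is visible here.
const lengthInUnits = name.length; // 2
const firstUnit = name.charAt(0); // a lone high surrogate, '\uD835'

// Iterating by code point keeps the pair together.
const lengthInCodePoints = Array.from(name).length; // 1
const codePoint = name.codePointAt(0); // 0x1D400

// As a property name the pair behaves like any other string.
const scope = {};
scope[name] = 3;

console.log(lengthInUnits, lengthInCodePoints, scope[name]);
```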

The ranges are D800-DBFF (high surrogate block) and DC00-DFFF (low surrogate block). Enabling these blocks completely would mean that any Unicode character in the supplementary planes (U+10000 and above, a.k.a. 'astral') can be encoded and is allowed. So maybe you want to limit that to the actual mathematical characters, U+1D400 ... U+1D7FF. As surrogate pairs, these run from [xD835, xDC00] to [xD835, xDFFF]. So this would mean that a single high surrogate (xD835) and the complete low surrogate block should be enabled.
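The pair values can be checked with the standard UTF-16 encoding arithmetic. The sketch below uses two hypothetical helper names, `toSurrogatePair` and `isMathSurrogatePair`; the second one is just one possible way to express the "single high surrogate plus full low surrogate block" rule, not something taken from mathjs.

```javascript
// Encode a supplementary code point (U+10000 and above) as a UTF-16 pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10); // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

// Hypothetical check: accept only pairs inside U+1D400 ... U+1D7FF, i.e.
// the high surrogate 0xD835 followed by any low surrogate.
function isMathSurrogatePair(high, low) {
  return high === 0xD835 && low >= 0xDC00 && low <= 0xDFFF;
}

const first = toSurrogatePair(0x1D400); // [0xD835, 0xDC00]
const last = toSurrogatePair(0x1D7FF); // [0xD835, 0xDFFF]
console.log(first.map((u) => u.toString(16)), last.map((u) => u.toString(16)));
```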

This would help me a lot :)

josdejong commented 9 years ago

@balagge sure, we will add these blocks of unicode too. Thanks for looking them up.

balagge commented 9 years ago

... I forgot to mention that some of the math characters are NOT in the range given above (because they already existed and were left in their original position instead of being duplicated in the new math range). Also, 4 additional code points are not used (reserved). These are:

| Valid code point | Character name | Invalid code point | Comment |
| --- | --- | --- | --- |
| U+210E | planck constant | U+1D455 | despite the name ("planck constant"), it is also used as "mathematical italic small h" |
| U+212C | script capital B | U+1D49D | "script ..." = "mathematical script ..." |
| U+2130 | script capital E | U+1D4A0 | |
| U+2131 | script capital F | U+1D4A1 | |
| U+210B | script capital H | U+1D4A3 | |
| U+2110 | script capital I | U+1D4A4 | |
| U+2112 | script capital L | U+1D4A7 | |
| U+2133 | script capital M | U+1D4A8 | |
| U+211B | script capital R | U+1D4AD | |
| U+212F | script small e | U+1D4BA | |
| U+210A | script small g | U+1D4BC | |
| U+2134 | script small o | U+1D4C4 | |
| U+212D | black-letter capital C | U+1D506 | "black-letter ..." = "mathematical fraktur ..." |
| U+210C | black-letter capital H | U+1D50B | |
| U+2111 | black-letter capital I | U+1D50C | |
| U+211C | black-letter capital R | U+1D515 | |
| U+2128 | black-letter capital Z | U+1D51D | |
| U+2102 | double-struck capital C | U+1D53A | "double-struck ..." = "mathematical double-struck ..." |
| U+210D | double-struck capital H | U+1D53F | |
| U+2115 | double-struck capital N | U+1D545 | |
| U+2119 | double-struck capital P | U+1D547 | |
| U+211A | double-struck capital Q | U+1D548 | |
| U+211D | double-struck capital R | U+1D549 | |
| U+2124 | double-struck capital Z | U+1D551 | |
| | | U+1D6A6 | reserved |
| | | U+1D6A7 | reserved |
| | | U+1D7CC | reserved |
| | | U+1D7CD | reserved |

Note: the code points under "Valid code point" should be allowed; these are older BMP characters. The code points under "Invalid code point" are where the new range contains a "hole" of unused / non-existing characters. These should NOT be accepted.
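Put together, the rule could look roughly like the sketch below. The names `HOLES`, `BMP_MATH_LETTERS` and `isMathAlphanumeric` are hypothetical and this is not the mathjs implementation; the two sets simply transcribe the lists above.

```javascript
// Holes inside U+1D400 ... U+1D7FF: unassigned positions that must be rejected.
const HOLES = new Set([
  0x1D455, 0x1D49D, 0x1D4A0, 0x1D4A1, 0x1D4A3, 0x1D4A4, 0x1D4A7, 0x1D4A8,
  0x1D4AD, 0x1D4BA, 0x1D4BC, 0x1D4C4, 0x1D506, 0x1D50B, 0x1D50C, 0x1D515,
  0x1D51D, 0x1D53A, 0x1D53F, 0x1D545, 0x1D547, 0x1D548, 0x1D549, 0x1D551,
  0x1D6A6, 0x1D6A7, 0x1D7CC, 0x1D7CD
]);

// Older BMP characters that fill the letter-shaped holes and must be accepted.
const BMP_MATH_LETTERS = new Set([
  0x210E, 0x212C, 0x2130, 0x2131, 0x210B, 0x2110, 0x2112, 0x2133,
  0x211B, 0x212F, 0x210A, 0x2134, 0x212D, 0x210C, 0x2111, 0x211C,
  0x2128, 0x2102, 0x210D, 0x2115, 0x2119, 0x211A, 0x211D, 0x2124
]);

// Hypothetical predicate: is this code point an acceptable math letter/digit?
function isMathAlphanumeric(codePoint) {
  if (BMP_MATH_LETTERS.has(codePoint)) {
    return true;
  }
  return codePoint >= 0x1D400 && codePoint <= 0x1D7FF && !HOLES.has(codePoint);
}

console.log(isMathAlphanumeric(0x210E)); // true: planck constant, in the BMP
console.log(isMathAlphanumeric(0x1D455)); // false: the hole it replaces
```

To test a character from a string rather than a number, pass `str.codePointAt(pos)` so that a surrogate pair is read as one code point.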

josdejong commented 9 years ago

Thanks @balagge. I don't expect to be able to implement this within the next week. Feel free to create a pull request adding all additional Unicode characters (in this commit https://github.com/josdejong/mathjs/commit/33370bfe73dccb51a15018c8e79c6030079442bf you can see where to add the characters and how to unit test).

josdejong commented 8 years ago

@balagge I've added support for mathematical symbols. It's in the develop branch, and you can have a look at the implementation:

https://github.com/josdejong/mathjs/blob/develop/lib/expression/parse.js#L368-L403

josdejong commented 8 years ago

The mathematical symbols are now supported in the just-released v2.4.0. It would be great if you could give this a try, @balagge.