Support for unicode characters

josdejong commented 9 years ago

Would be nice to have support for unicode characters in the expression parser, so you can define variables containing special characters.

balagge commented 9 years ago

:) I saw this comment just now. Man, this would be VERY useful for me, as I use rendered formulas (MathML) as input, which may contain many different characters. So now I have to create a 1-to-1 mapping between names containing (for example) greek letters and latin variable names. This leads to having two names for the same variable, which is always a headache.

However, this feature is only useful (for me at least) if it also supports the Unicode "Mathematical Alphanumeric Symbols" block. Unfortunately this block is not in the BMP but in the SMP part of Unicode, so it cannot be represented in UTF-16 without surrogate pairs. JavaScript strings are not UTF-32: they are sequences of 16-bit code units, so you end up manipulating pairs of 16-bit characters instead of a single 32-bit character.

Maybe that is not a big issue in mathjs, as you don't have to inspect variable names too much. It is a problem in the parser, however: you have to recognize when you encounter such a pair, advance the parser position by 2 instead of 1, and treat the pair as a single character.
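
To illustrate the point about advancing the parser position, here is a minimal sketch of a scanner that treats a surrogate pair as one character. The helper names `readChar` and `splitChars` are made up for this example and are not taken from the mathjs source.

```javascript
// Hypothetical helper (not part of mathjs): reads one "character" at `index`,
// returning either a single UTF-16 code unit or a complete surrogate pair.
function readChar(expr, index) {
  const high = expr.charCodeAt(index)
  // 0xD800-0xDBFF is the high surrogate range
  if (high >= 0xD800 && high <= 0xDBFF && index + 1 < expr.length) {
    const low = expr.charCodeAt(index + 1)
    // 0xDC00-0xDFFF is the low surrogate range
    if (low >= 0xDC00 && low <= 0xDFFF) {
      return expr.slice(index, index + 2)
    }
  }
  return expr.charAt(index)
}

// Hypothetical helper: walks the expression, advancing by 2 over each pair.
function splitChars(expr) {
  const chars = []
  let index = 0
  while (index < expr.length) {
    const c = readChar(expr, index)
    chars.push(c)
    index += c.length // 1 for a BMP character, 2 for a surrogate pair
  }
  return chars
}
```

For an input such as `'a\uD835\uDC00b'`, `splitChars` yields three entries, with the pair kept together as the middle one.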

josdejong commented 9 years ago

Thanks for your feedback. I haven't yet looked at what would be needed to support unicode, but I know that there are tricky cases where a single unicode character is represented by two characters.

pedroteixeira commented 9 years ago

:+1: Me too, I would really find it useful to allow unicode in symbol names.

josdejong commented 9 years ago

I've added support for unicode characters. I've been a bit conservative here, allowing latin letters with accents and greek letters now. What do you think, would that be enough for practical usage?

pedroteixeira commented 9 years ago

That's great! My case was exactly latin and greek letters :)

balagge commented 9 years ago

Great! But as I have commented above, the Mathematical Alphanumeric Symbols block would be needed for me as well, if possible. These are special characters designed especially for use in math identifier names. The rationale is that in everyday math text the character typesetting is semantically important (e.g. a bold letter may mean a vector, etc.). No matter what the intended purpose of using it, a bold variable name is considered a different variable than a normal one (which is, by the way, italic by default). Encoding this information in the actual name string saves a lot of work that must be done externally otherwise. Also, it ensures that no matter where you "take" that name, it will still contain this additional information.

I don't think there is a vast problem there. Surrogate pairs (those that encode non-BMP characters like the ones I'd like to have) seem to be supported in variable names, property names, strings, etc. in Javascript, so you won't even notice the difference.

The only place where care is needed is if you manipulate a string by character position, or taking the length of a string. There these pairs will show up as "two characters". But in any other way those two characters behave like ordinary characters. They just should / must not be separated because the result of separating them is unpredictable.
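
A small sketch of the difference described above, assuming an ES2015 environment (for `Array.from` and `codePointAt`). The helper `countCodePoints` is a hypothetical name used only for this example.

```javascript
// U+1D400 MATHEMATICAL BOLD CAPITAL A, written as its surrogate pair
const boldA = '\uD835\uDC00'

// Hypothetical helper: counts code points rather than UTF-16 code units.
function countCodePoints(str) {
  return Array.from(str).length // the string iterator keeps pairs together
}

// Used as a property name, the pair behaves like any other key
const scope = {}
scope[boldA] = 42

console.log(boldA.length)                       // 2 (code units)
console.log(countCodePoints(boldA))             // 1 (code points)
console.log(boldA.codePointAt(0).toString(16))  // '1d400'
console.log(scope[boldA])                       // 42
```

Only the position-based operations (`length`, `charAt`, slicing) see the two halves; lookups and comparisons work on the string as a whole.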

The ranges are D800-DBFF (high surrogate block) and DC00-DFFF (low surrogate block). Enabling these blocks completely would mean that any Unicode character in the supplementary planes (U+10000 and above, a.k.a. 'astral') can be encoded and is allowed. So maybe you want to limit that to the actual mathematical characters, U+1D400 ... U+1D7FF, which as surrogate pairs run from [xD835, xDC00] to [xD835, xDFFF]. So this would mean that a single high surrogate (xD835) and the complete low surrogate block should be enabled.

This would help me a lot :)
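
The surrogate values quoted above follow from the standard UTF-16 encoding arithmetic, which can be checked with a short sketch. Both function names are hypothetical and chosen for this example only; the range test deliberately ignores the reserved holes inside the block.

```javascript
// Hypothetical helper: converts a supplementary code point (U+10000 and above)
// into its [high, low] UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000
  const high = 0xD800 + (offset >> 10)   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF)  // bottom 10 bits
  return [high, low]
}

// Hypothetical helper: true when the pair encodes a code point in
// U+1D400 ... U+1D7FF, i.e. high surrogate 0xD835 plus any low surrogate.
function isMathBlockPair(high, low) {
  return high === 0xD835 && low >= 0xDC00 && low <= 0xDFFF
}

console.log(toSurrogatePair(0x1D400).map(n => n.toString(16))) // [ 'd835', 'dc00' ]
console.log(toSurrogatePair(0x1D7FF).map(n => n.toString(16))) // [ 'd835', 'dfff' ]
```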

josdejong commented 9 years ago

@balagge sure, we will add these blocks of unicode too. Thanks for looking them up.

balagge commented 9 years ago

... I forgot to mention that some of the math characters are NOT in the range given above (because they existed previously and were left in their original position instead of being duplicated in the new math range). Also, 4 additional code points are not used (reserved). These are:

Valid code point	Character	Name	Invalid code point	Comment
U+210E	ℎ	planck constant	U+1D455	despite the name ("planck constant"), it is also used as "mathematical italic small h"
U+212C	ℬ	script capital B	U+1D49D	"script ..." = "mathematical script ..."
U+2130	ℰ	script capital E	U+1D4A0
U+2131	ℱ	script capital F	U+1D4A1
U+210B	ℋ	script capital H	U+1D4A3
U+2110	ℐ	script capital I	U+1D4A4
U+2112	ℒ	script capital L	U+1D4A7
U+2133	ℳ	script capital M	U+1D4A8
U+211B	ℛ	script capital R	U+1D4AD
U+212F	ℯ	script small e	U+1D4BA
U+210A	ℊ	script small g	U+1D4BC
U+2134	ℴ	script small o	U+1D4C4
U+212D	ℭ	black-letter capital C	U+1D506	"black-letter ... " = "mathematical fraktur ..."
U+210C	ℌ	black-letter capital H	U+1D50B
U+2111	ℑ	black-letter capital I	U+1D50C
U+211C	ℜ	black-letter capital R	U+1D515
U+2128	ℨ	black-letter capital Z	U+1D51D
U+2102	ℂ	double-struck capital C	U+1D53A	"double-struck ... " = "mathematical double-struck ..."
U+210D	ℍ	double-struck capital H	U+1D53F
U+2115	ℕ	double-struck capital N	U+1D545
U+2119	ℙ	double-struck capital P	U+1D547
U+211A	ℚ	double-struck capital Q	U+1D548
U+211D	ℝ	double-struck capital R	U+1D549
U+2124	ℤ	double-struck capital Z	U+1D551
			U+1D6A6	reserved
			U+1D6A7	reserved
			U+1D7CC	reserved
			U+1D7CD	reserved

Note: "Valid code point" should be allowed, these are older BMP characters. "Invalid code point" is where the new range contains a "hole" of unused / non-existing characters. These should NOT be accepted.
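
As a rough sketch of how the table could be turned into a check, the following accepts the older BMP letters and rejects the holes. It works on code points rather than surrogate halves, covers only the code points listed in the table, and uses hypothetical names (`BMP_MATH_LETTERS`, `MATH_BLOCK_HOLES`, `isMathAlphanumeric`); it is not the mathjs implementation.

```javascript
// "Valid code point" column: older BMP characters that should be accepted
const BMP_MATH_LETTERS = new Set([
  0x210E, 0x212C, 0x2130, 0x2131, 0x210B, 0x2110, 0x2112, 0x2133,
  0x211B, 0x212F, 0x210A, 0x2134, 0x212D, 0x210C, 0x2111, 0x211C,
  0x2128, 0x2102, 0x210D, 0x2115, 0x2119, 0x211A, 0x211D, 0x2124
])

// "Invalid code point" column: holes and reserved positions to reject
const MATH_BLOCK_HOLES = new Set([
  0x1D455, 0x1D49D, 0x1D4A0, 0x1D4A1, 0x1D4A3, 0x1D4A4, 0x1D4A7,
  0x1D4A8, 0x1D4AD, 0x1D4BA, 0x1D4BC, 0x1D4C4, 0x1D506, 0x1D50B,
  0x1D50C, 0x1D515, 0x1D51D, 0x1D53A, 0x1D53F, 0x1D545, 0x1D547,
  0x1D548, 0x1D549, 0x1D551, 0x1D6A6, 0x1D6A7, 0x1D7CC, 0x1D7CD
])

// Hypothetical check: is this code point an acceptable math letter or digit?
function isMathAlphanumeric(codePoint) {
  if (BMP_MATH_LETTERS.has(codePoint)) return true
  return codePoint >= 0x1D400 && codePoint <= 0x1D7FF &&
    !MATH_BLOCK_HOLES.has(codePoint)
}

console.log(isMathAlphanumeric(0x210E))  // true  (planck constant)
console.log(isMathAlphanumeric(0x1D455)) // false (hole)
console.log(isMathAlphanumeric(0x1D400)) // true  (mathematical bold capital A)
```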

josdejong commented 9 years ago

Thanks @balagge . I don't expect to be able to implement this within the next week. Feel free to create a pull request adding all additional unicode characters (in this commit https://github.com/josdejong/mathjs/commit/33370bfe73dccb51a15018c8e79c6030079442bf you can see where to add the characters and how to unit test).

josdejong commented 8 years ago

@balagge I've added support for mathematical symbols. It's in the develop branch, and you can have a look at the implementation:

https://github.com/josdejong/mathjs/blob/develop/lib/expression/parse.js#L368-L403

josdejong commented 8 years ago

The mathematical symbols are now supported in the just released v2.4.0. It would be great if you could give this a try, @balagge .

josdejong / mathjs

Support for unicode characters #265