numbas / unicode-math-normalization

Data for normalizing mathematical expressions written in Unicode
Apache License 2.0

fullwidth left parenthesis (U+FF08) #1

Closed sangwinc closed 1 year ago

sangwinc commented 1 year ago

I can't find where the various alternative parentheses are defined?

christianp commented 1 year ago

In the "Punctuation characters" section of the notebook, I say that the various brackets and parentheses all seem to normalise to their ASCII equivalents, so you don't need to treat them specially. The table in that section only lists characters that don't normalise to anything in the "categories already understood by Numbas JME" defined at the top of the notebook (at the moment, that's '".{}?/\n:&;|^>=<-+*#!(),[]

sangwinc commented 1 year ago

Ok, we're not getting them to normalise! I'm planning to add some from here: https://www.fileformat.info/info/unicode/category/Ps/list.htm

christianp commented 1 year ago

Is that because you can't do normalisation, or you just choose not to? I'm happier relying on a standard algorithm.

If it'd help, I can add a bit to the notebook to make it produce lists of characters that normalise to each of the ASCII ones in the list I gave above.

sangwinc commented 1 year ago

Well, we've tried normalisation and it doesn't work. In particular https://www.php.net/manual/en/class.normalizer.php with Normalizer::FORM_KC transforms x² to x*2, and doesn't solve the brackets issue! The other options for PHP don't seem to work for me either.

christianp commented 1 year ago

I don't think you should normalise an entire expression string before parsing it: as you've shown, you lose important information, such as the fact that ² is a superscript. (To be clear, "x²" normalises to "x2", and I guess that you then insert the 'missing' asterisk as part of your existing filters.)
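To illustrate the point, here is a minimal sketch of what compatibility normalisation does to whole strings. It uses JavaScript's built-in String.prototype.normalize as a stand-in for PHP's Normalizer, on the assumption that both follow the same Unicode normalisation forms; the variable names are only for illustration.

```javascript
// Compatibility normalisation (NFKC) applied to whole strings.
const wholeExpression = "x\u00B2"; // "x²"
const normalised = wholeExpression.normalize("NFKC");
// The superscript two becomes a plain digit, so the exponent is lost.

// FULLWIDTH LEFT PARENTHESIS (U+FF08) has a compatibility mapping to "(".
const fullwidthParen = "\uFF08".normalize("NFKC");

// MEDIUM LEFT PARENTHESIS ORNAMENT (U+2768) has no compatibility mapping,
// so normalisation leaves it unchanged.
const ornamentParen = "\u2768".normalize("NFKC");

console.log(normalised, fullwidthParen, ornamentParen === "\u2768");
```

So normalisation alone handles some bracket variants but not others, and it discards the superscript when applied before parsing.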

The Numbas JME tokeniser first matches a prefix of the string against a regular expression corresponding to a particular token type, and then might normalise just that substring. So we have a separate regex for superscripts, which will insert ^ and parentheses as needed, and apply the superscript replacements from this project.
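A rough sketch of that idea, not the actual Numbas code: match a run of superscript characters, then normalise only that run. The helper name rewriteSuperscripts is hypothetical, it only covers superscript digits, and it uses NFKD in place of the replacement data from this project.

```javascript
// Superscript digits 0 to 9, written as escapes to avoid encoding problems:
// U+2070, U+00B9, U+00B2, U+00B3, U+2074 to U+2079.
const SUPERSCRIPT_DIGITS = /[\u2070\u00B9\u00B2\u00B3\u2074-\u2079]+/g;

// Hypothetical helper: rewrite each run of superscript digits as "^(digits)".
// Only the matched run is normalised; the rest of the string is untouched.
function rewriteSuperscripts(expr) {
  return expr.replace(
    SUPERSCRIPT_DIGITS,
    (run) => "^(" + run.normalize("NFKD") + ")"
  );
}

console.log(rewriteSuperscripts("x\u00B2 + y\u00B9\u2070")); // x^(2) + y^(10)
```

Because the exponent is marked with ^ before the digits are normalised, the information that they were superscripts survives.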

... and now that I look at it, the list of parentheses is still hardcoded, but not in this project! The JME parser has the following strings, which are used in the regex to match punctuation tokens:

    /** Characters representing a left parenthesis.
     *
     * @type {Array.<string>}
     */
    left_parentheses: "(❨❪⟮﹙(﴾⦅⦅",

    /** Characters representing a right parenthesis.
     *
     * @type {Array.<string>}
     */
    right_parentheses: ")﹚)❩❫﴿⟯⦆⦆",

Once those are matched, they're normalised using NFKD. So you're right that some important information was missing here!

christianp commented 1 year ago

In 29b5e8a I've added a file brackets.json mapping the bracketing characters that don't already normalise under NFKD.

So you can detect bracket characters by looking for the Unicode general categories Ps (open punctuation) and Pe (close punctuation), normalise with NFKD, then apply the mapping from brackets.json if necessary.
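Those three steps could be sketched roughly as follows. This is only an illustration: the sketch tests for the general categories Ps (open punctuation) and Pe (close punctuation), EXTRA_BRACKETS is a hypothetical two-entry stand-in for brackets.json (whose real format and contents may differ), and normaliseBracket is a made-up name.

```javascript
// Hypothetical stand-in for brackets.json: characters that NFKD leaves
// unchanged, mapped to ASCII equivalents.
const EXTRA_BRACKETS = new Map([
  ["\u2768", "("], // MEDIUM LEFT PARENTHESIS ORNAMENT
  ["\u2769", ")"], // MEDIUM RIGHT PARENTHESIS ORNAMENT
]);

// A single character in general category Ps or Pe.
const BRACKET_LIKE = /^[\p{Ps}\p{Pe}]$/u;

// Hypothetical helper: normalise one character if it is a bracket.
function normaliseBracket(ch) {
  if (!BRACKET_LIKE.test(ch)) {
    return ch; // not a bracket: leave it alone
  }
  const decomposed = ch.normalize("NFKD");
  // Fall back to the extra mapping for characters NFKD does not change.
  return EXTRA_BRACKETS.get(decomposed) ?? decomposed;
}

console.log(normaliseBracket("\uFF08")); // fullwidth left parenthesis, handled by NFKD
console.log(normaliseBracket("\u2768")); // ornament parenthesis, handled by the mapping
```

Normalising one character at a time, after it has been identified as a bracket, avoids the whole-string problem discussed earlier in the thread.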