Closed sangwinc closed 1 year ago
In the "Punctuation characters" section of the notebook, I say that the various brackets and parentheses all seem to normalise to their ASCII equivalents, so you don't need to treat them specially. The table in that section only lists characters that don't normalise to anything in the "categories already understood by Numbas JME" defined at the top of the notebook (at the moment, that's '".{}?/\n:&;|^>=<-+*#!(),[]).
Ok, we're not getting them to normalise! I'm planning to add some from here: https://www.fileformat.info/info/unicode/category/Ps/list.htm
Is that because you can't do normalisation, or you just choose not to? I'm happier relying on a standard algorithm.
If it'd help, I can add a bit to the notebook to make it produce lists of characters that normalise to each of the ASCII ones in the list I gave above.
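As a rough idea of what that extra notebook cell might look like, here is a minimal sketch using only Python's standard `unicodedata` module. The function name and the exact target set are assumptions for illustration (the target set is copied from the list above, minus the `\n`), not what the notebook actually contains.

```python
import sys
import unicodedata
from collections import defaultdict

# ASCII punctuation the parser already understands. Copied from the list
# quoted above; the exact set is an assumption.
ASCII_TARGETS = set("'\".{}?/:&;|^>=<-+*#!(),[]")


def characters_normalising_to_ascii(targets=ASCII_TARGETS, form="NFKD"):
    """Map each ASCII target to the non-ASCII characters that normalise to it."""
    groups = defaultdict(list)
    for codepoint in range(0x80, sys.maxunicode + 1):
        char = chr(codepoint)
        normalised = unicodedata.normalize(form, char)
        # Only keep characters whose whole normal form is one target character.
        if normalised in targets:
            groups[normalised].append(char)
    return dict(groups)


groups = characters_normalising_to_ascii()
for target, sources in sorted(groups.items()):
    print(target, " ".join(f"U+{ord(c):04X}" for c in sources))
```

Characters that have no compatibility decomposition, such as the ornamental parentheses, will not show up in any of these lists, which is the gap a hand-written mapping would need to fill.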
Well, we've tried normalisation and it doesn't work. In particular https://www.php.net/manual/en/class.normalizer.php with Normalizer::FORM_KC transforms x² to x*2, and doesn't solve the brackets issue! The other options for PHP don't seem to work for me either.
I don't think you should normalise an entire expression string before parsing it: as you've shown, you lose important information like the superscriptness of ².
(To be clear, "x²" normalises to "x2", and I guess that you then insert the 'missing' asterisk as part of your existing filters)
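A quick way to check this outside PHP is Python's standard `unicodedata.normalize`, where the `NFKC` form corresponds to `Normalizer::FORM_KC`. This is only an illustrative check, not part of either project:

```python
import unicodedata

expression = "x\u00b2"  # "x²"

# Compare the four standard normalisation forms on the same input.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, unicodedata.normalize(form, expression))

# The compatibility forms replace the superscript two with a plain digit.
assert unicodedata.normalize("NFKC", expression) == "x2"
# The canonical forms leave it alone.
assert unicodedata.normalize("NFC", expression) == "x\u00b2"
# U+2768 MEDIUM LEFT PARENTHESIS ORNAMENT has no decomposition,
# so normalisation does not turn it into "(".
assert unicodedata.normalize("NFKC", "\u2768") == "\u2768"
```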
The Numbas JME tokeniser first matches a prefix of the string against a regular expression corresponding to a particular token type, and then might normalise just that substring. So we have a separate regex for superscripts, which will insert ^ and parentheses as needed, and apply the superscript replacements from this project.
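To make the "match a prefix, then normalise only that substring" idea concrete, here is a small sketch in Python. It is not the Numbas tokeniser: the token types, the patterns and the `tokenise` function are all hypothetical, and only superscript digits are handled.

```python
import re
import unicodedata

# Superscript digits and their plain equivalents (illustrative subset).
SUPERSCRIPT_DIGITS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")

# Hypothetical token types, tried in order; the first pattern that matches wins.
TOKEN_PATTERNS = [
    ("superscript", re.compile(r"[⁰¹²³⁴⁵⁶⁷⁸⁹]+")),
    ("number", re.compile(r"[0-9]+")),
    ("name", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("punctuation", re.compile(r"[-+*/^()]")),
]


def tokenise(expression):
    """Split an expression into (kind, text) pairs, normalising per token."""
    tokens = []
    position = 0
    while position < len(expression):
        if expression[position].isspace():
            position += 1
            continue
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(expression, position)
            if match:
                break
        else:
            raise ValueError(f"unexpected character at position {position}")
        text = match.group()
        if kind == "superscript":
            # Keep the meaning of the superscript by emitting ^( ... ).
            tokens.append(("punctuation", "^"))
            tokens.append(("punctuation", "("))
            tokens.append(("number", text.translate(SUPERSCRIPT_DIGITS)))
            tokens.append(("punctuation", ")"))
        else:
            # Normalise only the matched substring, never the whole input.
            tokens.append((kind, unicodedata.normalize("NFKD", text)))
        position = match.end()
    return tokens


print("".join(text for _, text in tokenise("x² + 1")))
```

Because the superscript is recognised before any normalisation happens, "x²" becomes "x^(2)" rather than "x2".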
... and now that I look at it, the list of parentheses is still hardcoded, but not in this project! The JME parser has the following strings, which are used in the regex to match punctuation tokens:
/** Characters representing a left parenthesis.
 *
 * @type {Array.<string>}
 */
left_parentheses: "(❨❪⟮﹙(﴾⦅⦅",
/** Characters representing a right parenthesis.
 *
 * @type {Array.<string>}
 */
right_parentheses: ")﹚)❩❫﴿⟯⦆⦆",
Once those are matched, they're normalised using NFKD. So you're right that some important information was missing here!
In 29b5e8a I've added a file brackets.json mapping the bracketing characters that don't already normalise under NFKD.
So you can detect bracket characters by looking for character classes Ps (open punctuation) and Pe (close punctuation), normalise with NFKD, then apply the mapping from brackets.json if necessary.
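Those three steps could be sketched in Python roughly as follows. The contents of brackets.json are not shown in this thread, so `EXTRA_BRACKETS` is a hypothetical two-entry stand-in, and the function names are made up for illustration. The check uses the general categories Ps and Pe, which is where Unicode puts opening and closing bracket characters.

```python
import unicodedata

# Hypothetical stand-in for brackets.json; the real file's contents and
# format are not shown here.
EXTRA_BRACKETS = {
    "\u2768": "(",  # MEDIUM LEFT PARENTHESIS ORNAMENT
    "\u2769": ")",  # MEDIUM RIGHT PARENTHESIS ORNAMENT
}

# Unicode general categories for opening and closing punctuation.
BRACKET_CATEGORIES = {"Ps", "Pe"}


def normalise_bracket(char, extra=EXTRA_BRACKETS):
    """Return the ASCII form of a bracket character, or the character unchanged."""
    if unicodedata.category(char) not in BRACKET_CATEGORIES:
        return char
    # Step 2: let the standard algorithm handle whatever it can.
    normalised = unicodedata.normalize("NFKD", char)
    # Step 3: fall back to the hand-written mapping for the rest.
    return extra.get(normalised, normalised)


def normalise_brackets(expression):
    """Normalise only the bracket characters in an expression."""
    return "".join(normalise_bracket(char) for char in expression)


print(normalise_brackets("\uff08x\u00b2\uff09"))  # fullwidth parentheses around x²
```

Only characters in the bracket categories are touched, so the superscript in "x²" survives. In practice the mapping would presumably be loaded from brackets.json with `json.load`, assuming the file is a flat object from character to replacement.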
I can't find where the various alternative parentheses are defined?