limefrogyank / siunitx-pcc

0 stars 0 forks source link

Issues with parseNumber() #12

Open dpvc opened 8 months ago

dpvc commented 8 months ago

It is not clear why you are using a new TeX parser here:

https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L243

You don't use any of its methods, and seem to just be using it to hold the parse string and the position in that string. But in that case, why not just use text as the string, and a local variable i as the position rather than instantiating an object whose properties you never use?

There are reasons to use a new TeX parser, however, but you aren't using them. For one, you can use subParser.GetNext() to get the next non-space character. Note that this gets you the full unicode character, even if it is a multi-character glyph in the input string. (You are dealing only with single characters, which means you can't handle Unicode characters outside of Plane 0. For example, many CJK ideograms are in higher Unicode planes, as are the Mathematical alphanumeric characters.)

So you could use something like

    while ((const char = subParser.GetNext())) {
      ...
    }

as the main loop. Then you won't need the lines

https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L223-L224

(You also don't need line 219, since that is done in the constructor for the TexParser.)

You should also use the subParser to handle the parsing of the control sequence names. As with the regular expression in generateNumberMapping(), the treatment of control sequences is incorrect, here. In particular,

https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L243

is not the only way that a control sequence can terminate. As I mentioned in the discussion of generateNumberMapping(), a control sequence consists of \ followed either by a run of one or more alphabetic characters (a-z and A-Z), or by a single non-alphabetic character (e.g., \1 or \*). You are allowing runs of any non-space, non-\ characters, which is incorrect.

There is already a method of subParser that handles control sequence names (and the removal of the optional space that follows them), namely subParser.GetCS(), which returns the control sequence name without the initial \ (so you would need to add that back in yourself).

In this way,

https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L239-L247

could become just

       const macro = '\\' + subParser.GetCS();

That way, you don't have to worry about the rules for control sequence names yourself, and if there are any fixes made to how MathJax parses them, you get them automatically instead of having to update your code as well.

I notice that in a lot of places, here and in other files, you use == and != where === and !== would be more efficient. You should always use the three-character versions when possible, like when comparing strings or numbers, and only use the double character ones when comparing things that really may need type conversion, like x == null when x could be undefined and you still want that to match.

The lines

https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L226-L237

and

https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L249-L260

are similar enough that you can collapse them into a single usage, as in

        let token;
        while ((token = subParser.GetNext())) {
                if (token === '\\') {
                        token += subParser.GetCS();
                }
                if (mapping.has(token)) {
                        const func = mapping.get(token);
                        if (typeof func == 'function') {
                               (func as CharNumFunction)(token, num);
                        } else {
                                if (num.whole === '' && num.decimal === '') {
                                       (func as Map<string, CharNumFunction>).get('inputSigns')?.(token, num);
                                } else {
                                        (func as Map<string, CharNumFunction>).get('inputUncertaintySigns')?.(token, num);
                                }
                        }
                }
        }

In fact, you could simplify this further by handling the uncertainty mapping differently. Instead of having the inputUncertaintySigns store a sub-map when the character already is mapped, you could have it store a CharNumFunction that does the else code from above. That is,

https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L186-L194

could be

                let parser: CharNumFunction = parseUncertaintySigns;
                if (parseMap.has(tempArray[0])) {
                        const inputSigns = parseMap.get(tempArray[0]) as CharNumFunction;
                        parser =  function (macro, num) {
                                 (num.whole === '' && num.decimal === '' ? inputSigns : parseUncertaintySigns)(macro, num);
                        }
                }
                parseMap.set(tempArray[0], parser);

so that you always have a CharNumFunction in parseMap, and then just use

        let token;
        while ((token = subParser.GetNext())) {
                if (token === '\\') {
                        token += subParser.GetCS();
                }
                mapping.get(token)?.(token, num);
        }

Of course, you then can change

https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L162-L163

to

export function generateNumberMapping(options: INumParseOptions): Map<string, CharNumFunction> {
        const parseMap = new Map<string, CharNumFunction>();

and avoid having to coerce the type of mapping.get() above.

One of the differences between your implementation and LaTeX's is that in LaTeX, an invalid number will throw an error, e.g., \num{12a} produces the error ! Package siunitx Error: Invalid number '12a'. Yours will silently produce 12. You may want to use mapping.get(char) above being null to produce such an error. (Similarly in your other parsers.)

Finally, the lines

https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L210-L214

can be simplified by chaining the replace() calls, as in

        text = text.replace('<<', '\\ll')
                  .replace('>>', '\\gg')
                  .replace('<=', '\\le')
                  .replace('>=', '\\ge')
                  .replace('+-', '\\pm');

so text doesn't have to be replaced each time.

limefrogyank commented 8 months ago

Thanks for your expertise here with characters. I had assumed javascript was high-level enough to abstract all the multibyte character stuff, but I guess not. This is giving me flashbacks of learning C++ and wchar_t for windows.

dpvc commented 8 months ago

Yes, it is a pain to have to worry about that. Strings in javascript are UTF-16 encoded, which means 16-bit characters. This is enough to handle the first Unicode plane (plane 0), but Plan 1 and above require two UTF-16 "characters". There are some features that split strings by Unicode code point, and String.fromCodePoint(), and things like that for handling Planes 1 and above, but you do have to know about them and use them, as indexing into a string is indexing into the UTF-16 array. It is not as easy as one would like.

limefrogyank commented 8 months ago

Note to self to doublecheck side-effects...