Open dpvc opened 8 months ago
Thanks for your expertise here with characters. I had assumed javascript was high-level enough to abstract all the multibyte character stuff, but I guess not. This is giving me flashbacks of learning C++ and wchar_t for windows.
Yes, it is a pain to have to worry about that. Strings in javascript are UTF-16 encoded, which means 16-bit characters. This is enough to handle the first Unicode plane (plane 0), but Plan 1 and above require two UTF-16 "characters". There are some features that split strings by Unicode code point, and String.fromCodePoint()
, and things like that for handling Planes 1 and above, but you do have to know about them and use them, as indexing into a string is indexing into the UTF-16 array. It is not as easy as one would like.
Note to self to doublecheck side-effects...
It is not clear why you are using a new TeX parser here:
https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L243
You don't use any of its methods, and seem to just be using it to hold the parse string and the position in that string. But in that case, why not just use
text
as the string, and a local variablei
as the position rather than instantiating an object whose properties you never use?There are reasons to use a new TeX parser, however, but you aren't using them. For one, you can use
subParser.GetNext()
to get the next non-space character. Note that this gets you the full unicode character, even if it is a multi-character glyph in the input string. (You are dealing only with single characters, which means you can't handle Unicode characters outside of Plane 0. For example, many CJK ideograms are in higher Unicode planes, as are the Mathematical alphanumeric characters.)So you could use something like
as the main loop. Then you won't need the lines
https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L223-L224
(You also don't need line 219, since that is done in the constructor for the
TexParser
.)You should also use the
subParser
to handle the parsing of the control sequence names. As with the regular expression ingenerateNumberMapping()
, the treatment of control sequences is incorrect, here. In particular,https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L243
is not the only way that a control sequence can terminate. As I mentioned in the discussion of
generateNumberMapping()
, a control sequence consists of\
followed either by a run of one or more alphabetic characters (a-z and A-Z), or by a single non-alphabetic character (e.g.,\1
or\*
). You are allowing runs of any non-space, non-\
characters, which is incorrect.There is already a method of
subParser
that handles control sequence names (and the removal of the optional space that follows them), namelysubParser.GetCS()
, which returns the control sequence name without the initial\
(so you would need to add that back in yourself).In this way,
https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L239-L247
could become just
That way, you don't have to worry about the rules for control sequence names yourself, and if there are any fixes made to how MathJax parses them, you get them automatically instead of having to update your code as well.
I notice that in a lot of places, here and in other files, you use
==
and!=
where===
and!==
would be more efficient. You should always use the three-character versions when possible, like when comparing strings or numbers, and only use the double character ones when comparing things that really may need type conversion, likex == null
whenx
could beundefined
and you still want that to match.The lines
https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L226-L237
and
https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L249-L260
are similar enough that you can collapse them into a single usage, as in
In fact, you could simplify this further by handling the uncertainty mapping differently. Instead of having the
inputUncertaintySigns
store a sub-map when the character already is mapped, you could have it store aCharNumFunction
that does theelse
code from above. That is,https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L186-L194
could be
so that you always have a
CharNumFunction
inparseMap
, and then just useOf course, you then can change
https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L162-L163
to
and avoid having to coerce the type of
mapping.get()
above.One of the differences between your implementation and LaTeX's is that in LaTeX, an invalid number will throw an error, e.g.,
\num{12a}
produces the error! Package siunitx Error: Invalid number '12a'.
Yours will silently produce 12. You may want to usemapping.get(char)
above beingnull
to produce such an error. (Similarly in your other parsers.)Finally, the lines
https://github.com/limefrogyank/siunitx-pcc/blob/51e6c1e4d8e9429ee7451ffe2176db45501f6dbf/src/numMethods.ts#L210-L214
can be simplified by chaining the
replace()
calls, as inso
text
doesn't have to be replaced each time.