rsmenon / pygments-mathematica

Mathematica/Wolfram Language lexer and highlighter for Pygments
MIT License

Base numbers #4

Open halirutan opened 6 years ago

halirutan commented 6 years ago

I haven't tested it, but I was looking over the code in order to implement the Rouge highlighter in a similar way. If I see this right, then this line is suspicious:

BASE_NUMBER = r'{integer}\s*\^\^\s*({real}|{integer})'.format(integer=INTEGER, real=REAL)
  1. Why is there whitespace allowed before and after ^^?
  2. Does it catch the case 2^^00110011*^+3?
  3. Does it work with characters like in 16^^abc?
rsmenon commented 6 years ago

@halirutan Sorry for the delayed response; I missed this report and only accidentally saw it now. To your questions:

  1. I think I allowed for whitespaces to have a "looser" matching based on the person's intent rather than strict language parsing rules because this pygments lexer is primarily used for print and web and not for IDEs. I know that I thought through this for a few cases, but I'm not sure if I was consistent in that everywhere else.
  2. I haven't tested it, but I'm fairly confident that it will not identify 2^^00110011*^+3 as a single BASE_NUMBER. I think it might identify 2^^00110011 as a BASE_NUMBER, *^+ as operators (3 separate ones, since *^ is not listed as a single entity) and 3 as an INTEGER.
  3. This is a good catch! I did not consider base 16 (or more generally till base 36).

So for 1 and 2, I'm not sure that the extra effort of handling these cases is worth it, given the context where this lexer will be used. If you're using the default colors, then they'll all show up as black just like in the notebook, so visually it shouldn't be different even though the underlying lexing is imperfect. But if you use different syntax colors for integers, operators and base numbers, then these could look jarring... not sure of the best way to solve this (and other issues like consistent lexical scope colors, etc.) without turning this into a full-blown parser :) If you have ideas/improvements for these, please feel free to send a PR!
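
The three answers above can be checked with a small Python sketch. Note that INTEGER and REAL are not shown in this thread, so the definitions below are assumed stand-ins rather than the lexer's actual patterns; matched_prefix is a hypothetical helper added only for illustration.

```python
import re

# Assumed definitions: INTEGER and REAL do not appear in this thread,
# so these are plausible stand-ins, not the lexer's real patterns.
INTEGER = r'[0-9]+'
REAL = r'(?:[0-9]+\.[0-9]*|\.[0-9]+)'

# The pattern quoted in the report.
BASE_NUMBER = r'{integer}\s*\^\^\s*({real}|{integer})'.format(integer=INTEGER, real=REAL)
base_number_re = re.compile(BASE_NUMBER)


def matched_prefix(text):
    """Return the leading part of text that BASE_NUMBER matches, or None."""
    m = base_number_re.match(text)
    return m.group(0) if m else None


print(matched_prefix('2 ^^ 101'))         # whitespace around ^^ is accepted
print(matched_prefix('2^^00110011*^+3'))  # only the part before *^ is matched
print(matched_prefix('16^^abc'))          # letters are not accepted as digits
```

Under these assumptions, the second call returns only 2^^00110011, which leaves *^+3 to be tokenized separately, and the third call returns None.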

halirutan commented 6 years ago

@rsmenon I'm working towards a set of regexes that can be used as a toolbox for any lexer and will give coherent behaviour. Regular expressions may differ in different languages, but they often have similar capabilities with slightly different syntax.

You might have seen this question of mine that shows how we can use Mathematica's own LetterQ to create a full-fledged regex that catches all valid characters even if symbols contain things like symα. I did the same now for numbers and will show the code at the end. About your points:

I think I allowed for whitespaces to have a "looser" matching based on the person's intent rather than strict language parsing rules

I understand that argument, but I don't agree here :) The thing is that no whitespace is allowed around ^^, and this plays into our hands because it makes correct lexing of valid numbers possible. Btw, I found a nice example to give your lexer the hiccups:

32^^Function
(* 548393634583 *)
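
As a quick cross-check outside Mathematica, the same value can be reproduced in Python. This is only a sketch: parse_base_integer is a hypothetical helper, and Python's int() is a stand-in that is more permissive than Mathematica's number syntax (it also tolerates signs, underscores and surrounding whitespace).

```python
def parse_base_integer(literal):
    """Evaluate an integer literal written as base^^digits.

    Hypothetical helper for illustration. int() accepts the digits
    0-9 and a-z (case-insensitive), which covers bases up to 36.
    """
    base_text, digits = literal.split('^^')
    return int(digits, int(base_text))


print(parse_base_integer('32^^Function'))  # 548393634583
print(parse_base_integer('16^^abc'))       # 2748
print(parse_base_integer('2^^00110011'))   # 51
```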

About point 2: yes, I agree that this is what happens. If someone shows numbers in a different colour (which I usually do), then it would be weird to see the parts of one number coloured differently.

Therefore, what I'm trying to do is to write down and test patterns for symbols and numbers inside Mathematica and then we have a reproducible way to create the regex from it. Consider this for numbers:

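(* decimal digits: 123, .123, 123.123 or 123. *)
number = {DigitCharacter .., "." ~~ DigitCharacter .., 
   DigitCharacter .. ~~ "." ~~ DigitCharacter ...};
(* digits after ^^; only hexadecimal characters are covered here *)
baseNumber = {HexadecimalCharacter .., "." ~~ HexadecimalCharacter ..,
    HexadecimalCharacter .. ~~ "." ~~ HexadecimalCharacter ..};
(* base prefix such as 16^^, without any whitespace *)
base = DigitCharacter .. ~~ "^^";
(* optional precision (`) or accuracy (``) mark, optionally followed by a number *)
precision = "`" ~~ RepeatedNull[RepeatedNull["`", 1] ~~ number, 1];
(* optional exponent such as *^+10 *)
scientific = "*^" ~~ RepeatedNull["+" | "-", 1] ~~ DigitCharacter ..;
final = {number, base ~~ baseNumber} ~~ RepeatedNull[precision, 1] ~~ 
   RepeatedNull[scientific, 1];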

StringMatchQ[#, final] & /@ {"123", ".123", "123.123", "16^^aa", 
  "16^^.aa", "16^^.aa``30*^+10", "32^^Function"}

You might complain that 32^^Function gives False, but I used HexadecimalCharacter because I have never seen code that uses numbers above base 16 :smiley:. Now we can test all the different kinds of numbers directly inside Mathematica and see if they match. Then we can call

StringPattern`PatternConvert[final]

and get one single regex for numbers. That way, no developer has to write down all the different cases by hand anymore. They only need to take care of converting the regex for their language if, e.g., certain character classes don't exist (like :xdigit:).
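
To give an idea of what such a port could look like, here is a hand translation of the pattern above into Python's re syntax. It is a sketch written by hand, not the output of StringPattern`PatternConvert, and all names in it are chosen for illustration; [0-9a-fA-F] replaces the :xdigit: class, which Python's re module does not provide.

```python
import re

DIGITS = r'[0-9]+'
HEX = r'[0-9a-fA-F]+'  # stand-in for HexadecimalCharacter ..

# Longer alternatives come first so that a lexer using match()
# takes the longest form instead of stopping at the integer part.
NUMBER = r'(?:{d}\.[0-9]*|\.{d}|{d})'.format(d=DIGITS)
BASE_DIGITS = r'(?:{h}\.{h}|\.{h}|{h})'.format(h=HEX)
BASE = DIGITS + r'\^\^'                       # no whitespace around ^^
PRECISION = r'`(?:`?{n})?'.format(n=NUMBER)   # ` or ``, optional number
SCIENTIFIC = r'\*\^[+-]?' + DIGITS            # e.g. *^+10

FINAL = r'(?:{base}{bd}|{n})(?:{p})?(?:{s})?'.format(
    base=BASE, bd=BASE_DIGITS, n=NUMBER, p=PRECISION, s=SCIENTIFIC)
NUMBER_RE = re.compile(FINAL)

samples = ['123', '.123', '123.123', '16^^aa', '16^^.aa',
           '16^^.aa``30*^+10', '32^^Function']
results = [NUMBER_RE.fullmatch(s) is not None for s in samples]
print(results)
```

With fullmatch, every sample except 32^^Function is accepted, which agrees with the False reported above for that input. Widening HEX to [0-9a-zA-Z] would admit digits up to base 36, at the cost of also accepting letters that are invalid for the given base.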

I will test this for the Rouge highlighter and, if I find the time, will probably try to fix it for Pygments too.