highlightjs / highlight.js

JavaScript syntax highlighter with language auto-detection and zero dependencies.
https://highlightjs.org/
BSD 3-Clause "New" or "Revised" License
23.67k stars 3.59k forks source link

Perl 5 regexpes overreact #57

Closed KamilaBorowska closed 12 years ago

KamilaBorowska commented 12 years ago

Following code will see regexp / $b ... $c /, when it's obviously divide-by operator. Only second line is needed, but I needed to end regexp, so it would be visible (otherwise, everything would be just black. This doesn't happen in JavaScript and other languages with similar regexp syntax (like Ruby).

my ($a, $b, $c, $d);
my $variable = $a / $b;
my $othervar = $c / $d;
isagalaev commented 12 years ago

I think I fixed it. Can you check your code with the latest master revision?

KamilaBorowska commented 12 years ago

Confirmed. Broken. This change any broken non-obvious regexpes, like following:

split //, 'abc';

It was working previously. I'm aware that Perl is extremally context-sensitive language, but now it's second extreme. I don't think that parsing this without extreme ammount of work is possible, but what about checking spaces around / character (usually you would put space before regexp, but not inside it :P).

isagalaev commented 12 years ago

It's not that extremely complex, it's a matter of carefully defining after which lexems a regexp is allowed. Right now I put only a pretty general set of symbols but as your example showed we need to add some keywords too, as javascript does. For Perl, I presume, those will be at least split and return. Can you tell after which other keywords one may expect a regexp?

KamilaBorowska commented 12 years ago

Anything with non-empty prototype, except that in this case it would be considered as boolean value of match at $_ (for example, print /abc/ is actually print($_ =~ /abc/).

But more seriously, I would also add grep to this list, as grepping by regexp sort of makes sense... But if you would want completeness, things like my @matches = reverse /\w+/g also make sense in Perl (list of all words in $_ in reverse order)... yeah... I'm aware it can be annoying... Perl is very ambiguous... and if you will include subroutine prototypes...

It's nearly impossible to highlight Perl correctly. The syntax highlighter used by Github (Pygments) is pretty nice, but totally fails at $a / $b / $c (but it fails at lot more, like s < > { } (but it's not code you should write), oh wait... your tool also fails at it... but I think it's topic for separate issue, in fact, I've checked code like this in your highlighter just to find this issue).

Also, I have noticed that this change also broken regexes after statement modifiers (all statement modifiers should be allowed) and other literal keywords like or. This also should be fixed.

Also: Just a small note, I think you forgot typeof in that list in JavaScript.

isagalaev commented 12 years ago

Since highlighter is not a full-blown translator, it allows more freedom in handling syntax. The point to which it really makes sense to try is, basically, highlighting that doesn't break in a spectacular way. For example, if you mistake division for a regexp you have the whole rest of the line highlighted wrong, and it's ugly. But failing to highlight the regexp in some rare case is OK. So I'd be happy with just putting all those keywords you mentioned in a regexp context starter. Or you have better ideas (I saw you forked the repository)?

totally fails at $a / $b / $c (but it fails at lot more, like s < > { } (but it's not code you should write), oh wait... your tool also fails at it...

Can you be more specific? It's better to file a new issue on that.

BTW we have a discussion group, there are more people than here :-).

isagalaev commented 12 years ago

Added the keywords you mentioned. I'm going to close the issue for now, we can fix any remaining cases as they appear in the wild. Thanks!