highlightjs / highlight.js

JavaScript syntax highlighter with language auto-detection and zero dependencies.
https://highlightjs.org/
BSD 3-Clause "New" or "Revised" License

Proposal: 0 relevancy by default #2826

Open joshgoebel opened 4 years ago

joshgoebel commented 4 years ago

The great relevancy cleanup:


Original issue:

Is your request related to a specific problem you're having?

Relevance (and hence auto-detect) is all over the map because every mode receives 1 relevance by default. But just because something should be highlighted (or parsed) does not mean it should count toward auto-detection. Proper auto-detect [currently] requires VERY careful curation of the modes at a high level (across many languages)... meaning if one language claims relevance for a specific syntactic structure, then any other language that ALSO includes that structure must claim relevance too... otherwise one language just wins by "default".

I.e., our recent support of operator is one example... as things stand now, operators can't be given default relevance. If we start by adding operators (with relevance) to just a few languages, then every snippet of code doing lots of math (operators) will always "win" for those languages, because they get points for operators whereas every other language (which presumably shares many of the same operators) gets no points at all.

This "balance" typically works with strings, comments, and such things because we provide MODE helpers for these that enforce relevance consistency across grammars.

The solution you'd prefer / feature you'd like to see added...

I'd like us to consider having modes receive 0 relevance by default, not 1, so that all relevance is opt-in rather than opt-out. Grammars should try very hard to claim relevancy only for things that are truly relevant. This would result in more thought being put into relevancy and remove a lot of the relevance:0 dance we currently have to do.

Note I'm talking specifically about modes here; keywords would still retain their 1 by default (see thoughts on keywords below). Right now it's too easy to accidentally add relevance with a complex ruleset. One quick example: beginKeywords. Because this is BOTH a mode and a keywords key, any keyword matched with beginKeywords now counts DOUBLE. I've been fixing this on a one-off basis, but even if we make no other changes here, that likely should change. It's just one example of how easy it is to "accidentally" add relevance.

Suddenly any language trying to more nicely parse something like function blah() (which in one form or another is common in MANY languages) gets double points, whereas all the languages without an explicit rule get only single points. Unfair.
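To make the double scoring concrete (an illustrative rule, not from a real grammar):

```js
// With 1-by-default mode relevance, matching `function` here scores
// 2 points: once for the keyword itself, once for the mode it begins.
const FUNCTION_DEFINITION = {
  beginKeywords: "function", // keyword match: +1
  end: /\{/,
  relevance: 0, // cancels the mode's own +1 (today's one-off fix)
};
```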

It's much harder to accidentally add relevance like this with keywords because of how explicit keyword relevance is.

Any alternative solutions you considered...

I've often wondered if there should be a cap on how many times a rule can score relevancy. OK, so we see int, which is a keyword in language X; that tells us something, for sure. But if we see int 1000 times, does that really mean the code is 1000x more likely to be X?

Perhaps we shouldn't be looking at overall relevance scores but rather at how "widely" the scores are spread (this would require research, I think)... i.e., it should matter a lot more that your code includes 100 different keywords from X than that it includes a single keyword 100 times in a row (which really might not be X at all).

It's possible these two approaches would pair well together, too. Keywords "naturally" balance to a degree because every language has a list (it's usually the one thing even very simple grammars get right)... so if both Basic and Pascal have "for", then neither gets an advantage even if the code contains for 1000 times.
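A rough sketch of what capped, breadth-based keyword scoring might look like (hypothetical helper, not how hljs scores today):

```js
// Score each distinct keyword once (i.e., a cap of 1), so 1000 `for`s
// count no more than one, but 100 *different* keywords count 100.
function breadthScore(matchedKeywords, relevanceOf = () => 1) {
  const seen = new Set(matchedKeywords);
  let score = 0;
  for (const kw of seen) score += relevanceOf(kw);
  return score;
}

// "for" repeated 1000 times scores 1 for Basic and 1 for Pascal:
breadthScore(Array(1000).fill("for")); // => 1
```

Counting each distinct keyword once is the cap idea taken to its limit; a softer cap (say, counting each keyword up to 5 times) would sit somewhere between this and the current behavior.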

joshgoebel commented 3 years ago

Perhaps a defaultRelevancy setting at the language level would make more sense?
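Sketching the idea (defaultRelevancy here is hypothetical, not an existing grammar key):

```js
export default function (hljs) {
  return {
    name: "Example",
    // hypothetical key: modes in this grammar would default to 0
    // relevance and have to opt in explicitly
    defaultRelevancy: 0,
    contains: [/* ... */],
  };
}
```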

egor-rogov commented 3 years ago

TF-IDF comes to mind. The technique lets you find documents (from a corpus of documents) that are similar to a given document (the pattern), ordered by relevance. This quite resembles our auto-detection problem.

Suppose we have a corpus of programs in different languages. TF-IDF says a term's importance is higher when the term is rare across the corpus as a whole but frequent within this particular language.
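For reference, the standard TF-IDF weight looks roughly like this (function and parameter names are hypothetical):

```js
// tf: how often the term appears in this sample;
// idf: how rare the term is across the corpus of languages.
// Terms rare across the corpus get a large idf, common ones near 0.
function tfidf(termCount, sampleLength, languagesWithTerm, languageCount) {
  const tf = termCount / sampleLength;
  const idf = Math.log(languageCount / languagesWithTerm);
  return tf * idf;
}

// A keyword found in only 1 of 190 grammars vastly outweighs one
// found in 150 of them:
tfidf(5, 1000, 1, 190); // ≈ 0.026
tfidf(5, 1000, 150, 190); // ≈ 0.0012
```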

It makes me think that the idea of not counting term frequency is probably not so good. And obviously rare terms (keywords etc.) should have higher importance (relevance).

Wouldn't it be fun to make autocalculation of weights for autodetection?.. (:

joshgoebel commented 3 years ago

Well, that's a bit broader topic... and either way I'd think the "default" relevance would still come into play somehow, i.e. a 10 would mean something different than a 1, even if it only becomes some sort of multiplier.

Wouldn't it be fun to make autocalculation of weights for autodetection?.. (:

Yes, but first we need a larger sample set of data (multiple sets would be best) so that when making big-picture changes we could actually confirm they were a step in the right direction.

I've had two thoughts on this:

  1. Alter the relevance of keywords based on their frequency across grammars (as you're suggesting)... a "boogaloogy" keyword found in only a single grammar would have MUCH more relevance than, say, "for"... one objection has been that this requires compiling all the grammars upfront, but I suppose that's only true for auto-highlight... and we can't highlight without compiling them anyway... so perhaps this is entirely doable.

  2. Score based on width, not depth. I.e., keep a hash of which keywords (and rules) match... if Ruby matches 20 different rules but Nim matches only 3 rules (10 times each), then Ruby should win, even though Nim's raw score (of 30) is higher. This would prevent runaway rules or single keywords from entirely breaking auto-detect when the keyword in question was actually just an identifier in the analyzed code.

I'm not sure these two would have to be mutually exclusive, but I think the latter would help solve a lot of the false auto-detects I've seen caused by a single runaway rule matching over and over and over.
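A sketch of that kind of comparison (hypothetical shape; this isn't how hljs tracks scores today):

```js
// Track WHICH rules matched, not just how many points they earned.
// Ruby: 20 distinct rules; Nim: 3 rules hit 10 times each (raw 30).
const ruby = { raw: 20, matchedRules: new Set([...Array(20).keys()]) };
const nim = { raw: 30, matchedRules: new Set([0, 1, 2]) };

function pickWinner(a, b) {
  // Prefer breadth; fall back to raw relevance on ties.
  if (a.matchedRules.size !== b.matchedRules.size) {
    return a.matchedRules.size > b.matchedRules.size ? a : b;
  }
  return a.raw >= b.raw ? a : b;
}

pickWinner(ruby, nim) === ruby; // true, despite Nim's higher raw score
```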

joshgoebel commented 2 years ago

I'm bringing this idea back, but switching to a tiny default relevancy for regular rules (like 0.05)... just enough to make them count a bit... with the preference being that all REAL relevance going forward needs to be "opt-in", and should typically be based only on keywords (that are implemented as modes) or very strong syntactic elements.
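In compiler terms, the change would amount to something like this (a sketch of the intent, not the actual compileMode code):

```js
// During mode compilation: undefined relevance becomes a tiny epsilon
// rather than 1, so only explicit `relevance:` claims carry real weight.
const DEFAULT_MODE_RELEVANCE = 0.05;

function compiledRelevance(mode) {
  return mode.relevance !== undefined ? mode.relevance : DEFAULT_MODE_RELEVANCE;
}
```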