rtfpessoa / diff2html

Pretty diff to html javascript library (diff2html)
https://diff2html.xyz
MIT License
2.87k stars 278 forks source link

Auto-detect language per-line is guaranteed to produce poor results #392

Open joshgoebel opened 3 years ago

joshgoebel commented 3 years ago

Hey, current maintainer of Highlight.js here. This just came to my attention via #391.

You're looping over lines and then calling highlightAuto on every line (when you don't have a known language). This is not recommended and guaranteed to produce poor results. Auto-detect is not intended to be useful with such little data and the noise will often (as reported in #391) be much higher than the signal - you're just as likely to get random languages than anything useful. There will be color, but often all wrong.

If you do wish to use auto-detect you should pass us the ENTIRE document (or at the very least all the available lines from the document/diff), then look at the language we determine it to be, then use that language for every single line.

You'll have to take this approach with version 11 anyways since you'll have to do the highlighting in a single pass (rather than per-line). So calling highlightAuto upfront for all available lines and letting it use the greater amount of content available for it's auto-detection... then splitting that result back out into the individual lines you need - already highlighted for you.

You'll have to do it twice of source, once each for the before and after streams.

rtfpessoa commented 3 years ago

Hi @joshgoebel,

Thanks for bringing this up.

I was under the assumption that the code was fine, and I even updated it based on your suggestion https://github.com/highlightjs/highlight.js/issues/2277#issuecomment-813084812 leading to https://github.com/rtfpessoa/diff2html/commit/2416b3d2b3f51237c5bff614faf359435117eb70

If I understand correctly you are saying that the only option to correctly highlight unknown languages would be to pass as much contents as possible and then split then line by line?

Does this also apply to code I already know the language?

The HighlightAuto usage is just a best effort try, since I am assuming the rest is doing a good job.

joshgoebel commented 3 years ago

If I understand correctly you are saying that the only option to correctly highlight unknown languages would be to pass as much contents as possible and then split then line by line?

Well, "correctly highlight" might be overstating it. The way IMHO to "correctly attempt" it is to give us all the content and let us decide what it is... that will at least have consistent results, even if not 100% correct. And it's not the "only" way... you could do a "pre-highlight" (of all the lines) first to get the language... then do your line by line approach... but again v11 will require a different implementation, so I don't think it's even worth pursuing that avenue.

Does this also apply to code I already know the language?

I opened this issue solely because of the random behavior with auto-detect when it's used line by line. But yes, with v11 how you are doing this will have to be rewritten because the engine does not expose it's internal parsing state to the outside anymore.

The HighlightAuto usage is just a best effort try, since I am assuming the rest is doing a good job.

But if the results are random and haphazard I'd suggest it'd be better to default to nothing.

rtfpessoa commented 3 years ago

@joshgoebel I believe I am not understanding exactly what you mean. So maybe if I try to reduce the problem space it will help me understand. For now, let's ignore the fact that we might not know the language for some files, consider that I am already targeting version 11 and that I can never get all the code in the files since diffs by default have a limited context.

I would like to understand what would be the best way possible to do the highlight.

If I invoke highlight.js for each line individually like this hljs.highlight(codeString, { language, ignoreIllegals: true }) is it going to be a problem in v11? If yes, what would be the alternative? Would I have to pass all the lines I have for it to work?

iHiD commented 3 years ago

(@rtfpessoa This whole thread might be useful reading: https://meta.stackexchange.com/q/355852/188348)

rtfpessoa commented 3 years ago

(@rtfpessoa This whole thread might be useful reading: https://meta.stackexchange.com/q/355852/188348)

@iHiD thanks for the reference. Will definitely read it.

joshgoebel commented 3 years ago

If I invoke highlight.js for each line individually like this hljs.highlight(codeString, { language, ignoreIllegals: true }) is it going to be a problem in v11?

That's not really what you want to do as it will break on any scopes that persist past the end of a line boundary. What you'd really want to do:

You'd really need to do this for each section of a diff (if they are non-sequential). So if a diff included 3 discrete changes, ~10 lines each then you'd be grouping each of those 3 changes into blocks and then highlighting all 3 blocks. Then splitting them apart again to get at the individual highlighted lines.