highlightjs / highlight.js

JavaScript syntax highlighter with language auto-detection and zero dependencies.
https://highlightjs.org/
BSD 3-Clause "New" or "Revised" License
23.31k stars 3.52k forks source link

[chore] Moves names of languages into data of grammar itself #2394

Closed joshgoebel closed 4 years ago

joshgoebel commented 4 years ago

Before, from the header:

/*
Language: Java
Author: Vsevolod Solovyov <vsevolod.solovyov@gmail.com>
...
*/

Note, we're not changing the headers, just copying the data into the grammar definitions.

After (in the grammar definition itself):

return {
    name: 'Java',
    aliases: ['jsp'],
    keywords: KEYWORDS,
    illegal: /<\/|#/,

I think it makes sense for name to be the very first key. It rarely changes and might make resolving merge conflicts easier sometimes.

egor-rogov commented 4 years ago

And how about DRY? (: I'm not opposing language name in the grammar, but I'm not happy with the duplication. Can we remove the name from the header then and make our tools to look for a different pattern?

joshgoebel commented 4 years ago

Well, technically it's just JS... so a pattern (technically) isn't good enough... you'd have to execute the JS live and then pull out the name. I almost did this with external_language in the new build pipeline (to see find out if the module exported definer or not) but then decided I didn't love the security implications and instead went with a regex to do the same thing (it's seem "good enough" for now).

And the comment header is also VERY much tied into the website/hidden part of the build process that we have no visibility to or way to change, so I don't think we're getting rid of it anytime soon.

I do hear your concerns about DRY though, just not sure how to resolve it... you could have the build process add the names dynamically, but I'm not sure what I think about that. It's kind of weird since the grammars return a function... you'd end up with auto-registration code that looks something like:

var temp = function(hljs) { ... };
temp.name = [name from the build process];
hljs.registerLanguage(temp);

So I think that means now we'd also then need a closure just to prevent polluting the top-level namespace with our temp var? Although I think ES2018 syntax could solve that also...

And then it's "magic" where the name is coming from - people who just look for it in the language grammar won't find it... vs it being explicit if we just add it plainly to each file.

joshgoebel commented 4 years ago

Actually that doesn't work, since it's a function... (you could still set the attribute but now you'd have to add special code to the library to look for the name on the function object itself, not the executed value of the function). All of it leaves me with a bad taste in my mouth.

taufik-nurrohman commented 4 years ago

Now you'd have to add special code to the library to look for the name on the function object itself.

How about adding hljs.listNames() method to the core as alternative to hljs.listLanguages() ?

egor-rogov commented 4 years ago

I don't like the idea of that dynamic stuff in the build system, too. Can you remind me what are the use cases for names in the grammar? Other than just "it makes sense"?

joshgoebel commented 4 years ago

How about adding hljs.listNames() method to the core as alternative to hljs.listLanguages()

That isn't what I was talking about. We're talking about how to get the names into the grammars in the first place - not how to get them out - that's the easy part. After they are there I'm not convinced we need an API just to get the names... I'd probably rather just have an API that returned an array of grammars and then someone could call map if they wanted names or any other meta-data.

Can you remind me what are the use cases for names in the grammar? Other than just "it makes sense"?

Client side use. In my service (for example) I'd like not only to know which grammar matched but I'd like to know the name of the language so that I can display it in my UI - without having to build my own mapping table. HLJS is already encoding the data, we should just encode it in a way that it can also be accessed as run-time.

This isn't the highest priority but it seems such a useful thing and kind of silly not to have it.

joshgoebel commented 4 years ago

I don't like the idea of that dynamic stuff in the build system, too.

It really wasn't about dynamic vs static itself but rather more of a security concern about running random 3rd party code that might be installed as a 3rd party language grammar. I don't think I (as a user) would assume that the build system is going to RUN that code as part of building it, so it would defy expectations, and I typically try not to do that.

egor-rogov commented 4 years ago

Yes, and another concern is over-complication of a simple matter. It looks that duplication is the lesser evil.

egor-rogov commented 4 years ago

Another thought. In the header there are sometimes rather detailed names, such as "ArcGIS Arcade", "Microsoft Axapta (now Dynamics 365)", "PostgreSQL and PL/pgSQL" etc. That's probably fine for the list of available languages we have in README, 'cause it's easier to find the desired language this way. But what do you want to return to the application? These long names or just "Arcade", "Axapta", "PostgreSQL"? So maybe it's not a duplication after all.

joshgoebel commented 4 years ago

I had that thought also. I did make a small attempt the other day to clean up the names a bit. It was tiny so I just committed it directly.

https://github.com/highlightjs/highlight.js/commit/6c20b9bcf73114a40f9e42b7c961e208387a7d1c

It's harder than you think because ideally the goal would be "short, if recognizable"... ie I changed "SQL (Structured Query Language)" to "SQL"... but a lot of that is going to depend on someone's exact background.

Perhaps we need shortName and longName. ;-)

joshgoebel commented 4 years ago

I'm not convinced this has to be done all at once either, so I might start asking name as I improve languages in the future and we'd default to the "label" until every language has a name. Are you opposed to that?

egor-rogov commented 4 years ago

Perhaps we need shortName and longName

The more I think about it, the more I'm convinced that the header should contain lang names and the grammar should contain short names. The former is for searching for a language, the latter is to identify a language.

It's harder than you think because ideally the goal would be "short, if recognizable"... ie I changed "SQL (Structured Query Language)" to "SQL"... but a lot of that is going to depend on someone's exact background.

I can hardly imagine someone who wants to find out if we support SQL by searching for "Structured Query Language" rather than "SQL". (Same is true for BNF, which we have as "Augmented Backus-Naur Form"...).

To sum up: we agreed about usefulness of language names in grammar, so let it be. I'm okay with the idea to get there gradually.

joshgoebel commented 4 years ago

The more I think about it, the more I'm convinced that the header should contain lang names and the grammar should contain short names.

Well, the header names are used on the download form... which is ugly if the names are long, which is why I shortened some of them. :-)

I can hardly imagine someone who wants to find out if we support SQL by searching for "Structured Query Language" rather than "SQL". (Same is true for BNF, which we have as "Augmented Backus-Naur Form"...).

See, I 100% agree with SQL, but I don't have the same name recognition at all for *BNF so I left them as-is. I know what they are for, but off the top of my head on a rainy day I couldn't expand the acronym for you.

I'm okay with the idea to get there gradually.

Awesome.