Policy: Discuss extras repository for additional languages

joshgoebel commented 5 years ago

I see we have a lot of requests for new languages in the PRs... if the issue is time/maintenance over time, etc... might we perhaps consider an "extra" repository or something with community/unsupported syntaxes? That way the criteria to be approved could be lessened a little and obscure languages that might not really make sense in Highlight.js proper could still have a home?

Or is the idea that eventually we'll get to them?

joshgoebel commented 5 years ago

Are we still moving languages into repositories? Is that ALL the languages, or only "additional" ones?

marcoscaceres commented 5 years ago

Only new ones because then we can make submitters the repo maintainers.

joshgoebel commented 5 years ago

So what is the policy on new (or very old) PRs already in the queue? Is there some cutoff date, or is the answer now pretty much always "separate repo"?

marcoscaceres commented 5 years ago

Yep, separate repo.

jf990 commented 5 years ago

The separate language repos concept addresses a maintenance problem that has caused consternation here since @isagalaev slowed down the regular maintenance. I think the separate repo idea is going to address that but we need to discuss the many problems it introduces. We should decide how to handle these things because they change the way this library has been managed since its inception.

Discovery. Available languages and how to use them are documented on highlightjs.org. By putting new languages in separate repos we are effectively invalidating this entire page. No longer could we have a single source of truth as to what languages are available and how to get them. Even worse, old languages that were part of the original highlight.js will be handled and documented differently than the new ones. This will make the library difficult to use, and documentation will be all over the place as different language authors choose their own way of doing things without the oversight of the maintainers. We should figure out and document some minimum standards for language contributions. (To do this right) we should even move all the existing languages into this new format so that all languages are handled the same, making use of the library consistent. We should also come up with a metadata file that each language repo is responsible for maintaining so that we can automate some aspects of the discovery, testing, detection, and packaging.
Auto-detection. I'm not sure we are going to easily have language auto-detection work as well as it does now. With the separate repos there's no easy way to run the test against all the other languages and determine if relevance is working.
Testing. Current testing methodology is well defined and there is a process to handle it. This method is no longer valid and individual repos are on their own. So far with the new languages definitions we have not established any requirement for testing, and we see that some do and some don't.
Packaging and deployment. We probably can no longer support a page like Getting highlight.js, or not for the new languages. This may or may not be a real problem for some as packagers and modern workflows may reduce this requirement, but some developers are still loading a custom package in a script tag and would be required to completely change their method if they wanted to include an individual language. It would also be not so nice to offer that page and not include the newer languages, so we have some thinking to do there.
Quality. Since we are now deferring individual language maintenance to the individual authors (or those interested in contributing) and not the central highlightjs maintainers, going forward the quality and maintenance of languages will be indeterminate at best. Our current process at least has the maintainers as the gatekeepers making sure we don't include poorly implemented, undocumented, or buggy language contributions. But the new way will remove that oversight. For a community driven project I guess this is the expectation but in the past this repo has been very well maintained, and I think that is part of the reason it gained such traction and popularity.

Most of this is solvable with new documentation on the language contributors page and some new development such as a language registry and supporting process scripts like test and build to support that. But it's all a lot of work.
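To make the metadata-file and registry idea a bit more concrete, here is a minimal sketch of a manifest a language repo might maintain, plus a check that tooling could run before listing the language. Every field name (`name`, `aliases`, `main`, `tests`, `maintainers`) and the function `validateManifest` are assumptions for illustration only; nothing like this has been agreed on in this thread.

```javascript
// Hypothetical manifest a language repo could maintain.
// All field names here are assumptions, not an agreed format.
const exampleManifest = {
  name: "pancake",
  aliases: ["pc"],
  main: "src/pancake.js",
  tests: "test/",
  maintainers: ["someone"],
};

// Returns a list of problems; an empty list means the manifest looks usable.
function validateManifest(manifest) {
  const problems = [];
  const requiredStrings = ["name", "main", "tests"];
  for (const field of requiredStrings) {
    if (typeof manifest[field] !== "string" || manifest[field].length === 0) {
      problems.push(`missing or empty "${field}"`);
    }
  }
  if (!Array.isArray(manifest.maintainers) || manifest.maintainers.length === 0) {
    problems.push("at least one maintainer is required");
  }
  if (manifest.aliases !== undefined && !Array.isArray(manifest.aliases)) {
    problems.push('"aliases" must be an array when present');
  }
  return problems;
}
```

A registry script could call `validateManifest` on each repo's file and refuse to list any language that reports problems.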

I still haven't been able to figure out how to update the documentation. Have we documented how stuff in docs gets built and deployed to highlightjs.org and highlightjs.readthedocs.io?

BTW, really nice work @yyyc514 going through all those issues 🚀

joshgoebel commented 5 years ago

Auto-detection. I'm not sure we are going to easily have language auto-detection work as well as it does now. With the separate repos there's no easy way to run the test against all the other languages and determine if relevance is working.

Well, the "test it" part is just a tooling problem but I think logically this will prove impossible with the infinite range of possibilities as languages grow and grow, yes.

Testing

I think to be "semi-official" or included in some type of global list that there should be some minimal amount of specs that a syntax is required to pass.

But obviously if someone just wants to rip a language down from somewhere and use it then we can't stop them. We do have power to decide what gets hosted at highlightjs org though, so that's something.

It would also be not so nice to offer that page and not include the newer languages, so we have some thinking to do there.

Well, I wasn't going to say it publicly, but I guess now I will. I can't speak for everyone but I can't imagine this ban on new languages in core is ABSOLUTE. It's to prevent a proliferation of 100 tiny languages stagnating that no one has time or inclination to maintain (or to babysit PRs, tests, QA, etc). If the next Swift comes around and 50% of the world is writing code in it, I imagine we'd consider adding it to core and someone would make time to maintain it.

Although if we figure out this whole "separate repo" thing perhaps eventually none of the languages will be in core... but seems that's a bit far away at the moment.

Our current process at least has the maintainers as the gatekeepers making sure we don't include poorly implemented, undocumented, or buggy language contributions. But the new way will remove that oversight.

Yes, this is for sure a concern, and it's why I think there should be some gateway between "one of the core contributors has read this, or agreed it passes 'reasonable' specs" and "who knows, I just found this lying around somewhere". It would be very bad if someone installs a shiny new "Pancakes 1.0" syntax that locks up their website and blames it on Highlight.js instead.

I still haven't been able to figure out how to update the documentation. Have we documented how stuff in docs gets built and deployed to highlightjs.org and highlightjs.readthedocs.io?

No idea. Someone who knows how needs to find time to write up something ROUGH... and then we need to find people who have the time and inclination to help keep docs updated. They could iterate on the rough docs and push them forward.

BTW, really nice work @yyyc514 going through all those issues 🚀

Thanks.

jf990 commented 5 years ago

@yyyc514 I had offered to pitch in on the docs in some other issue here, and in particular to help rewrite the contribute a new language guide. I had worked on redoing the languages I help maintain as separate repos to work out the way to do it.

I set up a GitHub template project; we can review this effort and see if it's on the right track. It sets up a template project with a unit test to get started with a new language.

https://github.com/jf990/highlightjs-language-template

joshgoebel commented 5 years ago

@jf990 See my thoughts on this thread regarding auto-detection:

https://github.com/highlightjs/highlight.js/issues/1213

I actually think there are some pretty great ideas (or the beginnings of ideas there). In this context instead of saying "Yes, all 10,000 languages from all maintainers have NO conflicts!" We'd run the tests and then say:

Hey, you, maintainer of "Pancake 1". Your language is 95% "sticky", it thinks almost any language is Pancake... our recommended threshold is Y... you need to tune your relevancy scores to prevent false positives against other languages.

Need a better word than sticky, lol.

And then we'd have a metric we could use for including in a master list, including on the "main website" for build packs, etc...
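A rough sketch of how such a "stickiness" number could be computed, assuming the tooling already produces one record per test sample with the sample's real language and whatever auto-detection guessed. The record shape (`actual`, `detected`) and both function names are hypothetical, and the metric is just one plausible definition: the share of other languages' samples that a grammar wrongly claims.

```javascript
// Each result records which language a sample really is and what auto-detection guessed.
// Returns, per language, the fraction of *other* languages' samples it wrongly claimed.
function falsePositiveRates(results) {
  const languages = new Set();
  for (const r of results) {
    languages.add(r.actual);
    languages.add(r.detected);
  }
  const rates = {};
  for (const lang of languages) {
    const foreign = results.filter((r) => r.actual !== lang);
    const claimed = foreign.filter((r) => r.detected === lang);
    rates[lang] = foreign.length === 0 ? 0 : claimed.length / foreign.length;
  }
  return rates;
}

// Languages whose rate exceeds the threshold would get the "tune your relevance" nudge.
function languagesOverThreshold(results, threshold) {
  const rates = falsePositiveRates(results);
  return Object.keys(rates).filter((lang) => rates[lang] > threshold).sort();
}
```

With a metric like this, the master list could simply require a rate under the recommended threshold instead of requiring zero conflicts across every grammar.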

jaredlll08 commented 5 years ago

So I have been looking at a few of the issues over the past few days, and feel I can maybe provide some outsider input.

Before I start, I want to make it clear, I am not a proper JavaScript Developer, I primarily work in other back-end languages, so my approach may be naive or going against best practices, but I do believe it will solve most issues.

The approach is very similar to what GitHub does with their Linguist library, and that is using Git Submodules.

I have made a proof of concept for everything I am going to say below, which can be found here. Like I said above, I'm not that well versed in JS build tools (and it is currently 5AM), so I did a dirty hack to get it to work. In reality it will need to be all or nothing, with all languages following the same format (being in a folder with the language name, for example), but having a standard format for languages is good in my opinion.

Things I changed in the proof of concept:

Languages are now loaded from a folder, so src/languages/$LANGUAGE_NAME/*.js
Reading snippets uses the name of the folder the language is in (so $LANGUAGE_NAME) instead of the js file (could be changed, I did it because my file was called index.js, you could enforce a proper name when making the repo)
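The folder-based layout described above could be handled roughly like this. It is only a sketch: `groupLanguageFiles` is a made-up name, and the function works on a plain list of paths rather than the real build script, which is not shown in this thread.

```javascript
// Given file paths under src/languages/, map each language folder name to its .js files.
// The folder name (not the file name) is treated as the language name.
function groupLanguageFiles(paths) {
  const byLanguage = new Map();
  for (const p of paths) {
    const match = /^src\/languages\/([^/]+)\/([^/]+\.js)$/.exec(p);
    if (!match) continue; // ignore anything outside the expected layout
    const [, languageName, fileName] = match;
    if (!byLanguage.has(languageName)) byLanguage.set(languageName, []);
    byLanguage.get(languageName).push(fileName);
  }
  return byLanguage;
}
```

Because the language name comes from the folder, a grammar file called index.js still ends up registered under its folder's name.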

Right now, based on issues I have read on this repo, if someone wants a new language, they make an issue / PR on this repo, then a @highlightjs member creates a repo for their language, and adds the original author of the issue / PR to that repo for their language to live in.

As @jf990 said, while it helps with maintenance, it comes with a few drawbacks, I believe Git Submodules can fix most of those, like so:

1) Discovery

This is the whole reason why I am even commenting here: I made support for a language, and now trying to get products to use it is an uphill battle. Most developers feel that I should try to get my language supported in HighlightJS itself, which right now isn't an option.

So the solution: since manual work is already required when someone wants to add language support (making a repo for them and adding them as a collaborator), it shouldn't be an issue to make a commit to this repo adding that newly created repo as a submodule, like I do in this commit of the PoC.

This will only add the submodule to the src/languages folder, so I do not have a solution for files in the test folder, I can look more into it if this solution is something that will actually be considered and would work for this project.

So with the 3rd party languages now in the languages folder, they are treated exactly the same as a "first party" language, so they are built when running node tools/build and show up on the build/demo/index.html, and I believe tests are still run on them; I don't see a reason why they wouldn't be.

2) Auto-detection

See above about third party languages being treated as first party, so this would no longer be an issue.

3) Testing

Like I said above, I don't see why tests wouldn't run on these third party languages, besides the issue of getting other files into the test directory.

4) Packaging and deployment

Once again this is all handled because the files are physically there when the commands are run. I'm not sure how the Getting highlight.js page is generated, but if it is generated based on the src/languages folder, it should be trivial to move to the new src/languages/$LANGUAGE_NAME format.

The one thing that would need doing, or at least would be a quality of life feature, would be having the build script update all the submodules to their latest commit, but this can be done manually in the case of broken commits on submodules, or just forcing a submodule to use a specific commit.

5) Quality

I have no solution for this, the only thing that comes to mind is implementing a set of requirements and guidelines for new languages, for example, Github Linguist (example), they require the language to be used in "hundreds of repositories", for them, being Github, that makes sense, since the PR would affect those repositories, so for HighlightJS, it gets a bit tricky, you could use the same metric as Github, and that way you could help ensure that:

1) A language is used (I don't know how highlightJS members feel about this, the docs do say any and all languages are allowed).

2) There are open source developers using the language (if the maintainer of the highlightJS language support disappears, it is possible a new maintainer who has experience with the language could step up).

Or highlightJS could work on a different metric, I will admit that this doesn't seem like an easy issue to solve.

As for the documentation, I have used ReadTheDocs in the past and have some experience with it, so I am happy to help figure out how it works, and from there document any of the changes I listed above (if implemented) to help ensure that everyone knows what the new protocol is.

I hope this all makes sense, I am happy to go into more detail on git submodules if need be, or even brainstorm a different solution (possibly having the build script traverse the highlightJS organization and pull the languages from there, that way there are no submodules).

joshgoebel commented 5 years ago

From another thread (I'm replying to @egor-rogov) (https://github.com/highlightjs/highlight.js/pull/1829):

Egor: This language is already in the core. Why should we move it away?

Well, already in or not is a very weird (meaningless?) metric (IMHO) to decide whether that's where they BELONG. I thought the whole idea of not letting more languages in [to core] had to do with developer time/maintenance/responsibility/who is in the best position to maintain the language long-term, etc... so surely the right way to think about existing languages ALSO is how they fare on those exact same metrics...

This "already in core" vs "sorry, you just missed the cut-off!" is a VERY weird and arbitrary line.

joshgoebel commented 5 years ago

Not a fan of git submodules, though I haven't worked with them in years. It's possible they have improved. Back then all I heard was whining about how annoying they were.

I do see the advantage of "just works" (other than for tests, which you didn't go into in great detail)... but I don't think the paths are the hard part...

Having a languages.toml or languages.json file that anyone could contribute to and a smart build tool (doesn't have to be that smart) could accomplish the same thing.

Our build pipeline is crazy old and needs replacing anyways - so keeping it "as-is" isn't a priority.

joshgoebel commented 5 years ago

Another possible suggestion:

Make your own repo following a template with your own tests etc.
Add your grammar to a "blessed" languages.json in the master repo, make a PR
If it looks even semi-reasonable, we merge. (the goal being discoverability, not policing)
Fix tooling so ./tools/build -t node javascript cpp some_weird_3rd_party "just works"
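The steps above might look something like the following in the build tooling. This is a sketch under assumptions: the registry shape (a name mapped to a `repo` URL), the example.com URL, and `resolveBuildTargets` are all hypothetical, and the core list is a tiny stand-in for the real languages folder.

```javascript
// Hypothetical "blessed" registry: language name -> where its grammar lives.
const registry = {
  some_weird_3rd_party: { repo: "https://example.com/some_weird_3rd_party.git" },
};

// Names of grammars that ship in the main repo (a tiny stand-in list).
const coreLanguages = new Set(["javascript", "cpp"]);

// Resolve requested names to build sources, failing loudly on unknown names.
function resolveBuildTargets(requested, core, thirdParty) {
  return requested.map((name) => {
    if (core.has(name)) {
      return { name, source: "core", location: `src/languages/${name}.js` };
    }
    if (Object.prototype.hasOwnProperty.call(thirdParty, name)) {
      return { name, source: "registry", location: thirdParty[name].repo };
    }
    throw new Error(`Unknown language: ${name}`);
  });
}
```

The build script would then fetch anything marked `registry` before building, so core and third-party names can be mixed freely on the command line.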

I'm also a fan of a shared language repository. highlightjs-grammars. I think that has a LOT of advantages that aren't being considered yet... like higher visibility and more likelihood that the community will pitch in - one consolidated place for grammar issues, etc... I think the fear is that no one will "own it" and issues will go unanswered, etc...

egor-rogov commented 5 years ago

The idea about submodules is very interesting. I use them (in another project, not on GitHub) and didn't find them annoying or anything. The huge advantage I see is no distinction between "in core or not" languages. I think we can turn the test directory into test subdirectories for each language, so that repo owners have full access to all relevant contents. It requires some more thought and experimenting, of course.

egor-rogov commented 5 years ago

I'm also a fan of a shared language repository.

What do you mean by this, @yyyc514?

egor-rogov commented 5 years ago

This "already in core" vs "sorry, you just missed the cut-off!" is a VERY weird and arbitrary line.

Absolutely! I'd really like to remove this barrier.

joshgoebel commented 5 years ago

Just another SHARED repository... so you have "core" then you have "extras" (which has a bunch of languages)... and we'd "police" core more carefully than "extras"... (if we plan to keep a distinction at all long-term).

The idea about submodules is very interesting.

Not opposed to trying if you've had good experiences. How do they work when the submodule just drops off the planet? or someone disappears from GitHub and takes their work with them? Easy to fix?

I think we can turn the test directory into test subdirectories for each language, so that repo owners have full access to all relevant contents.

Yeah, I think the tests would move into the languages and then "running the full suite" would have to be taught to look there if you truly wanted to run EVERYTHING.

joshgoebel commented 5 years ago

Submodules + tests is going to require fixing those ~~silly~~ annoying relevancy tests. ;-) Or would we simply not run them for submodule languages? I need to go back to giving that a little more thought.

It's bad enough with 184 languages; it'd be even worse with 250...

jaredlll08 commented 5 years ago

How do they work when the submodule just drops off the planet? or someone disappears from GitHub and takes their work with them? Easy to fix?

Funnily enough, when I was making my PR to linguist, this exact thing happened: someone deleted a repo that was being used as a submodule. Thankfully they were active and got GitHub support to restore it, but that is really not ideal.

I think a requirement for the submodules should be that they need to be under the highlightJS organization, that way only a team member can delete the repo and no one (in theory) can take their work with them.

joshgoebel commented 5 years ago

Oh so it works poorly? LOL.

jaredlll08 commented 5 years ago

If that is what you want to take from what I said, sure, they work poorly when you have a submodule of someone else's repository, and they decide to delete the repository.

Which is what I addressed in the second paragraph.

If the repositories are under the @highlightjs organization (like this repository https://github.com/highlightjs/highlightjs-robots-txt for example, or any of the other repositories that have been made for third-party language support), then the only people who can actually delete that repository, are people in the @highlightjs organization, so in this case, it should be fine.

joshgoebel commented 5 years ago

Yeah, I followed that. It just sounded a lot better when the only thing we had to do was accept PRs to "link" them... rather than host them all as well. :-)

jaredlll08 commented 5 years ago

Well you are currently hosting them, so it isn't much of a difference.

I myself wouldn't be too worried about people deleting repositories and taking their code, Linguist has over 300 submodules and if it happened that often, I'm sure they would have found a different solution.

If it does happen however, all that would need doing is to just remove the submodule, which would remove the language support, but this would most likely only happen on more third party languages, as you said:

if the next Swift comes around and 50% of the world is writing code in it, I imagine we'd consider adding it to core and someone would make time to maintain it.

So the "common" languages would probably have "official" highlightJS support, and you would only need to worry about the more "uncommon" language repositories getting deleted.

So if a submodule was deleted, there are a few things that could be done:

1) Depending on the license, rehost the support. As long as someone has the language on their computer (which would be pulled when pulling from this repo), they could upload the language to a new repository and have highlightJS pull from that repository. This only works if the license permits this though.

2) Write support for that language. If it is a must have language that somehow got deleted, a community member could write a new support library for it (this is the least ideal solution in my opinion)

3) Remove the language. Just make it clear in the changelog why the language was removed, a simple: "Language X was removed because the author of the package, Y, deleted the repository holding it". You are then shifting blame onto user Y, and if people were actually using the language, a new maintainer could step up and write a new support library.

joshgoebel commented 5 years ago

Does using submodules become an issue when people want duplicates or have differing opinion on core style choices? How do we handle that? Someone has a PHP grammar that is MUCH better than ours (but perhaps it's too colorful, or it's too large, ours is more "minimal", etc)... do we just give it a different name and then let people build it by name?

IE, php-super?

I'd imagined some way that such things could "grow organically" over time then one day when it turns out everyone prefers php-super perhaps it would become php default, etc.

jaredlll08 commented 5 years ago

In a scenario like that, whoever made php-super would probably have already run into the issue of highlightJS using the php name, so I would imagine they would have used a different name already.

I really do think this scenario is a bit out of scope for this issue. The way I would deal with something like this, would be to add variants to languages, so instead of the class name being like: hljs php it would be something like hljs php super.

or better yet, add support for language replacement (not sure if this is already a thing), so the person who made php-super would register their language like:

hljs.replaceLanguage("php", hljsPHPSuper);

So instead of registering it as a new language, they register it as a replacement, and highlightJS will use their grammar instead.
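To illustrate the replacement idea, here is a sketch against a tiny stand-in registry rather than highlight.js itself. `replaceLanguage` is the hypothetical method from the line above and, as far as this thread shows, not an existing highlight.js API; `registerLanguage` and `getLanguage` only mirror the names of the real methods. The choice to throw when the name is not registered yet is my own assumption.

```javascript
// Minimal stand-in for a language registry; this is NOT highlight.js itself.
function createRegistry() {
  const languages = new Map();
  return {
    registerLanguage(name, definition) {
      languages.set(name, definition);
    },
    // Hypothetical API from the comment above: only succeeds if the name already exists.
    replaceLanguage(name, definition) {
      if (!languages.has(name)) {
        throw new Error(`Cannot replace unregistered language: ${name}`);
      }
      languages.set(name, definition);
    },
    getLanguage(name) {
      return languages.get(name);
    },
  };
}
```

Keeping replacement separate from registration means a typo in the name fails loudly instead of silently adding a second, unused language.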

In my opinion, grammars for already existing "core" languages would be denied, and would be opt in, if someone really wanted php-super they could pull the package in from NPM / CDN.

If in a few months everyone is using php-super, then maybe host a poll on what people want, if they want php-super to be native, or if they want the current php style, and go from there.

joshgoebel commented 5 years ago

For sure I'd support replaceLanguage (or re-register)... I was just thinking of what an 'open' ecosystem might look like. We have some opinions (that we aren't even consistent about) that would seem to be holding some grammars back... I'd like to see it even easier than what you describe for someone to decide they like a different flavor of PHP say, and just checkout that repository, build, and then they are using a custom package...

If in a few months everyone is using php-super, then maybe host a poll on what people want, if they want php-super to be native, or if they want the current php style, and go from there.

Sure, I guess I was just imagining it might happen more organically than that... let people vote with what they build - but then I'm not sure how many people build this themselves vs just use a packaged version...

Obviously if they're just using the default set then they're getting what core wants them to have in any case - and would have to plug in things on top of that.

jaredlll08 commented 5 years ago

The problem I see with voting with what they build, is how can you actually track that? Unless there is analytics code built into the build process, I don't see a feasible way to actually know what people are using.

What I do know though, is that a good number of people are using "what the core wants them to have", since they are just pulling it from NPM (based on weekly downloads) and not building it themselves.

The whole reason I am here is because I wanted Discord to support my language, but after speaking with people, the general consensus is that I should try and get my language into highlightJS itself, since they don't want to have to add another library, they just want to pull highlightJS and ideally have the language without any extra hassle.

I know I may have some bias, but I honestly think that the language situation should be sorted out in general, before worrying about someone making a different flavour of PHP, the people who care about having a different flavour of PHP, are most likely the same people who would be willing to build the package themselves to get that flavour.

Since right now new languages aren't being used, and to echo what was said in this comment

Creating a GitHub repo on your own that no one is ever going to find doesn't feel much like "contributing" to the project... It feels little bit like we're telling them "frack off, we don't really care you and your style contribution".

that applies to anything in this project, not only a style, a language grammar as well.

joshgoebel commented 5 years ago

but I honestly think that the language situation should be sorted out in general, before worrying about someone making a different flavour of PHP

I take your point but I see them as one and the same issue really. :-) It's all about how we answer the "can I contribute?" question. That could be interesting too if we had submodules since I imagine criteria for adding things to submodules would be entirely different than that of adding things to core... so the distinction between which is core and submodule would matter in some areas.

the people who care about having a different flavour of PHP, are most likely the same people who would be willing to build the package themselves to get that flavour.

Well, depending how easy it was to build you might be right... but if it's hard then I think they might want it very badly but don't have the time, energy, knowledge, etc...

jaredlll08 commented 5 years ago

Well, depending how easy it was to build you might be right... but if it's hard then I think they might want it very badly but don't have the time, energy, knowledge, etc...

Well that would be up to whoever implements themes, but if a different language variant was getting popular, I'm sure someone would build and host the package with the new grammar for others to use (possibly even the person who made the original grammar would do the hosting for that).

Could we please get back on to the topic though, right now I don't think you should be worrying about people adding support for a language that already has support, because right now you aren't even accepting ANY new languages, if they have existing support or not.

joshgoebel commented 5 years ago

Could we please get back on to the topic though...

We never left it... :-) I believe the true question is still "I have a grammar to share, how do I make it easy for Highlight.js users to find it and use it?" That might be a new grammar (including one that's really still in "beta" or barely ready), or it might be a change to an existing grammar.

I wonder if there are any implications for packaging (npm, etc) with submodules? Is that going to make it harder for people to also ship their language as an npm package or anything weird I wonder?

jaredlll08 commented 5 years ago

I wonder if there are any implications for packaging (npm, etc) with submodules? Is that going to make it harder for people to also ship their language as an npm package or anything weird I wonder?

No implications, in terms of publishing highlightJS to NPM, NPM won't even know that the submodules exist, it doesn't change anything about the build.

Once built, there will be no way to tell if a language was a submodule language vs "core" language (unless something is explicitly done to differentiate them).

As for making it harder for people to also ship their language, so firstly they wouldn't need to ship their language to NPM if the language was just included in the core package.

But for the sake of argument, let's say someone still wants to ship. Nothing should cause issues: you can submodule any repository you want without that repository having to do anything (or even know it is being submoduled).

joshgoebel commented 5 years ago

As for making it harder for people to also ship their language, so firstly they wouldn't need to ship their language to NPM if the language was just included in the core package.

Some maintainers have very different ideas about what "core" means (as in a monolithic JS build)... perhaps that's 50 languages, perhaps it's 5... do we host EVERY single community language on our CDN? I'm not sure that's a given.

firstly they wouldn't need to ship their language to NPM if the language was just included in the core package.

People might want to use packages with npm and the larger node ecosystem. Highlight.js can be (and definitely is) used outside of the web browser... so that affects how they might be packaged... and that affects how easily we might build or not build them. Though I suppose if the grammar maintainer didn't care about that they wouldn't support it (or wouldn't make it easy). So the question is really "should it be hard or should we encourage that?"

Unless you're just suggesting the npm package of highlight.js include "everything and the kitchen sink"... and if so I don't know if there are any potential cons to that... I do know the JS ecosystem prefers their 1000 tiny packages though...

joshgoebel commented 5 years ago

Right now our build system actually makes it HARDER for you to distribute on npm (or package your code with ES6 modules, etc) because we require the grammars to be in a very particular format so the build script can build them...

But I see you saw that, hence your remove uneeded code to make this work commit.

jaredlll08 commented 5 years ago

Right now our build system actually makes it HARDER for you to distribute on npm (or package your code with ES6 modules, etc) because we require the grammars to be in a very particular format so the build script can build them...

I did run into that when I submodule'd my language, but I wasn't sure if it was my language or highlightJS that required the change, was hoping it would get brought up if / when submodules are being implemented.

Some maintainers have very different ideas about what "core" means (as in a monolithic JS build)... perhaps that's 50 languages, perhaps it's 5... do we host EVERY single community language on our CDN? I'm not sure that's a given.

Unless you're just suggesting the npm package of highlight.js include "everything and the kitchen sink"... and if so I don't know if there are any potential cons to that...

I addressed this above:

Github Linguist (example), they require the language to be used in "hundreds of repositories", for them, being Github, that makes sense, since the PR would affect those repositories, so for HighlightJS, it gets a bit tricky, you could use the same metric as Github, and that way you could help ensure that

I don't believe supporting every single language that gets a PR made for it is the right move, I think a metric should be used to determine if a language should be supported, just so you aren't left with a bunch of "weekend hobby languages" being distributed in the "Core" package.

As for the

I do know the JS ecosystem prefers their 1000 tiny packages though...

The issue with that is that as it stands, websites aren't using the third-party languages, hastebin for example, which uses highlightJS doesn't support any of the languages on the highlightJS GitHub organization, for them having all the languages would be beneficial, so I doubt they would decide not to use a language, it is probably more that they don't know that the language support even exists.

Some maintainers have very different ideas about what "core" means (as in a monolithic JS build)... perhaps that's 50 languages, perhaps it's 5... do we host EVERY single community language on our CDN? I'm not sure that's a given.

Your docs currently state that you support every language that has support written for it, so according to them, you should host all the languages (like I said above, I don't fully agree with this, I'm just giving some perspective as an outsider of the project who may be reading the docs wanting to add support for their language and expect their support to be added).

joshgoebel commented 5 years ago

I don't believe supporting every single language that gets a PR made for it is the right move, I think a metric should be used to determine if a language should be supported, just so you aren't left with a bunch of "weekend hobby languages" being distributed in the "Core" package.

Well, that's hard. :-) And part of the problem now for sure. :-) I guess I personally imagine a "blessed set" that the maintainers decide on for a "default" build. Regardless of how many languages are in the languages folder or our policy on merging submodules (if we go that route).

The issue with that is that, as it stands, websites aren't using the third-party languages. hastebin, for example, which uses highlightJS, doesn't support any of the languages on the highlightJS GitHub organization

Someone should tell them. :-) "Add them all to one huge package" isn't the ONLY solution to this problem of course. :-)

I may run a paste site again in the future - and it would support whichever languages I decided - regardless of where they were... but I do take your point here - "it's easier if they are just built into the CDN"....

Your docs currently state that you support every language that has support written for it,

And those docs are obviously out of date now and there is an issue to update them. :-)

Darkhax commented 5 years ago

From someone who is using highlightJS in my projects, I just want to have generalized support for a wide variety of languages. My use cases and many of the use cases I see are for generalized things like blogs, forums, paste sites, and chat clients. I would like to just drop in one dep and support as many languages as I can rather than going through and adding hundreds of them manually or having people flood the support email with language requests.

I can totally see the use case for some projects where they would want to use specific flavors, but those seem like special case scenarios to me. If my project specifically needs super fancy PHP highlighting, adding in a single secondary module is not a big deal. Whereas the alternative of managing hundreds of additional deps and running them all through the internal review process sounds like a nightmare.

joshgoebel commented 5 years ago

I would like to just drop in one dep and support as many languages as I can rather than going through and adding hundreds of them manually or having people flood the support email with language requests.

Sure, but if you throw in everything we're already 1mb of Javascript, and that's not counting the 3rd parties just waiting to be added, so you gotta draw the line somewhere. :-)

jaredlll08 commented 5 years ago

Sure, but if you throw in everything we're already 1mb of Javascript, and that's not counting the 3rd parties just waiting to be added, so you gotta draw the line somewhere. :-)

I don't think there is a realistic fix for the file size without having to remove some languages (and possibly moving them to their own package). Saying that, it is a bit unfair on new languages to be denied because of existing languages that probably warrant their own 3rd party support based on size alone.

As it stands, the current average file size across all the languages is 5.72kb, but that doesn't mean every language is that size, or even close. Take ISBL (which I had to look up, and I'm still not finding much on what it actually is; all I have found is that, according to Wikipedia, it is used for an obsolete DBMS): that file alone is 106kb. Removing that language took the final build down from 726kb to 657kb, reducing the final build by about 10%.

There are 3 other languages that are all more than 10x the average language size (mathematica (95kb), 1c (64kb) and gml (59kb)). I'm not sure how widely used those languages are, but removing them as well takes the final build down to 461kb, reducing the final build size by roughly 36% relative to the full 726kb build.
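The arithmetic behind those figures can be checked with a short sketch. The sizes are the ones quoted above; the function and variable names are only illustrative and are not part of highlight.js.

```javascript
// Build sizes (kb) quoted in the comment above.
const fullBuildKb = 726;
const withoutIsblKb = 657;
const withoutFourLargestKb = 461;

// Percentage by which a build shrinks, relative to its original size.
function percentReduction(beforeKb, afterKb) {
  return ((beforeKb - afterKb) / beforeKb) * 100;
}

// Per-language sizes (kb) and the average size, also as quoted above.
const averageKb = 5.72;
const largeLanguages = { isbl: 106, mathematica: 95, "1c": 64, gml: 59 };

// Names of the languages whose size exceeds `factor` times the average.
function oversized(sizes, average, factor) {
  return Object.keys(sizes).filter((name) => sizes[name] > average * factor);
}

console.log(percentReduction(fullBuildKb, withoutIsblKb).toFixed(1)); // 9.5
console.log(percentReduction(fullBuildKb, withoutFourLargestKb).toFixed(1)); // 36.5
console.log(oversized(largeLanguages, averageKb, 10).length); // 4
```

All four of the named languages clear the 10x threshold (57.2kb), so a size-based rule like this would flag exactly the files discussed here.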

I am NOT suggesting to just outright remove the languages; that would be unfair on the people who submitted the languages, and on those who are currently using them with highlightJS. I am just saying that if a line is going to be drawn, it should affect both new languages and the current ones.

The size of the JavaScript for the language could be part of the metric used to determine if a language should be included or not, obviously with some exceptions such as popularity (as you said above, if the next swift comes out and everyone starts using it, then obviously an exception can be made for the language if the JavaScript is a bit on the heavier side).

joshgoebel commented 5 years ago

Well currently the "default" distributed CDN build is 40kb (I think that means gzipped?)... and includes:

src//languages/shell.js
src//languages/bash.js
src//languages/apache.js
src//languages/perl.js
src//languages/cpp.js
src//languages/ruby.js
src//languages/xml.js
src//languages/java.js
src//languages/objectivec.js
src//languages/properties.js
src//languages/javascript.js
src//languages/php.js
src//languages/ini.js
src//languages/sql.js
src//languages/json.js
src//languages/nginx.js
src//languages/cs.js
src//languages/css.js
src//languages/yaml.js
src//languages/coffeescript.js
src//languages/diff.js
src//languages/makefile.js
src//languages/python.js
src//languages/http.js
src//languages/markdown.js

I'm suggesting we modernize it a bit, but I like the idea of staying on the smaller side for the default. Later when we have a build process that can produce more than 1 output file I'd say have small, medium and large builds, plus custom. :-)

And I'd use popularity as determined by whatever seems the best metric for all the builds... MOST popular goes into small, popular into medium, kind of popular into large, etc... but that's just my 0.02. :-)
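One way the small/medium/large idea could look, as a rough sketch only: the popularity scores and cutoffs below are invented for illustration, and the assumption that each larger build contains the smaller one is mine, not something decided in this thread.

```javascript
// Hypothetical popularity scores (0-100); the numbers are made up.
const popularity = { javascript: 95, python: 93, ruby: 70, nginx: 55, gml: 12 };

// Hypothetical cutoffs, one per build size.
const tiers = [
  { name: "small", minScore: 90 },
  { name: "medium", minScore: 60 },
  { name: "large", minScore: 30 },
];

// Each build gets every language at or above its cutoff, so the builds nest:
// everything in small is also in medium, and so on. Languages below every
// cutoff are left to custom builds.
function planBuilds(scores, tierList) {
  const plan = {};
  for (const tier of tierList) {
    plan[tier.name] = Object.keys(scores)
      .filter((lang) => scores[lang] >= tier.minScore)
      .sort();
  }
  return plan;
}

console.log(planBuilds(popularity, tiers));
```

Whatever metric ends up producing the scores, keeping the cutoffs in one table like this would make the policy easy to read and to argue about.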

jaredlll08 commented 5 years ago

I think for the CDN build (that should only be used by browsers), having the most popular languages is good, keeps it light and simple, and if someone needs more languages they could use the Getting Highlight.js page to get a custom build.

I am more coming from the perspective of the Node build however, which as that same page says:

The package with all supported languages is installable from NPM:

Where the file size may be a bit less of an issue as you then have the option to register only languages that you need, as shown on the How to use highlight.js page:

The default import imports all languages! Therefore it is likely to be more efficient to import only the library and the languages you need:
import hljs from 'highlight.js/lib/highlight';
import javascript from 'highlight.js/lib/languages/javascript';
hljs.registerLanguage('javascript', javascript);

joshgoebel commented 5 years ago

The package with all supported languages is installable from NPM:

Well no problem with that, you can require them all or whatever you need. Node is an entirely different beast. I think mostly about the web.

Just me playing with web distributable:

-rw-r--r--  1 jgoebel  staff  130977 Oct 14 19:05 highlight.medium.pack.js
-rw-r--r--  1 jgoebel  staff   71161 Oct 14 19:05 highlight.pack.js

Tons of headroom.

jaredlll08 commented 5 years ago

Well no problem with that, you can require them all or whatever you need

Then is there even an issue with accepting any and all languages (provided relevancy tests pass)? The CDN build won't change (unless those languages are added to the CDN build), the only thing that would be affected would be the NPM build, which is less of an issue as people can add what they need.

(off topic, how exactly are you building the cdn js? When I try to build the cdn it gives the same results as the browser build, which has all the languages, so I must be doing something wrong)

joshgoebel commented 5 years ago

Then is there even an issue with accepting any and all languages (provided relevancy tests pass)?

Relevancy tests? Like whether anyone else cares? Even that's a tough one - since I'd say some of the requests right now are for languages no one else cares about other than the submitter. :-)

I also don't think that's a good reason just in and of itself though either... since npm i can fetch 3rd party languages from 100 different places just as fast. If we're just talking about Node, npm and yarn already solve the "install it easily and use it" problem.

Then is there even an issue with accepting any and all languages

The problem is maintenance and time. Every new language creates new issues, requires maintenance (many of these languages aren't static), and there seem to be few with time or inclination to fix them. That's why the previous author/maintainer is no longer around, they simply got burnt out and couldn't keep up. So I think a Highlight.js project that EXISTS at all that's a bit more closed is better than an entirely open one where a robot just clicks "merge" but nothing works right and you're just on your own and it just slowly dies.

Could this whole matter have somehow been handled better? Probably, but water under the bridge... welcome to open-source and people volunteering their time for free. :-)

Lots of people want to add languages, few people seem to want to be responsible for maintaining them over time. It's a tough nut to crack. Writing grammars is a hard thing too... so that plays a part for sure.

off topic, how exactly are you building the cdn js?

It's squirreled away on a build server somewhere... but see my bash script in the PR:

https://github.com/highlightjs/highlight.js/pull/2204

You have to list out the languages you want by hand.

jaredlll08 commented 5 years ago

Relevancy tests? Like whether anyone else cares?

Uh, the relevancy tests that use this value, the ones that if there is a detection error cause this:

1) hljs.highlightAuto()
       should be detected as flix:

      AssertionError: expected 'groovy' to be 'flix'
      actual    expected

      groovy    flix

Those relevancy tests.
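For anyone following along, the kind of check those tests perform can be sketched roughly like this. It is not the project's actual test code: the helper name and the sample snippets are placeholders, and a stand-in detector is used so the sketch runs without the library. The only thing taken from highlight.js is that highlightAuto(code) returns an object with a language property.

```javascript
// Runs auto-detection over sample snippets and reports every mismatch.
// `hljs` is any object exposing highlightAuto(code) -> { language };
// `samples` maps the expected language name to a sample of its code.
function findDetectionFailures(hljs, samples) {
  const failures = [];
  for (const [expected, code] of Object.entries(samples)) {
    const actual = hljs.highlightAuto(code).language;
    if (actual !== expected) {
      failures.push({ expected, actual });
    }
  }
  return failures;
}

// Stand-in detector for demonstration only: it always answers "groovy",
// mimicking the misdetection in the failure shown above.
const fakeHljs = { highlightAuto: () => ({ language: "groovy" }) };

// Placeholder sample text; real tests would use real source files.
const placeholderSamples = { flix: "sample flix code", groovy: "sample groovy code" };

console.log(findDetectionFailures(fakeHljs, placeholderSamples));
```

Every new grammar can shift the scores of existing ones, which is why one added language can make an unrelated language's detection test fail.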

Like whether anyone else cares? Even that's a tough one - since I'd say some of the requests right now are for languages no one else cares about other than the submitter. :-)

Not strictly true, right now I think it is fair to say that you have a request from me for my language, which a good amount of people care about (at least in the open source space), there are a ton more in the closed source space.

I also don't think that's a good reason just in and of itself though either... since npm i can fetch 3rd party languages from 100 different places just as fast. If we're just talking about Node, npm and yarn already solve the "install it easily and use it" problem.

This just brings us back to the discovery issue, and the security risk (link possibly satire) of just adding a bunch of NPM packages to a project.

The problem is maintenance and time. Every new language creates new issues, requires maintenance (many of these languages aren't static), and there seem to be few with time or inclination to fix them.

Then remove the language, if no one is willing to fix the language, that is a good indicator that no one is using it or cares enough to fix it (The people who absolutely need the language, but don't know how to fix it, could simply use an old version of highlightjs; it isn't ideal, but neither is having to carry around a language that no one uses).

Also on this, with the submodule proposal, removing the language would be as simple as deleting 3 lines from a file, the repository will still exist, with all the issues to it, and if someone made a PR to fix the language, adding it back is as simple as 3 lines.
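To make the "3 lines" concrete: assuming they refer to a submodule's stanza in .gitmodules (a header line, a path, and a url), dropping a language amounts to removing that stanza. The sketch below only edits the text of such a file; the paths and URLs are placeholders, and an actual removal would normally be done through git itself rather than by hand.

```javascript
// A .gitmodules file with two entries; each entry is three lines.
// The paths and URLs are placeholders, not real repositories.
const gitmodules = [
  '[submodule "extra/lang-a"]',
  "\tpath = extra/lang-a",
  "\turl = https://example.com/lang-a.git",
  '[submodule "extra/lang-b"]',
  "\tpath = extra/lang-b",
  "\turl = https://example.com/lang-b.git",
].join("\n");

// Drops the stanza for `name`: its header line plus the lines under it,
// up to the next section header.
function removeSubmodule(text, name) {
  const kept = [];
  let skipping = false;
  for (const line of text.split("\n")) {
    if (line.startsWith("[")) {
      skipping = line === `[submodule "${name}"]`;
    }
    if (!skipping) kept.push(line);
  }
  return kept.join("\n");
}

console.log(removeSubmodule(gitmodules, "extra/lang-a"));
```

Adding the language back is the reverse: the same three lines return, and the language's own repository and issue history are untouched in between.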

So I think a Highlight.js project that EXISTS at all that's a bit more closed is better than an entirely open one where a robot just clicks "merge" but nothing works right and you're just on your own and it just slowly dies.

Well right now, to an outsider the project seems completely closed, not just "a bit more closed".

As for the robot, firstly I think that is a horrible idea; secondly, I think you missed the part about the tests being run on the languages (which I will admit is not perfect at all; this issue is a great example of robots causing failure by doing their jobs).

My comment about accepting any and all languages is coming from the perspective that you have a package that is labeled as having all the languages and a package that ships specific languages. I would hope by now that it would be clear that I am against just accepting any PR and letting the fire grow. I am more interested in helping the project get to a state where it can start accepting new languages / styles.

welcome to open-source and people volunteering their time for free. :-)

I am well aware of Open-Source, and volunteering time. I did volunteer my time to come up with a possible solution to most if not all of the issues listed above by @jf990, and am currently contributing time by replying to this issue, trying to help this project get to a better state where discussions like these shouldn't need to happen.

Lots of people want to add languages, few people seem to want to be responsible for maintaining them over time. It's a tough nut to crack. Writing grammars is a hard thing too... so that plays a part for sure.

That is understandable, you never know someone's intentions when making a PR and I don't think there is a proper way to ensure that anyone making a PR would even be around the next day to fix an issue that arose.

That doesn't necessarily mean that this project has to suffer because of them, which is why I feel that the languages that cause issues should be removed. If anyone complains, link them to @isagalaev's blog post about this being an Open Source project, and that if they expect the people working for free to do free maintenance and support for something they use, then they have misplaced expectations.

joshgoebel commented 5 years ago

Those relevancy tests.

Oh, THOSE. :-) Those need to change. It's impossible to deal with 180 languages and keep it all glued together - the more we add the worse it'll get. I don't have an answer yet but I think it might mean many languages just don't auto-detect and perhaps we scope that feature back down to really popular languages... at least as default/distributed...

So say we'd ship 30 that work well instead of 250 that barely work and constantly break. I have ideas that I haven't fleshed out yet. If a language was super important or super distinctive then it could join the elite languages that are auto-detected. :-)
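As a rough illustration of scoping auto-detection down: highlightAuto already accepts an optional second argument listing the languages it may consider, so a default "elite" set could be as simple as a fixed list. The list contents and the wrapper name below are made up, and a stand-in object replaces the library so the sketch runs on its own.

```javascript
// Hypothetical default set that auto-detection is allowed to choose from.
const AUTO_DETECT_LANGUAGES = ["javascript", "python", "xml"];

// Detects the language of `code`, considering only the languages in `subset`.
// `hljs` is passed in so the sketch does not depend on the installed library.
function detectFromSubset(hljs, code, subset = AUTO_DETECT_LANGUAGES) {
  return hljs.highlightAuto(code, subset).language;
}

// Stand-in for the library, for demonstration only: it simply reports the
// first candidate it was allowed to consider.
const stubHljs = {
  highlightAuto: (code, subset) => ({ language: subset[0] }),
};

console.log(detectFromSubset(stubHljs, "let x = 1;"));
```

Languages outside the list would still highlight fine when named explicitly; they just would not compete in detection.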

there are a ton more in the closed source space.

Well, then that gets harder to judge. And I wasn't talking about yours - I'm not even sure I know which yours is. :-)

This just brings us back to the discovery issue, and the security risk (link possibly satire) of just adding a bunch of NPM packages to a project.

Same risk here IF we don't have time to review the code we're adding in any case. :-)

Then remove the language, if no one is willing to fix the language, that is a good indicator that no one is using it or cares enough to fix it (The people who absolutely need the language, but don't know how to fix it, could simply use an old version of highlightjs; it isn't ideal, but neither is having to carry around a language that no one uses).

It's hard to find people to fix popular languages - it's [working with grammars, regex] not an easy thing to do. Look at all the open issues for Typescript, Javascript, JSX. If we can't get attention on those it's hard to imagine lots of attention for more obscure stuff. Obviously you have an occasional champion for a language, but I'm speaking in broad strokes. It's also possible that it (any given language) works "good enough" for many, but that doesn't mean the maintainers don't see the issues filed, and that the issues aren't real and don't add up, cause stress/burnout, etc...

Also on this, with the submodule proposal, removing the language would be as simple as deleting 3 lines from a file, the repository will still exist, with all the issues to it, and if someone made a PR to fix the language, adding it back is as simple as 3 lines.

How do we get updates with submodules? Don't those require a whole PR/commit song and dance to bump the versions? I don't think we want to get in the habit of pulling languages in and out all the time, but I take your point - but again see my point on popular languages and issues.

to an outsider the project seems completely closed

Yeah, and that's a problem. We're very open to fixing existing languages (lots of activity there), less so to adding new ones. :-) But again I mentioned "could have been done better" already.

I think you missed the part about the tests being run on the languages

I think I did [miss it]. We don't have any generic tests that can glance at a language and determine the code quality as well as a maintainer could though...

I did volunteer my time to come up with a possible solution

And thanks for that. :-)

That is understandable, you never know someone's intentions when making a PR and I don't think there is a proper way to ensure that anyone making a PR would even be around the next day to fix an issue that arose.

Yeah I think all you can do is have standards, set the bar high (I'm talking about letting some into core, not just submodules), ask people to fix things... but none of that means they'll be around a year later...

I do think we should consider some pruning though (of existing languages)... so many things, so little time. :)

joshgoebel commented 5 years ago

My take from reading lots of requests, etc. is that many want it [their language] "in the default build" so that "site xyzzy" who "uses Highlight.js" can suddenly highlight their favorite language... or "site xyz said they won't add it" and told them to "get it added to upstream highlight.js"... imagining that that just means it instantly goes into the CDN...

But earlier you made it sound like you cared more about npm than the web builds... If we opened up the repo completely (one way or another) but we still followed our more conservative policy for building CDN assets... that wouldn't help a lot of the people clamoring just to get it added because it still wouldn't "just work" out of the box for people only loading the default minimal CDN build.

I suppose having "bloated" "everything and the kitchen sink" CDN builds as an alternative might help a bit there though...

jaredlll08 commented 5 years ago

Just on the caring more about the NPM builds: as I said earlier, I'm not primarily a web / Javascript developer. Nearly all the web development I have done in the past 2 years has been with ReactJS through NPM, and the current web project I'm working on, which ideally is going to use highlight.js, is also using React, so there is a bit of a bias towards the NPM builds.

I don't think the small cdn build should change unless users want a new language in it, as you said, it should be a "blessed set" of languages chosen by the core team.

joshgoebel commented 5 years ago

The security thing is a little bit of a red herring to me... to do security right you have to be vigilant, and pay attention to things, etc... not just defer it to someone else... if you say "i trust highlight.js blindly" or "i trust npm blindly", I'm really not sure how that's much different. If we just don't have time (which has happened before) and just do the minimal amount of effort to vet packages (one way we could deal with lack of time)... then just because we merged something wouldn't necessarily make it more secure than some random package on npm.

Esp if we're doing a VERY minimal cursory check just to add them as submodules, and then they build auto-magically and end up in some huge NPM package we provide... I don't see how that helps anything - it just makes US the source of the security problem instead of NPM.

This "closed to new languages" wasn't really a purposeful policy originally (I don't think)... it just sort of happened because there was NO time to review all the new things and maintain the old things and things just got more and more behind until nothing happened...

And now (a very few of us) sort of have a handle on the situation, we're not wanting it to get back to the way it was before... where everyone is burnt out and then the queues just get super long all over again... so we're trying to figure out how to move forward while avoiding getting stuck where we've been before...

We could really use a new build system no matter WHICH way we go here, but I'm not sure who has the time or inclination to work on that.

jaredlll08 commented 5 years ago

it just makes US the source of the security problem instead of NPM

Okay but right now your go-to method of handling 3rd party languages is to make a repository on the HighlightJS organization. Technically speaking, right now, anyone who has access to one of those repositories can push malicious code and have it be pushed to NPM in their 3rd party language library. Who is responsible then? It doesn't matter who wrote the code, the code is in a repository owned by highlight.js, so even if it isn't distributed in the core language, the blame would still be on the highlightJS organization (I know the blame is on the person who pushed the malicious code, but other people who are angry won't care to look or realize what has gone on).

At least in the submodule idea, the person creating the build (or triggering the script on that server) could do a checklist of the submodules, first find which ones have changed (relatively simple), then either they could do it themselves, or with a few other maintainers, just look at what has changed, a quick glance at the commits / file changes should be enough to ensure there isn't malicious code. Most language files are small so it should be a fast process.

If there isn't time to do a code review on the languages, say there was an urgent bug in the highlight.js framework itself that doesn't need languages to update but does need to be released, then it can just be released without updating the submodules. That way the language files shipped would still be good (as good as the last time they were updated) and the bug gets fixed, regardless of how many languages had updates waiting. If they were all in the core repo the update would take longer, since you would need to know that those languages were okay to ship in their current state, which could take time, which, as you said, isn't abundant.

So the submodule idea actually isn't strictly for new languages, it allows for fixing specific parts of the code regardless of how the other parts have changed (could update a single language submodule and leave the rest at "good" versions if a language needed a critical fix for example.)
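The checklist I have in mind could be sketched like this. It is only an illustration: the submodule names and commit hashes are invented, and how the pinned and upstream commits are collected is left out.

```javascript
// Commit currently pinned for each language submodule, and the latest commit
// upstream. All names and hashes here are invented for illustration.
const pinned = { "lang-a": "a1b2c3", "lang-b": "d4e5f6", "lang-c": "0a0b0c" };
const upstream = { "lang-a": "a1b2c3", "lang-b": "ffee00", "lang-c": "123abc" };

// Submodules whose upstream head differs from the pinned commit need review.
function needsReview(pinnedCommits, upstreamCommits) {
  return Object.keys(pinnedCommits).filter(
    (name) =>
      upstreamCommits[name] !== undefined &&
      upstreamCommits[name] !== pinnedCommits[name]
  );
}

// Moves only the reviewed submodules forward; everything else stays at its
// last known good commit.
function bump(pinnedCommits, upstreamCommits, reviewed) {
  const next = { ...pinnedCommits };
  for (const name of reviewed) {
    next[name] = upstreamCommits[name];
  }
  return next;
}

console.log(needsReview(pinned, upstream));
console.log(bump(pinned, upstream, ["lang-b"]));
```

An urgent core release corresponds to calling bump with an empty reviewed list: nothing moves, and every language ships at the version that was last looked at.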

As for the build system, I don't really see how a new build system would work, at the end of the day you are building the final JS from the same grammar files, regardless of how you get them.

If you are planning on swapping the grammar file format for a new system, then it would make more sense, granted updating over 180 grammars for that doesn't exactly sound like a fun experience.

jf990 commented 5 years ago

Then remove the language, if no one is willing to fix the language, that is a good indicator that no one is using it or cares enough to fix it ...

this is a really hard choice to make in a library like this, for reasons said in this issue and others. This library has been around a long time and many sites and products using it for a long time are just using the common build not using modern tooling. If highlightjs just stops working for someone for no reason because we remove a language, that would cause lots of issues that would maybe end up here, creating more work for the maintainers, or cause those users to just find a different solution. we have a bit of responsibility to maintain backward compatibility. i would like to say if a language doesn't work then toss it out until someone fixes it, but that is not a reasonable expectation for the users of this library, given its history and expectation of those who are using it.

to do security right you have to be vigilant, ... not just defer it to someone else

this is true no matter how we organize the repo, submodules, monorepo, etc., security is still the responsibility of the maintainers and users will point the finger here if/when an issue comes up. we can't defer this.

It doesn't matter who wrote the code,... the blame would still be in highlightJS

absolutely, in the eyes of the user. with our review and tooling we have to do the best job we can staying on top of these things, to be vigilant

I don't really see how a new build system would work, at the end of the day you are building the final JS from the same grammar files, regardless of how you get them

I'm not sure. I see value in a new build system, whether that means fixing up the existing build scripts or rewriting I don't know but there's been lots of improvements in node and npm modules that we could take advantage of. The build system should be able to help lessen some of the maintenance burden we've been discussing here. however, that is a fairly significant project in and of itself.

joshgoebel commented 5 years ago

If highlightjs just stops working for someone for no reason because we remove a language

It's not like it would "stop working", it would just stop highlighting a language (an uncommon one) that we deprecated. I don't think it makes any sense to rip out the uber popular stuff (at least not from CDN builds).

i would like to say if a language doesn't work

The problem is all in degrees. :-) If they didn't work at all they wouldn't be in the repo. It's issues and glitches that come up.

absolutely, in the eyes of the user. with our review and tooling we have to do the best job we can staying on top of these things, to be vigilant

The more paranoid the stance people say they want us to take, the more closed we're going to be - that's just the nature of the game. So you can't say "merge all languages, we understand you have no time" and "oh but guarantee they are secure and safe also while you're reviewing them all very carefully". :-) That is the crossroads we find ourselves at.

Perhaps we need to talk to the maintainers of some other similar projects and see how they handle these things from a policy perspective.

at the end of the day you are building the final JS from the same grammar files

Yeah, I also think this might be a faulty assumption - although I kind of pivot in the middle of talking, lol. I'm thinking (and still like the idea of) a hilightjs-core and hilightjs-community (not attached to names). Core would do auto-detect and have very stringent requirements and be maintained here - of course being open to PRs to maintain our languages. Key languages could be considered for core (WASM, etc)...

Everything else would go into community, and be managed by the community. The JS file that everyone wants is this community file (in my opinion), NOT core. It could even be built (optionally) as a single "binary" that you load right after highlight.js (or concat the two)... it could have its own release process, its own security. Core maintainers could be involved in community (or not). I (for one) am interested in being involved.
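Very roughly, a community file loaded after core might boil down to something like the sketch below. The bundle shape, the function names, and the grammar names are all hypothetical, and a tiny stand-in replaces the real library so the sketch runs by itself. The only real API it mirrors is registerLanguage(name, definition) and listLanguages().

```javascript
// Minimal stand-in for the core library's language registry (demo only).
function createCore() {
  const languages = new Map();
  return {
    registerLanguage: (name, definition) => languages.set(name, definition),
    listLanguages: () => [...languages.keys()],
  };
}

// Hypothetical community bundle: grammar name -> definition function.
// The grammar bodies are empty placeholders, not working grammars.
const communityGrammars = {
  "lang-a": () => ({ contains: [] }),
  "lang-b": () => ({ contains: [] }),
};

// Registers every community grammar on an already loaded core instance.
function registerCommunity(hljs) {
  for (const [name, definition] of Object.entries(communityGrammars)) {
    hljs.registerLanguage(name, definition);
  }
}

const core = createCore();
registerCommunity(core);
console.log(core.listLanguages());
```

Because the bundle only talks to core through registerLanguage, it could be versioned and released without touching core at all.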

I like contributing to many different languages (and I know at least some others are the same) and that type of contribution would be MUCH harder with 100 separate repos - even if they are submodules... I don't want 100 separate issue trackers, etc... so far it seems no one is really in FAVOR of that - it's just one of the ideas we've thought of to avoid putting too much pressure on core maintainers.

highlightjs / highlight.js

Policy: Discuss extras repository for additional languages #2149