Discuss: 3rd party language packages spec

joshgoebel commented 4 years ago

Starting a discussion for an official 3rd party grammar format/spec (directory structure, how modules are exported, etc). I've been working on adding support to the build process for seamlessly building 3rd party grammars. The idea being that you just check them out:

mkdir extra
cd extra
git clone git@github.com:highlightjs/highlightjs-robots-txt.git
git clone git@github.com:highlightjs/highlightjs-tsql.git

And then building "just works":

node ./tools/build.js -t browser :common tsql robots-txt

And high-level grammar testing just works (detect tests, markup tests, etc) in context of the whole test suite. Just run tests like normal. The tests for extra languages are bundled with the extra languages, but magically "just work".

In looking at the existing 3rd party stuff it seems we have two major types of layouts. I think perhaps we could support both.

The slightly flatter npm package:

A more "bundled"/traditional layout:

One alternative to supporting both is to always prefer/require the bundled layout. I prefer this layout slightly as it makes it easier to add multiple languages to a single "package/repo". For example you might bundle Python, Python REPL... bundled layout gives you an easy way to do that (keeping the same directory structure we use internally). Also if we ever again decided to pull a 3rd party language into the core library it'd be trivial to do with the bundled layout as it'd essentially just be copying files from one repo to another.

As to how NPM would work with bundled layout I'd suggest an index.js that simply required the individual modules and exported a hash of them by name:

// index.js
const python = require("./src/languages/python")
const pythonREPL = require("./src/languages/python_repl")
module.exports = { python, pythonREPL }
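
For illustration, a consumer of such an index.js could register everything it exports in one loop. This is a minimal sketch, not an agreed API: `registerAll` is a hypothetical helper, and the `hljs` and `bundle` objects below are stand-ins for the real library and for the result of requiring the package, so that the snippet runs on its own.

```javascript
// Stand-in for the real highlight.js object (assumption: only registerLanguage matters here).
const hljs = {
  languages: {},
  registerLanguage(name, definer) {
    this.languages[name] = definer(this);
  },
};

// Stand-in for what the index.js above would export: a hash of grammar functions by name.
// The grammar bodies are placeholders, not real Python grammars.
const bundle = {
  python: (hljs) => ({ name: "Python", contains: [] }),
  pythonREPL: (hljs) => ({ name: "Python REPL", contains: [] }),
};

// Hypothetical helper: register every grammar under the key it was exported with.
function registerAll(hljs, bundle) {
  for (const [name, definer] of Object.entries(bundle)) {
    hljs.registerLanguage(name, definer);
  }
  return Object.keys(bundle);
}

const registeredNames = registerAll(hljs, bundle);
console.log(registeredNames);
```

One consequence of this shape is that the export key doubles as the language name, so the keys would need to be chosen with that in mind.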

Thoughts?

joshgoebel commented 4 years ago

Also an anti-pattern (in my opinion) I'm seeing is:

module.exports = function(hljs) {
  hljs.registerLanguage('robots-txt', hljsDefineRobotsTxt);
};

module.exports.definer = hljsDefineRobotsTxt;

Resulting in custom registration code that hides the language name/key:

var hljsDefineRobotsTxt = require('highlightjs-robots-txt');
hljsDefineRobotsTxt(hljs);

I'm not sure what the big advantage here is and I'd prefer to instead see:

module.exports = function(hljs) {
  return {
    case_insensitive: true,
    contains: [
    ...

or (same thing really):

function tsql(hljs) { ... }

module.exports = tsql

And README examples like:

var tsqlGrammar = require('highlightjs-tsql');
hljs.registerLanguage('tsql', tsqlGrammar);

IE, what is exported is the grammar function itself, nothing else. Prefer to let people do the registration themselves. This will allow a simpler build process and also I like that it keeps the naming explicit rather than implicit.
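
A minimal sketch of the preferred shape, end to end. The grammar body and the `fakeHljs` object are placeholders invented for illustration (the real grammar and the real highlight.js object would take their place); the point is only that the module exposes one function and the caller picks the name.

```javascript
// tsql.js would contain only this function plus `module.exports = tsql`.
// The returned object is a placeholder, not the real T-SQL grammar.
function tsql(hljs) {
  return {
    case_insensitive: true,
    keywords: "select from where",
    contains: [],
  };
}

// Placeholder for highlight.js so the example runs without the library installed.
const fakeHljs = {
  registered: new Map(),
  registerLanguage(name, definer) {
    this.registered.set(name, definer(this));
  },
};

// The caller chooses the key, so the name stays visible at the call site.
fakeHljs.registerLanguage("tsql", tsql);
// Nothing stops a caller from registering the same grammar under a second name.
fakeHljs.registerLanguage("mssql", tsql);

console.log([...fakeHljs.registered.keys()]);
```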

joshgoebel commented 4 years ago

From another thread about testing 3rd party languages in isolation.

Feel free to prototype some sort of prospective shared .travis.yml in https://github.com/highlightjs/highlightjs-shexc.

That might wait until later (or require someone else to step up). My focus right now is to make things "just work" in the context of a working highlight.js checkout. (Since you need that to build the actual library anyways, etc). And of course that's immediately most useful to me, as a maintainer of the library - everything in one place.

So checkout highlight.js and then vendored language projects, run full highlight.js test suite. No 3rd step.

The other side of things (as you suggest) would be to have your language stand-alone... then you'd pull in highlight.js as a dependency via npm/yarn... and you might have some sort of scaffold to run your custom tests plus "standardized" detect/markup test... of course currently none of the testing stuff ships with our npm library at all since the library itself is just a build product of the actual source.

Open to discussing this element of things here, but it might be outside the scope of what I'm currently working on.

joshgoebel commented 4 years ago

Another question is how to distribute "web"/CDN versions of languages. Currently I was imagining the source files would all be CommonJS (module.exports)... so they'd work with Node out of the box, but require building for distributable/web usage. I'd be game for ES6 modules, but Node.js isn't 100% there yet, so CommonJS is probably simpler for now.

I think some people may want an easy way to just get the CDN ready version... of course our build process could handle this and then drop CDN ready "binaries" in the dist directory of 3rd party languages (that they could then check-in/distribute)... Short of that I think we're looking at something like UMD? But that seems pretty ugly and complex for this... no?

Ok, actually Solidity does it with just a custom function name (which of course pollutes the global namespace also), but I think this is a bit messy:

// in the external solidity.js file
function hljsDefineSolidity(hljs) {
...

// and then to register after you loaded the JS
hljs.registerLanguage('solidity', window.hljsDefineSolidity);
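
One lighter-weight alternative to full UMD, sketched here as an assumption rather than something proposed in this thread: each language file ends with a few lines that export the grammar under CommonJS and otherwise register it with a global `hljs` if one is present. `exposeGrammar` and its `env` parameter are hypothetical; `env` stands in for the real `module` and `window` so the logic can be exercised outside a browser.

```javascript
// Hypothetical stub: expose a grammar without adding a per-language global.
function exposeGrammar(name, definer, env) {
  if (env.module && typeof env.module.exports === "object") {
    // CommonJS (Node, bundlers): export the grammar function itself.
    env.module.exports = definer;
    return "exported";
  }
  if (env.hljs && typeof env.hljs.registerLanguage === "function") {
    // Plain <script> include: register with the already-loaded library.
    env.hljs.registerLanguage(name, definer);
    return "registered";
  }
  return "skipped"; // neither a module system nor highlight.js was found
}

function solidity(hljs) {
  return { contains: [] }; // placeholder grammar
}

// Simulated Node environment.
const nodeEnv = { module: { exports: {} } };
const nodeResult = exposeGrammar("solidity", solidity, nodeEnv);

// Simulated browser environment with highlight.js already loaded.
const seen = [];
const browserEnv = { hljs: { registerLanguage: (name) => seen.push(name) } };
const browserResult = exposeGrammar("solidity", solidity, browserEnv);

console.log(nodeResult, browserResult, seen);
```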

vsoch commented 4 years ago

I don't have a lot of expertise with node, but if you are serving the languages repos under a common org, you could have each build a custom package with the language, and then (given credentials to release to a CDN) push up there with some identifier for the language. For repos outside of the org you could still share the workflow, but perhaps just deploy back to GitHub for others to link to or download. It seems like a large burden for the core to always include every language, and on the flip side, any community member should be able to build some custom language.

joshgoebel commented 4 years ago

This isn't really to discuss wholesale library distribution per se. You can find some thoughts on that in the related thread here:

My main goals here:

Make it easy for people who want to build/package custom versions of the library that include 3rd party languages seamlessly.
Make it easy for 3rd party languages to be included in custom builds.

If you are using npm AND the 3rd party language has also been packaged with npm you already have some choices there... you could just do your packaging all with npm/yarn/etc... but that's not helpful if say you want to build a monolithic highlight.js library that includes a custom build of languages (including 3rd party languages). And perhaps not everyone wants to provide an npm package either?

Regarding CDN usage, that is on topic though for sure... if someone chooses to package their language as a npm library then the source will already be freely hosted on https://unpkg.com, etc. So that gets into how the JS is structured (and I mentioned some of this above).

My preference would be that anyone could "one line include" a library, such as:

<script src="https://unpkg.com/highlightjs-solidity@1.0.9/solidity.js"></script>

And have it "just work", i.e. do auto-registration, same as our CDN builds do... right now that doesn't happen; you have to instead do something like:

<script src="https://unpkg.com/highlightjs-solidity@1.0.9/solidity.js"></script>
<script type="text/javascript">
    hljs.registerLanguage('solidity', window.hljsDefineSolidity);
</script>

With the define function (hljsDefineSolidity) being different for every single language. So the question is can we/should we have a small registration stub that is included inline in EVERY language file (so the raw files could be loaded via CDN)... or should language authors use a build process/script to produce a CDN distributable (highlight.js could provide it)... such that we'd have a process something like:

git clone highlightjs
cd highlightjs
mkdir extra; cd extra
git clone highlightjs-solidity
cd ..
EXTRA_ONLY=1 node ./tools/build cdn
# this would automatically build the CDN version and place a file in say
# extra/highlightjs-solidity/dist/solidity.min.js

Except really the workflow might be that a language maintainer worked inside a highlight.js install the whole time... so really then pushing a new version of your language would only involve running a new CDN build and then checking the resulting updated distributable into your repo. I.e., the "bootstrapping" would already have been done long ago.
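
As a rough sketch of what such a build step could emit (an assumption for illustration, not the actual output of the build tooling), the CommonJS source could be wrapped in a function that supplies a local `module` object and then registers the result with `hljs`. `wrapForCdn` is a hypothetical name, and the one-line grammar source is a placeholder.

```javascript
// Hypothetical build helper: turn a CommonJS grammar file into a self-registering script.
function wrapForCdn(name, commonJsSource) {
  return [
    "(function () {",
    "  var module = { exports: {} };", // local shim, so nothing leaks into the global scope
    commonJsSource,
    "  hljs.registerLanguage(" + JSON.stringify(name) + ", module.exports);",
    "})();",
  ].join("\n");
}

// Placeholder for the contents of a grammar source file.
const grammarSource =
  "module.exports = function (hljs) { return { name: 'Solidity', contains: [] }; };";

const cdnSource = wrapForCdn("solidity", grammarSource);

// Simulate a page where highlight.js is already loaded by passing a stand-in as `hljs`.
const registered = {};
const fakeHljs = {
  registerLanguage(name, definer) {
    registered[name] = definer(this);
  },
};
new Function("hljs", cdnSource)(fakeHljs);

console.log(Object.keys(registered));
```

With output of this kind, a single script include would be enough and the page author would not need to know the name of any per-language define function.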

tajmone commented 4 years ago

@yyyc514:

I prefer this layout slightly as it makes it easier to add multiple languages to a single "package/repo". For example you might bundle Python, Python REPL... bundled layout gives you an easy way to do that (keeping the same directory structure we use internally).

I agree, especially if this would allow including third party languages as sub-languages (e.g. an HTML package that includes 3rd party PHP and JavaScript packages). Ideally, this should be easily done via Git submodules, without the included package needing to be designed in any special way for this to work.

The idea is that if a user needs to create a custom HLJS package only targeting a given language (e.g. HTML), that package is allowed to include required language dependencies (e.g. PHP and JS).

joshgoebel commented 4 years ago

I agree, especially if this would allow including third party languages as sub-languages (e.g. an HTML package that includes 3rd party PHP and JavaScript packages)

It would, though that's just a side benefit of the layout. Currently what I have so far allows only a single language per package, but that should be pretty easy to change if the packaging format has a simple enough structure - as this does.

ideally, this should be easily done via Git submodules, without the included package needing to be designed in any special way for this to work.

I hadn't been thinking of that. If they were so closely related, why not just a single repo?

Also the current directory structure doesn't allow for that easily I don't think... for example the language might be in:

/src/languages/tsql

While the tests would be in:

/test/detect/tsql
/test/markup/tsql

So 3 different folders... if you wanted to use git submodules or subtrees you'd want more of a parent heavy layout such as:

/grammars/tsql/
/grammars/tsql/test/detect
/grammars/tsql/test/markup

Wouldn't you? That's pretty different from EITHER of the two formats that seem to already be popular.
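
To make the difference concrete, here is a small sketch that computes the deepest directory shared by all of a language's paths under each layout. A submodule or subtree needs one such root per language. The function names are hypothetical; the paths are the ones listed above.

```javascript
// Paths for one language in the classic (core-like) layout.
function classicLayout(lang) {
  return [`/src/languages/${lang}`, `/test/detect/${lang}`, `/test/markup/${lang}`];
}

// Paths for one language in the parent-heavy layout.
function parentHeavyLayout(lang) {
  const root = `/grammars/${lang}`;
  return [root, `${root}/test/detect`, `${root}/test/markup`];
}

// Deepest directory that contains every path, compared segment by segment.
function sharedRoot(paths) {
  const split = paths.map((p) => p.split("/"));
  const first = split[0];
  let depth = 0;
  while (depth < first.length && split.every((parts) => parts[depth] === first[depth])) {
    depth++;
  }
  return first.slice(0, depth).join("/") || "/";
}

const classicRoot = sharedRoot(classicLayout("tsql"));
const parentHeavyRoot = sharedRoot(parentHeavyLayout("tsql"));
console.log(classicRoot, parentHeavyRoot);
```

In the classic layout the only shared directory is the repository root itself, which is why a single language cannot be split out as its own submodule there.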

joshgoebel commented 4 years ago

Although I suppose you could just state that multiple packages in a repo could be bundled into a single directory (regardless of the layout we choose)...

So you'd have:

/grammars/tsql/src/languages/tsql
/grammars/html/src/languages/html
/grammars/html/test/markup/html

Etc... Whereas /grammars/html and /grammars/tsql would just link to the child repos... but if that's the case I'm not sure the bundling in one package is that huge of a win... already the idea is to put languages in "extras" or another blessed folder:

So why not just have 3 repos and a copy-paste in the README to check them all out into extras. I guess I'm not 100% seeing the benefit of having a repo with just 3 child repos with submodules, it sounds like a VERY complex way to get the job done... if it was me it'd probably just be one repo to begin with though.

joshgoebel commented 4 years ago

The idea is that if a user needs to create a custom HLJS package only targeting a given language (e.g. HTML), that package is allowed to include required languages dependencies (e.g. PHP and JS).

If the language specified the dependencies they could be auto-downloaded (and all in their own repos)... but now we might be over-engineering a bit. :-)

joshgoebel commented 4 years ago

Worst case we could always also include a hljs-grammar.json file or something (or allow configuration inside package.json) and allow packages to specify their own layouts - though I'd prefer to have a blessed layout instead (simpler).
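
A sketch of how such a file could be consumed, assuming a made-up schema: the keys `languages`, `detectTests`, and `markupTests` are hypothetical, as is `resolveLayout`. Packages that ship no file simply get the blessed defaults, which here mirror the core directory structure.

```javascript
// Assumed defaults, mirroring the directory structure of the core library.
const BLESSED_LAYOUT = {
  languages: "src/languages",
  detectTests: "test/detect",
  markupTests: "test/markup",
};

// Hypothetical: merge an optional per-package manifest over the blessed defaults.
function resolveLayout(manifest) {
  return { ...BLESSED_LAYOUT, ...(manifest || {}) };
}

// A package without a manifest follows the blessed layout.
const defaultLayout = resolveLayout(null);
// A package that keeps its grammars in ./lib overrides just that one key.
const customLayout = resolveLayout({ languages: "lib" });

console.log(defaultLayout, customLayout);
```

The cost of this flexibility is that every tool reading a package has to go through the same resolution step, which is part of why a single blessed layout is simpler.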

tajmone commented 4 years ago

So why not just have 3 repos and a copy-paste in the README to check them all out into extras. I guess I'm not 100% seeing the benefit of having a repo with just 3 child repos with submodules,

Because each language might be maintained by different parties, and submodules would ensure that all sub-languages (which are truly dependencies) always mirror their latest upstream version.

it sounds like a VERY complex way to get the job done... if it was me it'd probably just be one repo to begin with though.

I don't see how it's more complex than any other GitHub project that includes third party dependencies. Unless there are some configuration files that allow specifying the third party languages to be included in the build, using sub-modules seems the GitHub way to handle this.

The goal should be to have a DRY approach to sub languages. Various syntaxes are used both on their own as well as a sub-language in another syntax (e.g. PHP); other syntaxes might exist only as sub-languages (e.g. "docs as comments" notations that are language agnostic).

For the sake of an example, if I were to create an HTML syntax for HLJS, and add sub-language support for PHP, JS, and others by manually copying and pasting them into my repository, then end users won't be enjoying any updates and bug fixes to those sub-languages unless I update them by re-pasting their updated versions into my repository.

As an end user wanting to create an ad hoc HLJS package to highlight HTML source examples on my website, I'd expect that each build offers the latest version of all languages involved.

tajmone commented 4 years ago

If the language specified the dependencies they could be auto-downloaded (and all in their own repos)... but now we might be over-engineering a bit. :-)

Indeed, but I think that both options should be available. I've brought up the issue because I've encountered this problem and its various solutions in editor syntaxes.

For example, using again HTML and PHP (which are well known practical examples), in many editors the PHP package actually contains multiple PHP syntaxes, one for standalone PHP code and another for embedded PHP. Which approach is the best would depend on how the upstream syntax is designed.

You can see this in the PHP package that ships natively with Sublime Text:

https://github.com/sublimehq/Packages/tree/master/PHP

So which approach to use to handle sub-languages might depend on the context at hand: in some cases you'd want to include the third party HLJS syntax as is, other times you might need to tweak it because you need only a subset.

Ideally, for languages that are also embeddable into other syntaxes their maintainers would take extra steps to ensure their re-usability (e.g. like the dual syntax example of ST).

I just thought that these are worthy considerations to keep in mind in this topic.

ericprud commented 4 years ago

@yyyc514 , do you have a strawman repo we can align to in order to expedite migration and make our design contributions more fruitful? Having something to play with would make me more useful. And of course, I'm happy to use https://github.com/highlightjs/highlightjs-shexc to try out ideas, e.g. how to detect when one module advertises auto-detect patterns which steal from others. I very much like the goal of being able to mix and match CDN-backed modules.

joshgoebel commented 4 years ago

@ericprud You can use my https://github.com/yyyc514/highlight.js/tree/squash_build_pipeline branch.

mkdir extra
# checkout some repos here

Currently I have:

% ls -l extra/
drwxr-xr-x   8 jgoebel  staff  256 Dec 17 12:24 highlightjs-lustre
drwxr-xr-x   8 jgoebel  staff  256 Dec 18 14:55 highlightjs-robots-txt
drwxr-xr-x  10 jgoebel  staff  320 Dec 17 12:31 highlightjs-solidity
drwxr-xr-x  12 jgoebel  staff  384 Dec 17 12:31 highlightjs-structured-text
drwxr-xr-x  10 jgoebel  staff  320 Dec 17 12:25 highlightjs-tsql

All of these "just work" with the current build pipeline as it detects if they are a bundle (like tsql) or a single language (robots-txt)... multiple languages per bundle aren't supported yet.
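
How the branch tells the two apart is not spelled out here. One plausible check, sketched under that caveat, is whether the package contains a `src/languages/` directory (bundle) or keeps its grammar elsewhere (single language). `detectLayout` is hypothetical and takes a list of relative file paths so that it can run without touching the disk; the file names in the example are invented.

```javascript
// Hypothetical check: a package with src/languages/ is treated as a bundle.
function detectLayout(files) {
  const hasBundleDir = files.some((file) => file.startsWith("src/languages/"));
  return hasBundleDir ? "bundle" : "single";
}

// Invented file listings for the two kinds of package.
const bundledPackage = ["src/languages/mylang.js", "test/detect/mylang/default.txt", "package.json"];
const flatPackage = ["mylang.js", "package.json", "README.md"];

const bundledKind = detectLayout(bundledPackage);
const flatKind = detectLayout(flatPackage);
console.log(bundledKind, flatKind);
```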

joshgoebel commented 4 years ago

Note by default you may get a bunch of test failures since many of these don't provide proper detect tests and they also conflict with core library auto-detection.

joshgoebel commented 4 years ago

As an end user wanting to create an ad hoc HLJS package to highlight HTML source examples on my website, I'd expect that each build offers the latest version of all languages involved.

I'm just not convinced most users' needs are that complex... creating bundles within bundles. @tajmone

Someone who wanted to ship a complex bundle of languages like that in my mind would:

checkout highlight.js
checkout all their individual repos into extra (ie, checkout HTML, php, XML, php7)
build the final product
deploy/push that final product to git/CDN, etc

So the deliverable would be the built product. And someone who wanted to start from 0 would just collect the repos in extra themselves.

But I'm not entirely opposed to the idea. I'm more opposed to needlessly making the packaging format/structure more complex to accommodate an edge case. I'm still pretty attached to the "classic" layout, which from what I can tell wouldn't easily support a ton of submodules, or is there some way of doing that that I'm missing?

joshgoebel commented 4 years ago

I'm more focused on the directory structure than HOW people manage that. I know a lot of people hate git submodules, but whatever... The build pipeline should only care if it can find the files in the proper places - not whether you are using SVN, git, mercurial, 28 git submodules, etc... that's a decision up to the developer. :-)

joshgoebel commented 4 years ago

If we were to support "modules within modules" I'd probably do it by just allowing someone to specify multiple source extra language dirs... so you could use BOTH:

./extraLanguages
./myCustomPack

And within those would be standard 3rd party packages. It wouldn't matter that extra was 5 separate repos and myCustomPack was a SINGLE huge repo with 5 subdirectories each acting as individual packages.

To me that's an uglier way of arranging things, but if someone felt the ability to version and keep everything separate was worth the effort, then to each their own.
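
In code, supporting several source directories amounts to flattening them into one list of packages before the build starts. A minimal sketch, with the directory listing passed in as a plain object so it runs anywhere; `collectPackages` and the package names under `./myCustomPack` are made up for the example.

```javascript
// Hypothetical: flatten several "extra" roots into one list of package directories.
function collectPackages(roots, listing) {
  return roots.flatMap((root) =>
    (listing[root] || []).map((name) => `${root}/${name}`)
  );
}

// Stand-in for reading the directories from disk.
const listing = {
  "./extraLanguages": ["highlightjs-tsql", "highlightjs-robots-txt"], // separate repos
  "./myCustomPack": ["html", "php"], // subdirectories of one large repo
};

const packages = collectPackages(["./extraLanguages", "./myCustomPack"], listing);
console.log(packages);
```

The build only sees a flat list of package directories, so whether each one is its own repo or a subdirectory of a larger one makes no difference to it.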

joshgoebel commented 4 years ago

@tajmone I think your comments on PHP might be wading into the other topic even: https://github.com/highlightjs/highlight.js/issues/2330 :-)

joshgoebel commented 4 years ago

@tajmone I changed the topic name slightly... Some of your thoughts sound like you're talking a bit more about packaging and delivery concerns, whereas the first goal here is to come up with (or consolidate on) a simple repository layout that "just works" to allow people to easily build highlight.js with custom languages - without having to manually copy them into src/languages and copy over tests, etc...

Ie, the goal is "self contained language module"... how to take 4-5 of those and bundle them into larger packaged distributions might be a bit out of scope here, I'm not completely sure. I was merely noting that the classic folder layout made this easy to do structurally - just follow the same conventions the core library is already using - which seems logical.

It may be that what we come up with here is simpler and more foundational, and then you can layer more complexity on top of it later. So far you're the first person to express interest in a very complex bundled language with multiple sub-dependencies, etc.

Most people just need a way to easily contribute a single language that others can then easily re-use.

tajmone commented 4 years ago

Some of your thoughts sound like you're talking a bit more about packaging and delivery concerns, whereas the first goal here is to come up with (or consolidate on) a simple repository layout that "just works" to allow people to easily build highlight.js with custom languages - without having to manually copy them into src/languages and copy over tests, etc...

Sounds good to me. The priority should always be simplicity of use aimed at basic usage. Of course, keeping an eye on more advanced needs might always help in shaping things so that these can still be achieved without having to resort to hacking away from the standard way of doing things.

From your other comments it looks like authors won't be prevented from customizing their own languages or packages, so it seems the right direction anyhow.

Serhioromano commented 4 years ago

I think the biggest problem is language support delivery. For example, I added support for a new language. How will developers add it to their current apps? I added a language years ago and I still do not see support for it in so many apps I use. I have to manually load the language and process code through it. That is not plug-and-play.

I think convention is the keyword here. You create a package a certain way and it immediately becomes available wherever highlight.js does the highlighting. You do not have to update the core highlight.js file.

As I see it, when HLJS finds a new language tag it first looks for support for that language in a subfolder, let's say ./langs/lang-tag; if it does not find it there, it looks on a public CDN and then loads that language dynamically.

We can add an hljs initialization option "check_cdn" so developers can turn off non-local JS file loading. And to avoid this, developers can run a command that will fetch all possible language tags to the /langs folder.
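
The lookup order described above could be sketched as follows. Everything here is hypothetical: highlight.js has no `check_cdn` option today, `resolveLanguageSource` is an invented name, and the CDN base URL is a placeholder.

```javascript
// Hypothetical resolution order: local ./langs folder first, then an optional CDN fallback.
function resolveLanguageSource(tag, options) {
  const { localTags, checkCdn = true, cdnBase = "https://cdn.example/hljs" } = options;
  if (localTags.includes(tag)) {
    return { from: "local", location: `./langs/${tag}` };
  }
  if (checkCdn) {
    return { from: "cdn", location: `${cdnBase}/${tag}` };
  }
  return { from: "none", location: null }; // language stays unhighlighted
}

const local = resolveLanguageSource("html", { localTags: ["html"] });
const remote = resolveLanguageSource("shexc", { localTags: ["html"] });
const blocked = resolveLanguageSource("shexc", { localTags: ["html"], checkCdn: false });

console.log(local.from, remote.from, blocked.from);
```

Note that any real implementation would have to load the remote file asynchronously, which this sketch leaves out.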

It means that we actually get a language as a package. Let's say we have a convention for the package name. For instance hljs-pkg-html, where html is a language tag. This way it is simple to fetch any package with a CLI command like hljs install HTML. It is also easy to find all packages that start with hljs-pkg-* and install them locally with hljs install all. And it will be easy to check the existence of language tag support.

This way we give a very flexible way to manage languages with one command and for those who are not concerned about loading JS files from CDN, just include hljs file and it will highlight any language even the one that was approved just yesterday.

This is what it means to contribute to language highlight. For instance, I go to one site where they publish articles like gitbook and start publishing my article and find that my language is not supported. I create the repository, add support, get approved by hljs, go back to my book and my code examples are highlighted.

tajmone commented 4 years ago

This way we give a very flexible way to manage languages with one command and for those who are not concerned about loading JS files from CDN, just include hljs file and it will highlight any language even the one that was approved just yesterday.

This sounds very similar to what you can do on the highlight.js website, where you can create a custom package by picking the languages you want to include, directly from the download page, except that you'd like this to also be accessible via command line commands and/or web API, if I understood correctly?

joshgoebel commented 4 years ago

@Serhioromano I commented to most of your thoughts over at https://github.com/highlightjs/highlight.js/issues/2149 because they are a bit off-topic here. You're touching on the bigger picture, of which this is a small part.

It means that we actually get a language as a package. Let's say we have a convention for the package name. For instance hljs-pkg-html, where html is a language tag. This way it is simple to fetch any package with a CLI command like hljs install HTML. It is also easy to find all packages that start with hljs-pkg-* and install them locally with hljs install all. And it will be easy to check the existence of language tag support.

That's what this issue is headed towards, though I'm not 100% sure a tool is needed. And I don't think the naming should matter that much - though a guideline isn't bad. The content of the repo should be what matters, and that's what we're discussing here. So right now if you get my WIP branch installing a new language is just:

git clone [language_repo]

I'm not opposed to a tool, but someone else will likely have to step up to write it. Combine small tool plus a json file of supported 3rd party languages and then you have exactly what you describe... add a few, build a new deliverable, and throw it into your application. That's the goal here.

Actually I think with the naming idea you were getting at that you won't need a central JSON file of languages, and that's true but there are other benefits to having a centralized list. But that's such a small piece of the puzzle anyways.

This way we give a very flexible way to manage languages with one command and for those who are not concerned about loading JS files from CDN, just include hljs file and it will highlight any language even the one that was approved just yesterday.

Yep.

joshgoebel commented 4 years ago

This sounds very similar to what you can do on the highlight.js website, where you can create a custom package by picking the languages you want to include, directly from the download page, except that you'd like this to also be accessible via command line commands and/or web API, if I understood correctly?

Well, it's probably already available via web API, just undocumented, since it's just a form post, although I'm not sure if CSRF protection makes that more difficult or not. @isagalaev Any thoughts on that?

But that's only for core languages of course... and for core languages we already have console build commands also... the missing piece is that there is no easy way to plug in random 3rd party languages, which this issue should solve.

At that point it'd be much simpler for someone to come along and build more advanced abstractions on top once the basic framework is in place.

tajmone commented 4 years ago

Actually I think with the naming idea you were getting at that you won't need a central JSON file of languages, and that's true but there are other benefits to having a centralized list. But that's such a small piece of the puzzle anyways.

Centralized lists are excellent when all you need is to package the official languages into your bundle. But sometimes tweaked packages are necessary too. As a real-case usage example, my syntax for "PureBasic" pertains to a language which is actively being developed and which now and then introduces new keywords, sometimes renames them, or abrogates some keywords. The main package is designed to cover all versions of the language, older and current alike. But when working on documentation for that language I usually build an ad hoc package covering only the keywords of the latest version.

Similar version-specific packages aren't likely to be included in a central list because in the vast majority of cases the syntax that covers all versions of the language is what most people need, and including different versions of the same syntax would be a burden and add unnecessary complications.

joshgoebel commented 4 years ago

Someone could always git fetch or otherwise manually drop a package in (or write one by hand)... having a centralized list would only be for convenience and directory purposes.

joshgoebel commented 4 years ago

and including different versions of the same syntax would be a burden and add unnecessary complications.

There has been some discussion on this kind of thing over in the Python 77 vs Python 90 thread.

ericprud commented 4 years ago

This sounds very similar to what you can do on the highlight.js website, where you can create a custom package by picking the languages you want to include, directly from the download page

Someone could always git fetch or otherwise manually drop a package in (or write one by hand)... having a centralized list would only be for convenience and directory purposes.

If I understand correctly, the original goal was to stop having to add every language module to the highlightjs repo and build process and that the interface described by @tajmone above currently works from the languages added to that repo. The interface could work from some centralized registry of contributions, perhaps a wiki like:

language	classname	repo	CDN
JSON	json	http://github.com/group1/highlightjs-json.git	highlightjs-json
JSON	json	http://github.com/user2/highlightjs-json.git	highlightjs-json-fancy
ShExC	shex	http://github.com/ericprud/highlightjs-shexc.git	cdn1-shex

Alternatively, that interface could accept repos by URL, which it would clone, build, and package. Either of these would allow competing modules for e.g. JSON.

ericprud commented 4 years ago

My mild preference would be that web apps load a revealing version of highlightjs à la:

var highlightjs = (function () {
  function registerLanguage (name, module) {...}
  return { registerLanguage }
})()

and every language automagically register itself:

(function () {
  const shexcModule = ...
  highlightjs.registerLanguage('shexc', shexcModule)
})()

This adds only one global, highlightjs, and users just have to load the modules.

<script src="...highlightjs"></script>
<script src="...highlightjs-shexc"></script>
<script src="...highlightjs-javascript"></script>
...
<pre class="hljs">...</pre> <!-- autodetect language -->
<pre class="hljs shexc">...</pre> <!-- force language -->

If some module wants to offer a control interface, they can always shove it into window, but I've not noticed this in existing highlightjs modules.

Without some repackaging, users don't have control to create multiple instances of highlightjs with different registered languages, but since you can force the language in the class, I think that's OK.

I think it will be confusing for users and a support burden for highlightjs maintainers if there are two tiers of languages, embedded and "extra". If folks want to load a single document, they can use the interface discussed above to create monolithic modules and host them someplace on their own site. Otherwise, I really like the idea of having a minimal highlightjs which is always supplemented by one or more language-specific modules. This will also make sure that the modular approach isn't a "second class citizen".

joshgoebel commented 4 years ago

If I understand correctly, the original goal was to stop having to add every language module to the highlightjs repo and build process and that the interface described by @tajmone above currently works from the languages added to that repo.

Well that wasn't a direct goal, per se. It was a necessity because the core team doesn't have time to maintain all these grammars in the core repository. So that's the world we live in for now.

The interface could work from some centralized registry of contributions, perhaps a wiki like:

More likely a JSON file that builds a nicer list... the official README list could even be built from such a file once it exists.
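
For instance, the rows proposed above could live in a JSON array, and both the human-readable list and a report of competing modules could be generated from it. The field names and the `toListing` and `competingClassnames` helpers are assumptions; the entries are copied from the example table.

```javascript
// Registry entries taken from the example table; the field names are assumed.
const registry = [
  { language: "JSON", classname: "json", repo: "http://github.com/group1/highlightjs-json.git", cdn: "highlightjs-json" },
  { language: "JSON", classname: "json", repo: "http://github.com/user2/highlightjs-json.git", cdn: "highlightjs-json-fancy" },
  { language: "ShExC", classname: "shex", repo: "http://github.com/ericprud/highlightjs-shexc.git", cdn: "cdn1-shex" },
];

// Hypothetical: render the registry as a simple text table for a README.
function toListing(entries) {
  const header = "language | classname | repo | CDN";
  const rows = entries.map((e) => [e.language, e.classname, e.repo, e.cdn].join(" | "));
  return [header, ...rows].join("\n");
}

// Hypothetical: find class names claimed by more than one module.
function competingClassnames(entries) {
  const counts = new Map();
  for (const e of entries) {
    counts.set(e.classname, (counts.get(e.classname) || 0) + 1);
  }
  return [...counts].filter(([, count]) => count > 1).map(([name]) => name);
}

const listing = toListing(registry);
const competing = competingClassnames(registry);
console.log(listing);
console.log(competing);
```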

This adds only one global, highlightjs, and users just have to load the modules.

That's the idea... though right now the registration functions end up in the global namespace... if one uses the 3rd party modules RAW - vs building them into say CDN modules, which is slightly nicer.

Without some repackaging, users don't have control to create multiple instances of highlightjs

Don't think this has even been suggested...

I think it will be confusing for users and a support burden for highlightjs maintainers if there are two tiers of languages, embedded and "extra". If folks want to load a single document, they can use the interface discussed above to create monolithic modules and host them someplace on their own site. Otherwise, I really like the idea of having a minimal highlightjs which is always supplemented by one or more language-specific modules. This will also make sure that the modular approach isn't a "second class citizen".

Well, there will always (or for the foreseeable future) be core languages and 3rd party languages... so when you're dealing with the raw source, that is a reality. What we're doing here is trying to fix it so that you only deal with that at the source level... and then you BUILD a single monolithic library - or you build a CDN with all the modules... and to help those maintainers bundle a "CDN ready" version of their language if they wish.

Someone could come along later and build a "community" library which collected say 300 languages... if they wanted to maintain it and deal with support, hosting, etc...

joshgoebel commented 4 years ago

Either of these would allow competing modules for e.g. JSON.

I haven't looked at how that works now (for my new branch), but this is definitely something we should support. Though if someone wants to bundle BOTH jsons it gets a little more complicated. It's easy to do this manually though, just by registering languages over top of the original or with a new name... so the support is already there at the foundation level.

I would imagine the correct behavior if you installed extra/json would be that it would take the place of core/json. Yes?
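A minimal sketch of the manual approach described above: register the replacement grammar over the original name, or keep both under different names. To stay self-contained it uses a small stand-in object rather than the real library, and the two grammars (`coreJson`, `extraJson`) and the `json-core` name are hypothetical. With the real library, the same `registerLanguage` calls would be made on the actual `hljs` object.

```javascript
// Stand-in for the highlight.js object so the sketch runs on its own.
// Only registerLanguage/getLanguage are mimicked, and only roughly.
const hljs = {
  languages: new Map(),
  registerLanguage(name, definer) {
    this.languages.set(name, definer(this)); // the last registration wins
  },
  getLanguage(name) {
    return this.languages.get(name);
  },
};

// Hypothetical grammars; real ones return full mode definitions.
const coreJson = () => ({ name: "JSON (core)", contains: [] });
const extraJson = () => ({ name: "JSON (extra)", contains: [] });

hljs.registerLanguage("json", coreJson);
hljs.registerLanguage("json", extraJson);     // takes the place of the core grammar
hljs.registerLanguage("json-core", coreJson); // or keep the original under a new name

console.log(hljs.getLanguage("json").name);
console.log(hljs.getLanguage("json-core").name);
```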

joshgoebel commented 4 years ago

I think it will be confusing for users and a support burden for highlightjs maintainers if there are two tiers of languages, embedded and "extra".

The point is that we don't maintain extra or 3rd party languages. They are maintained and supported by the community. We seem to be willing to keep a list of them and host them (at the highlightjs organization), so that gives them a place to live... but they are still very much separate from the core languages.

tajmone commented 4 years ago

The point is that we don't maintain extra or 3rd party languages. They are maintained and supported by the community. We seem to be willing to keep a list of them and host them (at the highlightjs organization), so that gives them a place to live...

I think that it makes sense that highlight.js should keep some of the most commonly used languages as part of its main package. Syntaxes like HTML, JSON, markdown, etc. and popular languages should be part of the highlight.js repository because end users expect them to be always present and working with every HLJS version, unlike less common languages like my PureBasic and Alan IF syntaxes, which are intended for a niche user base (if anyone is using them at all).

The problem with 3rd party syntaxes is that they are often created out of need for a particular project, and might not necessarily be maintained after the original need is over. I myself haven't updated my syntaxes in the past year (even though the language they refer to was updated), simply because I'm not currently working with them as much as before.

Ensuring that popular languages are always available and working in the main HLJS repo and distribution is IMO an important goal — I believe that the vast majority of end users want a simple-to-use highlighter that they can trust will handle properly mainstream syntaxes and languages.

joshgoebel commented 4 years ago

The problem with 3rd party syntaxes is that they are often created out of need for a particular project, and might not necessarily be maintained after the original need is over. I myself haven't updated my syntaxes in the past year (even though the language they refer to was updated), simply because I'm not currently working with them as much as before.

Exactly, you're part of the problem!!!! Just joking. :-p

I'd actually go even further and like to see some less popular languages moved out of core entirely into 3rd party modules, but I don't know if we'll ever get there or not. :-)

I believe that the vast majority of end users want a simple-to-use highlighter that they can trust will handle properly mainstream syntaxes and languages.

And that in and of itself is a hard enough goal for active languages. :-)

Serhioromano commented 4 years ago

@yyyc514

You're touching on the bigger picture, of which this is a small part.

That is right, this is a smaller part, but it is important to take the big picture into account here. Otherwise, the choices made now will limit what can be done about the big picture later.

I'm not opposed to a tool, but someone else will likely have to step up to write it.

That is right. I can do that. But if there is no naming convention, you can do so much less. Your example is git clone [repo]. That is nice. What if I want to get a list of all supported languages? I can find all repos that start with hljs-pkg- and then look into package.json to get language details. This is a set of commands that I think will be possible:

hljs install html
hljs uninstall html
hljs search [html]  // search for one language if given, otherwise for all available ones
hljs list  // list all installed languages

But the beauty of it is that you do not need to do anything for this to function, only make a convention for repository naming.

joshgoebel commented 4 years ago

That is right. I can do that. But if there is no naming convention, you can do so much less. Your example is git clone [repo]. That is nice. What if I want to get a list of all supported languages?

We'll be keeping a list of those anyways for README purposes, so they'd be in a JSON file for discoverability... so listing all the "known" ones would be trivial. If someone wanted exposure for their language I'm not sure why they wouldn't want it in the main list. And for "manual" (outside the list) installs (which I'd think would be rare) you could still fall back to simple git clone.

But the beauty of it is that you do not need to do anything for this to function, only make a convention for repository naming.

But we already want an official list to track what languages we support, have some meta-data, organize that in a single place, etc...

Nothing against a tool like the one you're describing, I just think it'd work with a built-in packages DB that we'd maintain. And I think it's a bit overkill... but it looks like it'd probably be a pretty thin wrapper around git, so perhaps it's not so bad a thing.

One note is eventually we'd like a command-line tool to actually highlight things, or make testing the highlighter easier...

hljs testfile.js
# renders highlighted output here
# auto-detects JS
# perhaps shows some meta-data

So I'm not sure if all those things can live in a single executable or not, they probably could... and maybe hljs parse isn't even so bad (using a subcommand). I guess originally I just hadn't imagined many uses for the executable. :-)

joshgoebel commented 4 years ago

This may also be a moot point. We do intend to enforce naming here at the highlightjs organization just to stay organized... so any repos here should automatically follow the highlightjs-[language] naming scheme. It's only people hosting on their own domain we'd have no control over (and I don't want control over).
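To make the highlightjs-[language] scheme concrete, here is a small sketch of how a tool might recover language names from repository names. The `languageFromRepo` helper is hypothetical and not part of any existing tooling.

```javascript
// Prefix taken from the highlightjs-[language] scheme described above.
const PREFIX = "highlightjs-";

// Return the language name encoded in a repo name, or null if the
// repo does not follow the convention. (Hypothetical helper.)
function languageFromRepo(repoName) {
  if (!repoName.startsWith(PREFIX) || repoName.length === PREFIX.length) {
    return null;
  }
  return repoName.slice(PREFIX.length);
}

const repos = ["highlightjs-robots-txt", "highlightjs-tsql", "highlight.js"];
const languages = repos.map(languageFromRepo).filter((name) => name !== null);
console.log(languages);
```

Repositories hosted elsewhere under other names would simply be skipped by such a filter, which matches the point that only repos under the organization are guaranteed to follow the scheme.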

And how to deal with multiple applicants for a single language is still a question.

joshgoebel commented 4 years ago

I've updated my canonical example of a "simple" single language and my current thoughts on how it should look:

https://github.com/highlightjs/highlightjs-robots-txt

I also went ahead and actually published it to npm, which I had not previously actually done.

./LICENSE
./test
./test/detect
./test/detect/sample.txt
./test/markup
./test/markup/sample.expect.txt
./test/markup/sample.txt
./dist
./dist/robots-txt.min.js
./README.md
./package.json
./src
./src/robots-txt.js

Most notably:

The only thing exported now is the registration function (no more definer)
Moved the source into src instead of having it in the root, this seems cleaner and more organized
A CDN module is published in dist which can be included with a single line of JS (the highlight.js build system can auto-generate this for maintainers)
When using Node/build toolchain you need to manually registerLanguage after loading the grammar itself (vs using a definer function)

Solved issues:

A single standard way to register all languages, you don't need to know the name of the definer function or check whether someone has exported it as definer or default.
If you only need to include a static JS file from a CDN etc, it's now a one-liner, not two. Making it just as simple as 1st party CDN modules.
Solves the practical problem of "how do I just use this on the web" by including a CDN distributable and integrating this into the HLJS build process so maintainers get it "for free"
Allows the src to simply be the raw source and only export a module, not worry about which context it's being used, avoids the need for browser stubs, etc.
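The points above can be sketched roughly as follows. The grammar body is a made-up placeholder, not the actual robots-txt grammar, and `hljs` here is a stand-in object so the example runs on its own; with the real library the module would be loaded and passed to the library's `registerLanguage` in the same way.

```javascript
// src/robots-txt.js would export just this one function.
// (Placeholder grammar body, not the real one.)
function robotsTxt(hljs) {
  return {
    name: "robots.txt",
    contains: [{ className: "comment", begin: "#", end: "$" }],
  };
}

// Stand-in for the highlight.js object so the sketch is self-contained.
const registered = {};
const hljs = {
  registerLanguage(name, definer) {
    registered[name] = definer(hljs);
  },
};

// Node/build toolchain: the consumer registers the grammar manually.
hljs.registerLanguage("robots-txt", robotsTxt);

// A CDN build in dist/ would have this same call baked into the file,
// which is what makes a single script include sufficient.
console.log(Object.keys(registered));
```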

ericprud commented 4 years ago

I think I've conformed to all of the updates except getting rid of the definer function 'cause I use it in web pages to register multiple highlighters with multiple starting productions:

<script>
  function hljsDefineTExpr (highlightjs) {
    const ret = hljsDefineShExC(highlightjs, 'tripleConstraint');
    ret.disableAutodetect = true;
    return ret;
  }
  hljs.registerLanguage('shexc', hljsDefineShExC);
  hljs.registerLanguage('texpr', hljsDefineTExpr);
  (['DOMContentLoaded','load']).forEach(e => addEventListener(e, init, false));
  let inited = false;
  function init () {
    if (inited) return;
    inited = true;
    const blocks = document.querySelectorAll('.shex,.texpr');
    [].forEach.call(blocks, hljs.highlightBlock);
  }
</script>

(This after I endorsed not polluting global space with e.g. hljsDefineShExC...) Any thoughts for how to do this elegantly or is this a bit beyond the design goals? It's published at unpkg in case anyone wants to experiment with it. A language sample would be:

<url1> { pre:local IRI AND @<url2> }

joshgoebel commented 4 years ago

'cause I use it in web pages to register multiple highlighters with multiple starting productions:

I'm not sure what you mean. I don't follow "multiple starting productions" and I'm not sure which specific part of your example I should be looking at. We already have an API for dependencies: requireLanguage. See arduino for example:

var ARDUINO = hljs.requireLanguage('cpp').rawDefinition();

joshgoebel commented 4 years ago

@ericprud

"cat src/shexc.js | minify --js > dist/shexc.min.js"

And of course just minifying isn't the same as a CDN build of the file. CDN example:

https://github.com/highlightjs/highlightjs-robots-txt/blob/master/dist/robots-txt.min.js

If you try the latest branch of my build work it'll generate this distributable for you whenever you do a CDN build. (the intention is after that you check it into your repo or publish it though so people have easy access to it)

Or perhaps you're trying to explain why the simple CDN usage doesn't work for your language, but if so I haven't understood properly yet. :-)

ericprud commented 4 years ago

'cause I use it in web pages to register multiple highlighters with multiple starting productions:

I'm not sure what you mean. I don't follow "multiple starting productions" and I'm not sure which specific part of your example I should be looking at

The highlightjs-shexc module colors terms differently depending on whether they are inside or outside '{}'s. For instance, the starting state sticks a title class on IRIs outside and one of the nested states, tripleExpression, sticks a name class on IRIs inside. The hackery above allows me to call the definer with no args to get the default behavior and to call with tripleExpression to make the highlighter act like it's already inside '{}'s (good for highlighting fragments).
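A stripped-down sketch of that pattern, for readers who have not seen the module: a definer that accepts an optional start production, plus a wrapper that pins it. The two states and their patterns are invented for illustration and are not taken from highlightjs-shexc; only the shape (default argument, wrapper setting `disableAutodetect`) mirrors the snippet posted earlier.

```javascript
// Hypothetical two-state grammar; state names and patterns are
// illustrative only.
function defineGrammar(hljs, startProduction = "top") {
  const states = {
    top: { contains: [{ className: "title", begin: /<[^>]*>/ }] },
    tripleExpression: { contains: [{ className: "name", begin: /<[^>]*>/ }] },
  };
  if (!Object.prototype.hasOwnProperty.call(states, startProduction)) {
    throw new Error(`unknown start production: ${startProduction}`);
  }
  return { name: `sketch:${startProduction}`, ...states[startProduction] };
}

// Wrapper that starts "inside the braces" and opts out of auto-detection,
// mirroring the hljsDefineTExpr wrapper shown earlier in the thread.
function defineFragment(hljs) {
  const lang = defineGrammar(hljs, "tripleExpression");
  lang.disableAutodetect = true;
  return lang;
}

console.log(defineGrammar(null).name, defineFragment(null).name);
```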

joshgoebel commented 4 years ago

That's what the requireLanguage API is for:

  function hljsDefineTExpr (hljs) {
    const ret = hljs.requireLanguage("shexc").rawDefinition(opts);
    // ...
    return ret;
  }

ericprud commented 4 years ago

@ericprud

"cat src/shexc.js | minify --js > dist/shexc.min.js"

And of course just minifying isn't the same as a CDN build of the file. CDN example:

https://github.com/highlightjs/highlightjs-robots-txt/blob/master/dist/robots-txt.min.js

Indeed it is not; it needs to be wrapped with hljs.registerLanguage("shexc", ...).
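To show what that wrapping amounts to, here is a hypothetical helper that turns a definer's source text into a self-registering script. This is not the highlight.js build tooling, whose actual output may differ; `wrapForCdn` and the sample grammar are assumptions for illustration.

```javascript
// Hypothetical helper: wrap a grammar's source text so that loading the
// resulting file in a browser registers the language immediately.
function wrapForCdn(name, definerSource) {
  return (
    `hljs.registerLanguage(${JSON.stringify(name)}, (function () {\n` +
    `  return ${definerSource};\n` +
    `})());\n`
  );
}

// Placeholder definer source, not the real ShExC grammar.
const source = 'function (hljs) { return { name: "ShExC", contains: [] }; }';
console.log(wrapForCdn("shexc", source));
```

A minifier could then run over the wrapped text, so wrapping and minifying are separate steps.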

If you try the latest branch of my build work it'll generate this distributable for you whenever you do a CDN build. (the intention is after that you check it into your repo or publish it though so people have easy access to it)

I probably missed something in the docs but I've not succeeded on invoking that on the hierarchy copied from the robots-txt highlighter. I tried invoking like:

extra/highlightjs-shexc$ BUILD_DIR=dist node -e "require('../../tools/build_cdn').build().catch(e => console.warn(e))"

but tools/lib/language.js seems to look in ./src/languages/ rather than just ./src:

Error: ENOENT: no such file or directory, scandir './src/languages/'

Or perhaps you're trying to explain why the simple CDN usage doesn't work for your language, but if so I haven't understood properly yet. :-)

I promise I wasn't being that clever.

joshgoebel commented 4 years ago

Build is meant to be run from the root dir, and you need to run build, not try and hack into the subbuilds. (no idea what you're running into exactly, but I always run build from root and it "just works", so it's gotta be something in your usage)

% cd work/highlightjs
% node ./tools/build.js -t cdn
% ls -l extra/highlightjs-shexc/dist/shexc.min.js          
-rw-r--r--  1 jgoebel  staff  1955 Dec 23 20:43 extra/highlightjs-shexc/dist/shexc.min.js

joshgoebel commented 4 years ago

BUILD_DIR=dist This isn't right for one. :-) Build builds the whole project (and updates extra repositories as a side effect). You can't build JUST your language currently.

More work should be done on doing things from the context of the language package itself, but that hasn't been my focus.

joshgoebel commented 4 years ago

I wonder if we need another target? ../../build grammar or some such? That would just build a CDN distributable for the current language.

gusbemacbe commented 4 years ago

Do I need to do something for the Cyber repository?

joshgoebel commented 4 years ago

@gusbemacbe I'd suggest updating to the new file/directory placement (as mentioned above):

./LICENSE
./test
./test/detect
./test/detect/[name of test].txt
./test/detect/...
./test/markup
./test/markup/[name of test].expect.txt
./test/markup/[name of test].txt
./test/markup/...
./dist
./dist/robots-txt.min.js
./README.md
./package.json
./src
./src/robots-txt.js

If you put your detect and markup files in those folders then the new build/test system we're working on can run them automatically and you won't need your own test scaffold any longer. If you're using CI or something similar now, though, we haven't got that piece of the puzzle figured out yet. We'll also provide a way to easily generate the CDN files in dist.
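As a sketch of why the layout matters, a test runner could pair each markup input with its expected output purely by file name. The `pairMarkupTests` helper below is hypothetical and is not how the highlight.js test suite is actually implemented; it only relies on the `[name].txt` / `[name].expect.txt` convention listed above.

```javascript
// Hypothetical helper: given the file names found in test/markup, pair
// each input with its expected output (<name>.txt with <name>.expect.txt).
function pairMarkupTests(fileNames) {
  const names = new Set(fileNames);
  return fileNames
    .filter((f) => f.endsWith(".txt") && !f.endsWith(".expect.txt"))
    .map((input) => {
      const expected = input.replace(/\.txt$/, ".expect.txt");
      // A missing expectation is reported as null rather than skipped.
      return { input, expected: names.has(expected) ? expected : null };
    });
}

console.log(pairMarkupTests(["sample.txt", "sample.expect.txt"]));
```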

highlightjs / highlight.js

Discuss: 3rd party language packages spec #2328