highlightjs / highlight.js

JavaScript syntax highlighter with language auto-detection and zero dependencies.
https://highlightjs.org/
BSD 3-Clause "New" or "Revised" License

Task: 3rd party grammar template repository #3038

Open joshgoebel opened 3 years ago

joshgoebel commented 3 years ago

Copying in content from @tajmone at https://github.com/highlightjs/highlightjs-alan/issues/5#issuecomment-789643133


I think it would make sense if you created a repository template for custom HLJS syntaxes, which developers can use to create the codebase for their new languages.

The template would provide a boilerplate, with all the required files and directory structure (a sample language to use as a reference), repository settings (.gitattributes, .gitignore, .editorconfig, etc.) and a sample README.md with instructions, which end users will only need to adapt to their new language (instructions for the developer, but also those for end users, as for this Issue).

This would simplify creating new languages for both newbies and experts alike, for with a single click they'd have a robust starting point, and newbies will also find the needed instructions.

At least, this is what I do when I'm dealing with commonly shared repository settings and structures, to avoid useless and repetitive tasks.

Furthermore, having an officially sanctioned template would dispel any confusion regarding the proper repository structure to adopt, since not all independent syntax repos are abiding by the same structure right now.

joshgoebel commented 3 years ago

Anyone wanting to help with this one could start with https://github.com/highlightjs/highlightjs-robots-txt (for layout and structure) and then also look at the other 3rd party grammar repos and borrow any great documentation ideas seen there.

I'd probably include "developer" instructions in a new FOR_THE_DEVELOPER.md or something. A checklist that could be copied right into a GitHub issue would probably be pretty great also.

A sample test or two also wouldn't hurt.

jf990 commented 3 years ago

@joshgoebel

Based on my work on other language contributions and the 3rd party quick start, I'd like to put forward this proposal:

https://github.com/jf990/highlightjs-language-template

joshgoebel commented 3 years ago

Wow, I can tell you put some thought into this (particularly the README). Thanks so much for submitting something. I have a few thoughts/concerns though:

On Testing

If we are still encouraging building inside the scope of Highlight.js, do we want to encourage the user to have their own test suite vs using the built-in testing they get "for free" if they just put their tests in the default test/markup folder? (esp. one that requires changing "language" in like 100 places). And if we do want the tests per grammar (so they can run faster, I'd assume?), should we encourage the same markup (.txt / .expect.txt) naming we use in the main project?

I think perhaps we should match the directory locations/naming.

I think part of the idea of consistency is that eventually we might see someone in the Highlight.js community contribute to the core library, then perhaps publish a 3rd party grammar, then perhaps one day fix a bug in someone else's grammar. Having the entire ecosystem use the same conventions for testing makes this more likely, I think.
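To make the naming convention concrete, here is a minimal sketch of how a runner could pair each input file with its expected-output file. The helper name `pairMarkupTests` and the sample file names are hypothetical, and the `.expect.txt` suffix is assumed from the main project's markup tests; this is not the project's actual test code.

```javascript
// Hypothetical helper: pairs "<name>.txt" inputs with "<name>.expect.txt" outputs.
function pairMarkupTests(fileNames) {
  const suffix = '.expect.txt';
  const expected = new Set(fileNames.filter((f) => f.endsWith(suffix)));
  const pairs = [];
  for (const file of fileNames) {
    // Skip anything that is not an input file (including the expect files).
    if (!file.endsWith('.txt') || file.endsWith(suffix)) continue;
    const expectFile = file.slice(0, -'.txt'.length) + suffix;
    // Only report inputs that actually have an expected-output file.
    if (expected.has(expectFile)) {
      pairs.push({ input: file, expected: expectFile });
    }
  }
  return pairs;
}

console.log(pairMarkupTests(['default.txt', 'default.expect.txt', 'notes.md']));
```

If every grammar repository follows the same pairing rule, a contributor who knows one repository's tests can read another's without relearning the layout.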

Other thoughts

CC @highlightjs/core

tajmone commented 3 years ago

Thanks @jf990!

I agree with all of @joshgoebel's suggestions, especially regarding the need of a sample README, and I'd also add:

  1. Should add .editorconfig settings to enforce code style consistency with the main HLJS repository.
  2. I would use the term "syntax" instead of "language" (e.g. markdown, AsciiDoc, etc., being syntaxes not langs).
  3. I'd remove the CODE_OF_CONDUCT.md document and, in general make the README.md as close as possible to a general purpose boilerplate that end users need to tweak in a few places, with the exception of a few template usage guidelines (mostly links to external documents on how to create, test and use a custom syntax).

Since the switch from the old HLJS system to the newer one, with syntaxes in independent repositories, I haven't really understood how to:

Basically, the new system disconnected me from my syntaxes that were moved out of the main repo to an independent one. I simply stopped using them (and, consequently, stopped using HLJS in my documentation projects, which depend on automated updates of all dependencies), since they just became a maintenance burden with no personal benefits. At one point, during the switch, I simply gave up on understanding the new system due to the sheer amount of inflowing information, sometimes contradictory (at the time the new system was still a WIP). Now, years later, I've spent more hours than I'd wish sifting through lengthy documents, and I still haven't been able to come up with a simple way to use my custom syntax the way I used to (i.e. a practical usage solution that doesn't involve manually copying files around, etc.).

Possibly, the template's README should contain concise instructions on how to use a syntax in a standalone repository, with practical usage examples, especially on how to integrate the standalone repository syntax when using HLJS via the command line to build custom packages, and other tasks, like using it as an external highlighter (instead of having to include it in the final document as a JavaScript dependency, which some document types don't allow).
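As one possible shape for such a usage example, the sketch below registers a standalone grammar at runtime and highlights a snippet with it. It relies on the library's `registerLanguage`, `getLanguage` and `highlight` calls; the helper name `highlightWithGrammar` and the package name in the usage comment are hypothetical, and this covers only runtime registration, not the command-line build.

```javascript
// Registers a third-party grammar on an hljs instance and highlights a snippet.
// `hljs` is passed in so the sketch does not assume how the library was loaded.
function highlightWithGrammar(hljs, name, definition, code) {
  // Avoid registering the same grammar twice.
  if (!hljs.getLanguage(name)) {
    hljs.registerLanguage(name, definition);
  }
  // highlight() returns a result object; `value` holds the highlighted HTML.
  return hljs.highlight(code, { language: name }).value;
}

// Usage, assuming highlight.js is installed and the grammar is published
// under the hypothetical package name "highlightjs-your-language":
//   const hljs = require('highlight.js');
//   const grammar = require('highlightjs-your-language');
//   const html = highlightWithGrammar(hljs, 'your-language', grammar, 'some source');
```

Because the returned value is a plain HTML string, the same helper could be called from a conversion-time script rather than from the browser.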

The old system was much simpler, it allowed someone to contribute a simple custom syntax without having to spend 10 times more reading through documents, sifting through examples, etc, than it actually takes to create a syntax. — just add a new syntax to the existing ones in the main repo, no dependencies headaches, none of the problems that affect a detached repository.

joshgoebel commented 3 years ago

I'd remove the CODE_OF_CONDUCT.md document

I agree, not our place to force a specific CoC on anyone (there are many to choose from I think)... if someone wants to add one on their own, that's fine...

in general make the README.md as close as possible to a general purpose boilerplate

Agree, though I really like the idea of a META_README (ie, what is there now)... that explains the repo, maybe has a checklist, etc... just gotta figure out what to name it, and maybe the README could link to it on its first two lines or so.

Build a custom HLJS package with your custom syntax AND some other native syntaxes from the official repo.

Generally:

git clone highlightjs_url
cd extra
git clone grammar_1
git clone grammar_2
cd ..
npm install
./tools/build.js -t browser :common grammar_1 grammar_2

You'll have a browser monolith just like the one we include by default, but now including grammar_1 and grammar_2.

Test the syntax using HLJS.

Not sure what you mean by test... but if you mean test/develop, just build the browser version, then open our developer tool.

./tools/build.js -t browser my_grammar_name
open ./tools/developer.html

If you mean tests/tests... then if you just put your files in the same folders in your repo as in the main repo (test/markup, etc) then the main test suite will find them and include them by default.

joshgoebel commented 3 years ago

The old system was much simpler

It's been gone over many times why we don't currently do that anymore, so I won't repeat that here.

it allowed someone to contribute a simple custom syntax without having to spend 10 times more reading through documents, sifting through examples, etc, than it actually takes to create a syntax. — just add a new syntax to the existing ones in the main repo, no dependencies headaches, none of the problems that affect a detached repository.

I won't say the current system is "as easy as" before but I also don't think it's significantly harder... other than having to maintain your own repo over time (but that is kind of the main point). I think what we have is a lack of/poor documentation issue... if you'd like to help us fill in those cracks (and truly thanks for the feedback here so far!) and improve the docs so they don't require "10 times more reading" that'd be of benefit to everyone I think.

I'm happy to answer any questions on the "how it works" and hopefully my answers above already provided some clarity? It's possible that we could come up with a better name for extra as well; not much thought ever went into that.

We also have a Discord now for live chat: https://discord.gg/M24EbU7ja9

jf990 commented 3 years ago

not our place to force a specific CoC on anyone

OK, I understand. My proposal was to suggest that a CoC should align with the highlight.js CoC; we don't want to have modules attached to the core that diverge. But I don't think we actually established a defined CoC anyway. But let's think this through for future contributions. Should contributors conform to a system-wide code? Would it be ok for a submodule to offer a CoC that was not aligned with the core's?

tajmone commented 3 years ago

@joshgoebel:

You'll have a browser monolith just like the one we include by default, but now including grammar_1 and grammar_2.

Cloning third party syntaxes into the extra/ folder definitely requires a coherent directory structure in those repositories. I did sift through the links you have provided me on various occasions, but I didn't find their instructions too clear (they seem more like info fragments scattered around than a systematic presentation of the whole picture).

I won't say the current system is "as easy as" before but I also don't think it's significantly harder... other than having to maintain your own repo over time (but that is kind of the main point).

It depends on what kind of syntaxes you work with. E.g. I work mostly with Interactive Fiction (text adventures) languages, which are hardly candidates for inclusion in the main repository, since their userbase is a small niche. But there are lots of these syntaxes, which means having to juggle lots of different repositories and editor projects, and having to multiply maintenance updates whenever some commonly shared asset requires tweaking across the repositories (e.g. a CI tool, a linter, etc.).

If at least I could pack all those Interactive Fiction languages under a single repository it would make my life much easier, and it would probably make more sense too, since whoever is interested in any one of them is most likely also interested in the genre in general. Furthermore, instead of having many individual Interactive Fiction syntaxes scattered around, developers could join efforts in a collaborative umbrella project targeting the genre; after all, text adventure languages all belong to a common family of related tools.

Also, I need syntax highlighting mostly for AsciiDoc or Pandoc based documentation and book projects, which often involve complicated toolchain integrations (e.g. using multiple highlighters to cover all needed syntaxes), and usually avoiding having to include a JS library (HLJS often fails on very long single-page documents with lots of code, and doesn't work at all on GitHub & BitBucket HTML Previews, so it's best used as a CLI tool for highlighting at conversion time).

I think what we have is a lack of/poor documentation issue... if you'd like to help us fill in those cracks

I would love to, and would have done so (as well as propose a GH template myself) if I were in a position where I understood the big picture myself. Eventually I'll get there, and then I'll be able to help out (after all, working with documentation is my daily bread).

I'm not a Node/JS expert, though. But the reason I've procrastinated for so long on updating my syntax repo has more to do with bad timing when the new system came into being, because that year I had to deal with major health issues which kept me out of touch with much of what was happening in the FOSS world; and then later the COVID emergency badly impacted my working life, again affecting how much free time I can spend on any given project. Ultimately I'm spread so thin, with so many pending tasks to catch up on, that I simply can't find enough time to read properly through all the docs I'd need to.

Anyhow, I'll try to get on par with the developments I missed.

jf990 commented 3 years ago

@joshgoebel

I think it's hard to tell when language is meant literally vs as a placeholder... we may need a fictitious grammar name

I was using language for this purpose, and I agree with you it is not a good choice. I'm almost OK with going with your suggestion pascal, or the original xyz, or anything else. It should be obvious everywhere you see that identifier that you should replace it with your own language name. Let's decide what keyword is most obvious for this purpose and I'll get it in the next update.

Currently this doesn't provide a default/sample README

It was absolutely my intent to provide this. I wanted to get through a first pass review of the layout and make any suggested changes, then go back and finish the README. A lot of what we are going over here would require significant changes to an early cut of the README, so it would be good to make these decisions first.

jf990 commented 3 years ago

On Testing

There are a few things to consider in my proposal. Offering up a test framework that works on install and requires no effort is going to be helpful. We should note in the README that you don't have to use it.

I've developed 3 grammars and have 2 more in development. I do not find the current test harness helpful for grammar development. It basically gives the 3rd party developer 1 test case, and if you want to make it work for developing a grammar then you end up overloading that single test case with as much coverage as you can. Then when you run the test you are forced to sit through and sift through 180 other grammar tests and core tests, all of which have nothing to do with your test, and overall it just slows you down. Then in subsequent grammar updates you don't have a way to focus on testing just your update, isolating regressions, bugs and new features; you have to test everything with just 1 test case.

The core tests are helpful and necessary for the core team and overall integration and maintenance, so I am not at all suggesting any changes, and the template must absolutely make creating and maintaining these tests as simple as possible for the 3rd party grammar developer. So whatever the template does, it must set up and support this test system out of the box.

What I am proposing is an additional test framework intended for the 3rd party developer to focus on just their specific testing requirements in the local environment. One of these tests is a clone of the exact test that runs when integrated in core. As was noted somewhere else, I did not set that up exactly right: the folder layout, file names, and contents should all exactly match the requirements. I'll update that. But it also gives the developer the opportunity to write more than one test and have tests that focus on specific test cases for their development style. Again, it does not have to be used to succeed, but for getting started quickly it seems to be helpful. This wasn't even my idea initially; I saw many other grammars had their own test system, so I think it has value.
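A rough sketch of what such a focused, local test loop could look like follows. It is not the code from the proposed template: `runCases` and `fakeHighlight` are hypothetical names, and the highlighter is injected so that the example runs without highlight.js installed.

```javascript
// Runs named cases against any highlight function and collects the failures.
// With highlight.js, `highlight` could be something like:
//   (code) => hljs.highlight(code, { language: 'your-language' }).value
function runCases(highlight, cases) {
  const failures = [];
  for (const { name, input, expected } of cases) {
    const actual = highlight(input);
    if (actual !== expected) failures.push({ name, expected, actual });
  }
  return failures;
}

// Stand-in highlighter for demonstration only: wraps the word "room" in a span.
const fakeHighlight = (code) =>
  code.split('room').join('<span class="hljs-keyword">room</span>');

const failures = runCases(fakeHighlight, [
  {
    name: 'keyword',
    input: 'room kitchen',
    expected: '<span class="hljs-keyword">room</span> kitchen',
  },
  { name: 'plain text', input: 'kitchen', expected: 'kitchen' },
]);
console.log(failures.length === 0 ? 'all cases passed' : failures);
```

Each case has a name, so a regression in one feature shows up on its own instead of being buried in a single large expected-output file.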

tajmone commented 3 years ago

@jf990:

What I am proposing is an additional test framework intended for the 3rd party developer to focus on just their specific testing requirements in the local environment.

I'm absolutely in favor of this, since each syntax poses its own challenges, and having extra flexibility to carry out extensive tests without burdening the main repo test suite is a good solution.

The main README should clarify the difference between the official test folder and files for integration in the main HLJS repo, on the one hand, and the additional (and optional) custom tests on the other, maybe providing instructions for the latter in a separate document (a README inside the folder?).

joshgoebel commented 3 years ago

If at least I could pack all those Interactive Fiction languages under a single repository it would make my life much easier

This is trivially simple and fully supported by the current system. One repo, all grammars in src/languages... there is no limit to one grammar per repo.
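For illustration, two small grammars kept in one repository might look like the sketch below. The grammar names, keywords and file paths are invented for this example; in a real repository each function would live in its own file under src/languages and be exported from it. The built-in modes used here (`QUOTE_STRING_MODE`, `C_LINE_COMMENT_MODE`, `HASH_COMMENT_MODE`) are provided by the library.

```javascript
// Hypothetical contents of src/languages/adventure-one.js
function adventureOne(hljs) {
  return {
    name: 'Adventure One',
    keywords: { keyword: 'room object verb', literal: 'true false' },
    contains: [hljs.QUOTE_STRING_MODE, hljs.C_LINE_COMMENT_MODE],
  };
}

// Hypothetical contents of src/languages/adventure-two.js
function adventureTwo(hljs) {
  return {
    name: 'Adventure Two',
    case_insensitive: true,
    keywords: { keyword: 'location thing action' },
    contains: [hljs.QUOTE_STRING_MODE, hljs.HASH_COMMENT_MODE],
  };
}

// Both grammars can be registered from the same repository.
function registerAll(hljs) {
  hljs.registerLanguage('adventure-one', adventureOne);
  hljs.registerLanguage('adventure-two', adventureTwo);
}
```

Shared settings such as CI and linting then only need to be maintained once for the whole family of grammars.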

HLJS often fails on very long single-page documents with lots of code

I'd love to see an issue on this and what "fails" means.

doesn't work at all on GitHub & BitBucket HTML Previews

Can't speculate here without more information.

major health issues ... and then later the COVID emergency

Well we're glad you're still with us and hopefully things are calming down a little for ya.

joshgoebel commented 3 years ago

But let's think this through for future contributions. Should contributors conform to a system-wide code? Would it be ok for a submodule to offer a CoC that was not aligned with the core's?

I'll discuss with the core team, but generally I tend to avoid inventing problems that don't exist. Issues like that (an 'inappropriate/incompatible CoC') can typically be dealt with on a one-off basis.

tajmone commented 3 years ago

HLJS often fails on very long single-page documents with lots of code

I'd love to see an issue on this and what "fails" means.

It means that with very long single-page documents it often fails to highlight all code blocks (it hangs). I'm not sure I can still provide a link to this, since when this happened I always switched to a static highlighter to fix the issue at the root. This can easily depend on the browser being used, or some machine-specific issues, I don't know; I just noticed that this happened often whenever a single document started to grow in size and code blocks beyond a certain threshold (which I'm unable to quantify).

doesn't work at all on GitHub & BitBucket HTML Previews

Can't speculate here without more information.

Here's a live example:

https://htmlpreview.github.io/?https://github.com/alan-if/alan-docs/blob/master/manual/manual.html

from the repository:

https://github.com/alan-if/alan-docs/tree/master/manual

Probably it's due to HTML Preview using frames. The point is that often these Live Preview links are all that you can rely on, since GitHub doesn't natively offer HTML previewing.

That's also a fairly large (not huge, though) document to test whether HLJS hangs (e.g. if you occasionally need to refresh the document to get proper highlighting). But, from what I remember, the hanging issue was easier to find in documents that contained many different syntaxes, whereas this one uses just one syntax.

joshgoebel commented 3 years ago

very long single-page documents

"very long" isn't super helpful if you can't quantify it. :-)

it often fails to highlight all code blocks (it hangs)

You might see:

https://github.com/highlightjs/highlight.js/security/advisories/GHSA-7wwv-vh3v-89cq

Also there have been a few infinite loop issues resolved in the past. But today we should not ever "hang". It's still of course possible there are regex issues unaccounted for, but I wonder if the issues you are referring to haven't been resolved with this security update. I suppose for a huge document it might take some time to highlight (and then for the browser to render)... but that is not a hang, that's a delay - and should be linear with the size of the document, etc...

If you find an example of this you should definitely file an issue as I'd be very curious to have a look.

joshgoebel commented 3 years ago

Probably it's due to HTML Preview using frames.

We shouldn't have any issue with frames, but obviously you'd need to load us in the frame with the actual content that needs to be highlighted.

jf990 commented 3 years ago

I need to commit to a placeholder for the language. I proposed language which was rejected above. Anyone have any thoughts on this?

jf990 commented 3 years ago

I reviewed a few other template repos, it seems a lot are using README to explain how the template works and also including a BLANK_README.md file with placeholder text for the intended user to alter to suit their project. It is documented for the user to delete README, update BLANK_README and then rename BLANK_README to README. I'm going with this for the next update.

jf990 commented 3 years ago

I also wanted to point out, regarding the interchangeable use of language vs. grammar vs. syntax: the suggestion above is to use syntax instead of language, however the actual supporting documentation uses language and I purposely chose to align with the existing documentation.

However we decide to go, this effort should also work to align all of this documentation.

joshgoebel commented 3 years ago

Very much dislike "syntax". I never say that and don't want to start. :-)

Both language and grammar are correct in their own ways. You build a grammar to support a language. We have 180+ grammars and therefore support 180+ different languages. All the 3rd party repos I created for people have both in the description:

"robots.txt - a language grammar for highlight.js"

I think the fact that mostly we use "language" everywhere is OK. Practically speaking in my mind the terms are often loosely interchangeable. "language definition" also isn't terrible though.

tajmone commented 3 years ago

I think the fact that mostly we use "language" everywhere is OK. Practically speaking in my mind the terms are often loosely interchangeable. "language definition" also isn't terrible though.

I don't agree on this, Markdown, HTML, CSS, etc., are not languages but syntaxes (even though the L in HTML stands for Language, which has been one of the main roots of this confusion, since many self-made web developers started calling themselves "HTML programmers"). Usually when the CV for a job application mentions HTML and CSS in the list of "past programming languages experiences," that's when the CV hits the bin.

The purpose of HLJS documentation should be to dispel confusion, not add to it — end users are already trying to represent one target syntax using another language (JS) as a vehicle to present it (including via RegEx notation). Having some clear terminology in place would be helpful. By using "target syntax", for example, one can avoid doubts about which syntax/language (or grammar) one is referring to.

joshgoebel commented 3 years ago

I don't agree on this, Markdown, HTML, CSS, etc., are not languages but syntaxes

We'll agree to disagree then. I think perhaps you mean they aren't "programming languages" - but then you're attempting to say a thing is not a language if it isn't a "programming language", which is patently false. HTML (by its very definition - as you point out) is a language... Markdown is a markup language. English is a spoken language. Something does not have to be Turing complete to be a language.

Wikipedia:

Cascading Style Sheets (CSS) is a style sheet language used for describing the presentation of a document written in a markup language such as HTML.[1] [emphasis mine]

Even if I were to agree - the fact that we still have ~180 programming languages and only a few "non-languages" (by your own definition) is not a valid reason to change our nomenclature to use different wording.

jf990 commented 3 years ago

All other alternatives aside, going with language aligns well with the existing documentation. We put the grammar implementation in the src/languages folder, we have the language contributor check list, etc., so use of language is fairly ubiquitous across the docs. Using a different term may not be wise to understanding how to contribute, not everyone here has English as the primary language, we should be precise and consistent with our terminology. When we talk about the thing you are contributing, it is a language, although it is acceptable to use grammar or syntax when conceptual reference is being made to what purpose the language serves.

Let's stick with language for this update and if we don't like it we can suggest changes with subsequent updates and PR reviews.

jf990 commented 3 years ago

@joshgoebel @tajmone @highlightjs/core

I updated https://github.com/jf990/highlightjs-language-template to address most of what we discussed above. I chose your-language as the dummy language name, it's a fairly obvious and easy to search and replace token.
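As a small illustration of why a single distinctive token helps, the sketch below swaps the placeholder for a real language name in a piece of template text. The helper name `applyLanguageName` and the sample path are hypothetical; only the `your-language` token comes from the template described above.

```javascript
// Replaces every occurrence of the template's placeholder token.
// split/join is used so the placeholder is treated as literal text, not a regex.
function applyLanguageName(text, languageName, placeholder = 'your-language') {
  return text.split(placeholder).join(languageName);
}

console.log(applyLanguageName('src/languages/your-language.js', 'robots-txt'));
// prints "src/languages/robots-txt.js"
```

A token like this is unlikely to appear in ordinary prose, so a global replace should not touch sentences that use the word "language" literally.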

There are still a bunch of downstream dependencies once we decide to go with this. I'll finish updates to #3042, propose some changes to https://github.com/highlightjs/highlightjs-robots-txt (the test subfolders require the language name path component in order to work inside extra), and update some of my other language repos to conform to this layout. We may also decide to update https://github.com/highlightjs/highlight.js/blob/main/.github/ISSUE_TEMPLATE/language-request.md and some other supporting resources. But getting this right is a precursor to all of that.

tajmone commented 3 years ago

but then you're attempting to say a thing is not a language if it isn't a "programming language", which is patently false.

Yes, this was my implicit intention (i.e. you can't program in English or Markdown). Since these instructions are targeting computer programmers, for the specific task of programming a new syntax module, it seems reasonable to assume (on all sides) that the term "language" is here used in its computer engineering meaning, unless otherwise specified.

But the point was all about potential confusion in the documentation, where it might apply, not for the sake of semiotics. Hence my suggestion to add "target" before language, in contexts where there might be ambiguity (if we were to stick to "language" as the general term to indicate all syntaxes).

In some key places, simply replacing "language" with "language/syntax" might also clarify the issue (IMO).

joshgoebel commented 3 years ago

...for the specific task of programming a new syntax module, [emphasis mine]

But are they programming? 😁 Many grammars are nothing more than static definition - no code at all. They could just as well be YAML or JSON - in fact with the PHP port of Highlight.js they are fully static. Or do you lump regex itself into the category of programming and therefore that is why writing grammars is "programming". :-)
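To illustrate the "static definition" point, here is a sketch of a grammar written purely as data. The grammar itself is invented for this example; it relies on the fact that `begin` and `end` patterns may be given as strings rather than RegExp objects, which is what lets the object survive a JSON round trip.

```javascript
// A grammar written purely as data: patterns are strings, not RegExp objects.
const staticGrammar = {
  name: 'Static Example',
  keywords: { keyword: 'allow disallow' },
  contains: [
    { className: 'comment', begin: '#', end: '$' },
    { className: 'number', begin: '\\b\\d+\\b' },
  ],
};

// It holds no functions or RegExp objects, so it survives a JSON round trip.
const roundTripped = JSON.parse(JSON.stringify(staticGrammar));
console.log(JSON.stringify(roundTripped) === JSON.stringify(staticGrammar));
// prints true
```

To use such an object with the library it would still need to be wrapped in a function, for example `() => staticGrammar`, since language definitions are registered as functions.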

Hence my suggestion to add "target" before language, in contexts where there might be ambiguity (if we were to stick to "language" as the general term to indicate all syntaxes).

I really have never followed your issue here. How about one or two specific examples of where you feel using "target language" vs "language" would be an improvement?

tajmone commented 3 years ago

But are they programming?

Yes, end users are going to write their HLJS syntax definitions in JavaScript, which is a programming language, by all means. The term "syntax definition" in this context should mean the JS module that adds to HLJS the capability to highlight a new syntax/language (i.e. the "target syntax/language", from the developer perspective, since that is what he/she is planning to implement). The term "grammars" is somewhat vague in this context, since no BNF/EBNF grammars are involved.

joshgoebel commented 3 years ago

But are they programming?

Yes ... in JavaScript, which is a programming language

I think you entirely missed my point. Surely I can write a long novel inside a large string inside a .js file... but that is not the activity of programming, even though I'm using a programming language.

If writing JSON/YAML (structured data) is programming then surely HTML and CSS is also programming. 🤪

The term "syntax definition" in this context should mean the JS module that adds to HLJS the capability to highlight a new syntax/language (i.e. the "target syntax/language", from the developer perspective

I mean show me an exact before/after situation in our existing documentation where you think adding "target" would help. I'm just not getting it in the abstract, sorry. Need context.

Nezteb commented 11 months ago

I just noticed that 3RD_PARTY_QUICK_START.md links to a non-existent highlightjs/highlightjs-language-template repo. I know there is a "(this isn't ready yet!)" disclaimer, but in that case the link might as well not be there? If anything, I think it should link to @jf990's repo: https://github.com/jf990/highlightjs-language-template 😅