killercup opened this issue 10 years ago (status: Open)
Excellent.
It seems like highlights itself supports language definitions either as TextMate grammar or as CommonJS packages with grammar files (cson/json).
Good.
But most themes for Atom are written in LESS. The syntax highlighting definitions look like this: index.less.
Probably.
The main highlighter class is quite short (<100 lines), see highlights.coffee. Similar code would have to be written to create a transform method which can be used by grock to highlight vinyl file objects.
It may be possible to detect and extract comments in code by adding a check in the loop to render tokens in scopes, as a comment would be a specific scope. That way, the beginning of a comment could be used to start a new file segment (each segment contains arrays of comments and code lines).
E.g., in the Lua grammar, the scopes comment.block.lua and comment.line.double-dash.lua are defined. Using these (and the prefixes punctuation.whitespace.comment and punctuation.definition.comment.lua), one should be able to correctly separate comments from Lua code.
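The scope-based splitting idea can be sketched in plain JavaScript. This is a hypothetical illustration, not grock's actual code; it assumes token lines in the { value, scopes } shape that first-mate/highlights produce:

```js
// Hypothetical sketch: split tokenized lines into comment/code segments
// by inspecting token scopes, as described above.
function isComment(token) {
  return token.scopes.some(function (scope) {
    return scope.indexOf('comment') === 0 ||
           scope.indexOf('punctuation.definition.comment') === 0 ||
           scope.indexOf('punctuation.whitespace.comment') === 0;
  });
}

function splitSegments(tokenizedLines) {
  var segments = [];
  var current = { comments: [], code: [] };
  tokenizedLines.forEach(function (line) {
    var lineIsComment = line.length > 0 && line.every(isComment);
    // A comment line arriving after code starts a new segment.
    if (lineIsComment && current.code.length > 0) {
      segments.push(current);
      current = { comments: [], code: [] };
    }
    (lineIsComment ? current.comments : current.code).push(line);
  });
  segments.push(current);
  return segments;
}
```

A segment here holds arrays of comment lines and code lines, matching the description above.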
I just implemented code splitting and highlighting with highlights. It's still quite rough around the edges but it renders grock's own documentation (js/coffee/md, thin style). Code is in the feature/14-highlights branch.
| Test | highlights | highlight.js |
| --- | --- | --- |
| Time | 2.49s | 1.90s |
| Output size (zip) | 1.2MB (108kB) | 339kB (84kB) |
Considering grock takes approx. 1.1s to load on my machine, highlights seems to need at least 70% more time (times are averages of ~20 runs each).
Also worth noticing: the output size of the complete docs directory (incl. assets) is 339kB (84kB as zip) for highlight.js and 1.2MB (108kB as zip) for highlights. The latter seems to be putting everything it knows into the markup, producing truckloads of nested span tags.
While fiddling around with the code on this branch, trying to see how much of the highlight.js and lib/languages.coffee code I could remove and still have it run, I stumbled upon the fact that removing it seemed to speed things up quite a bit.
I ran some tests to get some numbers. Like @killercup's, my tests are based on rendering Grock's documentation. I used the following command to generate the docs:
```sh
./bin/grock --style=thin
```
Running the command outputs info about the build like so:
```
[grock] Beginning to process 523 ms
[grock] Writing to docs/
[grock] Done. Generated in 1.03 s
```
We're only interested in the amount of time it took to generate, so I used the following awk code to extract the number of milliseconds it took to generate the documentation from Grock's output:
```sh
awk '/Done. Generated in/ { print $5 * ($6 == "s" ? 1000 : 1) }'
```
I wanted to be able to compare my results with those which @killercup derived, so I've also averaged 20 iterations.
Here's the full code I used for the tests:
```sh
cat /dev/null > /tmp/grock.out.txt;
for i in $(seq 1 20); do
  ./bin/grock --style=thin | awk '/Done. Generated in/ { print $5 * ($6 == "s" ? 1000 : 1) }' | tee -a /tmp/grock.out.txt;
done;
awk 'NR == 1 { max=$1; min=$1; sum=0 }
  { if ($1>max) max=$1; if ($1<min) min=$1; sum+=$1;}
  END {printf "Min: %d ms.\tMax: %d ms.\tAverage: %.1f ms.\n", min, max, sum/NR}' /tmp/grock.out.txt;
rm -f /tmp/grock.out.txt
```
NOTE: The Min/Max/Avg snippet is from an answer to "Is there a way to get the min, max, median, and average of a list of numbers in a single command?" on Unix & Linux Stack Exchange.
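For reference, the extraction and the min/max/average summary can also be written in plain JavaScript. This is a hypothetical equivalent of the awk pipeline above, not code from grock:

```js
// Parse grock's "Done. Generated in" line into milliseconds, mirroring
// the awk one-liner: a value followed by "s" is scaled by 1000.
function parseGeneratedMs(line) {
  var m = /Done\. Generated in ([\d.]+) (m?s)/.exec(line);
  return m ? parseFloat(m[1]) * (m[2] === 's' ? 1000 : 1) : null;
}

// Reduce a list of runs to the same Min/Max/Average summary.
function stats(values) {
  var sum = values.reduce(function (a, b) { return a + b; }, 0);
  return {
    min: Math.min.apply(null, values),
    max: Math.max.apply(null, values),
    avg: sum / values.length
  };
}
```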
Running the script on the master branch, using highlight.js, gives the following output:
```
1000
976
1050
1020
965
958
962
998
1030
1050
1020
991
959
999
966
961
1040
1030
999
1050
Min: 958 ms. Max: 1050 ms. Average: 1001.2 ms.
```
On @killercup's machine this value was 1.90s, or 1900ms, giving a k value between the two machines of:

1900 ms / 1001.2 ms ≈ 1.898
My next test was on the feature/14-highlights branch, which uses atom/highlights but also still requires highlight.js:
```
1040
1010
986
1010
1000
1040
994
993
997
991
1280
1340
1030
1050
1110
1200
1090
1070
1020
1050
Min: 986 ms. Max: 1340 ms. Average: 1065.0 ms.
```
If we multiply 1065.0 ms by the k value from above,

1065.0 ms × 1.898 ≈ 2021 ms

we get 2021 ms, which is only 81% of @killercup's measurement of 2490 ms. I would have assumed the values would be closer, so I can only speculate as to why there is such a big discrepancy (my tests are writing to an SSD, and atom/highlights is generating 3–4× more data, so it may be due to file I/O).
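Spelling the normalisation arithmetic out (illustrative only, using the figures above):

```js
// The machine-speed factor k maps timings from this machine onto
// @killercup's machine, using the highlight.js runs as the baseline.
var k = 1900 / 1001.2;        // killercup's ms / this machine's ms
var projected = 1065.0 * k;   // the highlights run, scaled up

console.log(k.toFixed(3));                             // "1.898"
console.log(Math.round(projected));                    // 2021
console.log(Math.round(projected / 2490 * 100) + '%'); // "81%"
```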
Continuing on, I stripped out highlight.js from the branch using the following steps:

- package.json, rm l. 46 # "highlight.js": "~8.0.0",
- index.coffee, rm l. 1 # getLanguage: require('./getLanguage')
- lib/languages.coffee
- lib/generator.coffee, rm l. 55 # .pipe t.getLanguage()

Running the tests again gives:
```
952
983
943
914
926
915
940
940
941
914
925
960
926
941
953
926
944
948
1010
991
Min: 914 ms. Max: 1010 ms. Average: 944.6 ms.
```
This gives the fastest generation times yet.
For one last test, I upgraded the atom/highlights package in package.json from 0.7.0 to the latest version, 1.2.0:
```
1130
1110
1110
1150
1110
1110
1130
1100
1110
1120
1140
1110
1180
1110
1130
1110
1130
1110
1090
1090
Min: 1090 ms. Max: 1180 ms. Average: 1119.0 ms.
```
This yields the slowest generation times yet. I wonder what's happened to atom/highlights to make the latest version 175 ms slower?
I'll run some more tests. It's probably worth trying a few of the more recent versions of atom/highlights to see if they're all as slow as version 1.2.0, and whether there's been a trend over time. I'd also like to see how hard it would be to modify atom/highlights so the renderer didn't output a sea of superfluous span tags.
If atom/highlights ends up being peppy enough, and some of its output tags can be trimmed back, then it might be worth re-opening this issue.
@jonruttan, thank you very much for your wonderful investigation!
If using highlights increases the compile time by 20% but makes grock's code much cleaner and easier to extend (e.g. by having the option to include custom language specifications from Atom), we should use it. Far more concerning to me is the amount of output highlights generates.
(What follows is speculation. I may be completely wrong. I haven't looked at the code in about a year.)
I noticed that transforms/splitCodeAndComments as well as transforms/highlight seem to call highlights to parse a file (or parts of it): the first to split code from comments (as the name suggests), the second to highlight the code. Ideally, a file should only be tokenized once, then split into segments, and then the sub-trees of tokens should be rendered.
The biggest problem with this will be that breaking code at seemingly arbitrary points can lead to invalid highlighting.
Do you think it might be possible to use highlights in such a way that we can insert split points that preserve the scope the code is currently in? Essentially, I'm talking about rewriting this part of highlights' highlightSync method so that instead of an HTML string it returns the segments that grock needs.
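One hypothetical shape for such a rewrite (illustrative only, not highlights' actual API): return segments plus the scope stack that was open at each split point, so each segment can later be rendered independently with its enclosing scopes reopened. This sketch assumes first-mate-style token lines and simplifies by taking the scopes of the last token before the break:

```js
// Hypothetical: segment token lines, recording the open scope stack at
// each split point so rendering a segment can reopen those scopes.
function segmentWithScopes(tokenizedLines, isSplitPoint) {
  var segments = [];
  var current = { openScopes: [], lines: [] };
  tokenizedLines.forEach(function (line, i) {
    if (i > 0 && isSplitPoint(line)) {
      // Approximate the scopes still open at the split with the scopes
      // of the last token before the break (scopes nest left-to-right).
      var prev = tokenizedLines[i - 1];
      var last = prev[prev.length - 1];
      segments.push(current);
      current = { openScopes: last ? last.scopes.slice() : [], lines: [] };
    }
    current.lines.push(line);
  });
  segments.push(current);
  return segments;
}
```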
It may also be interesting to look more deeply at the implementation of first-mate (which provides the grammar registry and parsing highlights uses). Skipping highlights and using first-mate directly might also allow us to skip the dependencies on season, fs-plus and underscore-plus.
You're welcome @killercup, the tests are not a problem. Thanks for maintaining the project.
With this extra info I dug a little further into the code in the feature/14-highlights branch. It turns out that all the work is being done in transforms/splitCodeAndComments, and transforms/highlight isn't being used at all.
I looked into your question about using highlights in such a way that we can insert split points that preserve the scope the code is currently in, and I don't think it looks like it would be too hard.
I also did a little work on reducing the size of the generated documentation. As is, the generated documentation is 1236KiB.
By removing the dispensable span tag being added on Line #114 of transforms/splitCodeAndComments, it shrank to 1052KiB.
With a quick hack to atom/highlights' pushScope method I was able to further reduce it to 1020KiB, and I'm pretty sure that with some more work in that area I can shrink it some more. Beyond that, it should be possible to further modify the way atom/highlights outputs tags to take better advantage of nesting.
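One way to trim the "sea of spans" — an assumption about the general approach, not the actual patch — is to merge adjacent tokens on a line whose scope stacks are identical, so they can share a single span:

```js
// Hypothetical output-size reduction: merge adjacent tokens with
// identical scope stacks so the renderer emits one span, not several.
function mergeAdjacentTokens(line) {
  return line.reduce(function (merged, token) {
    var last = merged[merged.length - 1];
    if (last && last.scopes.join('\n') === token.scopes.join('\n')) {
      last.value += token.value; // same scopes: extend the previous token
    } else {
      merged.push({ value: token.value, scopes: token.scopes.slice() });
    }
    return merged;
  }, []);
}
```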
> I looked into your question about using highlights in such a way that we can insert split points that preserve the scope the code is currently in, and I don't think it looks like it would be too hard.
That's great news, @jonruttan!
Are you going to continue working on this? I would love to see a pull request with your changes! (Even work-in-progress, you can just make it merge to the feature/14 branch on this repo.)
> it shrunk to 1052KiB
Are the 339kB for highlight.js I quoted above still valid? That means the newly generated code will be about 3 times larger. We should recommend using a compressed file system in the Readme ;)
Edit: I totally forgot to reopen the issue earlier.
> Are you going to continue working on this?
I'll definitely continue working on this – there appear to be many upsides to using atom/highlights, and after poking my nose into the code yesterday I'm sure we have a few avenues for remedying the size/performance issues it introduces.
> I would love to see a pull request with your changes! (Even work-in-progress, you can just make it merge to the feature/14 branch on this repo.)
Sure thing. If I can get what I'm working on finished up I'll push these changes out in the next few hours.
> Are the 339kB for highlight.js I quoted above still valid?
No, it's grown a bit to 476KiB, so the newly generated code is a little over 2× larger.
> We should recommend using a compressed file system in the Readme ;)
lol, that's one way of dealing with it.
Hi @killercup. I've made some progress on this front, but I wasn't comfortable pushing my changes to the repo because the hacks I'd made to the code broke too many tests.
Last week I spent some more time analysing the code, and in the end I decided to forego atom/highlights and use atom/first-mate instead. The lack of separation of concerns between the Lexer and the Renderer sections within the Highlights.highlightSync() function is making it too hard to use for our purposes, so I'm splitting the functionality apart.
There is some handy helper code in the Highlights class which loads the TextMate grammar files. Rather than use first-mate directly, I'm breaking that helper code out into a Lexer class. Like first-mate, the Lexer's main function will take text as input but it will just return the tokenized lines instead of rendering them. Complementary to the Lexer, the Renderer's job will be to take tokenized lines as input, and return the rendered output. Between the Lexer and the Renderer, we can split the tokenized lines into two streams, further parse the comments, then render each — or at least that's the plan…I'll keep you posted.
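The Lexer/Renderer split described here could look roughly like the following sketch. The implementations are stand-ins (the real lexer is textlex; the function names here are hypothetical): the lexer returns token lines instead of HTML, and the renderer consumes token lines.

```js
// Stand-in lexer: one token per line with a single dummy scope.
// (textlex would return real tokens from a TextMate grammar.)
function lex(text) {
  return text.split('\n').map(function (line) {
    return [{ value: line, scopes: ['source'] }];
  });
}

// Stand-in renderer: turn token lines into nested spans, converting
// dotted scope names into space-separated class names.
function renderHtml(tokenizedLines) {
  return tokenizedLines.map(function (line) {
    return '<div class="line">' + line.map(function (token) {
      var cls = token.scopes.join(' ').replace(/\./g, ' ');
      return '<span class="' + cls + '">' + token.value + '</span>';
    }).join('') + '</div>';
  }).join('\n');
}
```

Between the two stages, the token lines can be split into comment and code streams, the comments parsed further, and each stream rendered separately — which is the plan described above.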
Thanks for the update, @jonruttan. That sounds like a great plan so far. Don't worry about breaking tests, by the way. A lot of them are just testing the current transforms and you can just remove them (or replace them with equivalent ones for your new transforms).
Hi @killercup, I got the Lexer class done up and published to npm. It's at https://github.com/jonruttan/textlex. No Grock docs for it yet though :wink:
I'll get started on the renderer next. The work for it is pretty much the same as what was needed to get the Lexer published, so it shouldn't take too long.
I haven't looked into all of the details yet, but now that we've got the lexer, we can probably use it to do the Markdown parsing as well. My textlex package has a CLI interface, so you can use that with some Markdown files if you'd like to see some examples of the tokens it outputs.
@jonruttan, sorry it took me two weeks to respond. Your textlex looks quite nice! (And it reminded me that I wanted to compile grock's CoffeeScript sources at npm republish.) I've only played with it for a few minutes and had a glance at the code, though.
> I'll get started on the renderer next. The work for it is pretty much the same as what was needed to get the Lexer published, so it shouldn't take too long.
Have you started with this yet? It might make sense to have a separate module for the renderer as well, it's quite a big part of grock and isolating that (with clean code and better tests) would make grock itself quite simple. (You can just take my code and put it into a new repo if you like, I don't mind.)
Hi @killercup,
> sorry it took me two weeks to respond.
np, I've been really, really busy with some other work for the last two weeks, but your timing is perfect, I was just about to do some work on this.
> Your textlex looks quite nice!
Thanks for the compliment!
> I'll get started on the renderer next. The work for it is pretty much the same as what was needed to get the Lexer published, so it shouldn't take too long.

> Have you started with this yet?
Yes, it's done and just needs to be published.
> It might make sense to have a separate module for the renderer as well, it's quite a big part of grock and isolating that (with clean code and better tests) would make grock itself quite simple. (You can just take my code and put it into a new repo if you like, I don't mind.)
Yeah, I agree, and I've been working towards that. I built the renderer just like atom/first-mate, but in reverse: there's a directory of json/cson renderers in a format which mirrors the TextMate grammars. Right now I have a plain-text and an HTML renderer, but making new formats is really easy. I'll get the code published and then you can have a look. There are still a few more modules I need to complete before we can incorporate it into Grock, though. Luckily they're all pretty trivial — a published module for each renderer with a suite of tests, and a companion module for this new renderer which performs a similar task to what TextLex is doing, but for handling a repository of renderers instead of grammars.
Sounds great, @jonruttan!
Just a quick question: How easy is it for a user to add a custom grammar module? There are a ton of those for Atom, and ideally I'd like to offer users the choice to not just use those we consider worthy but add their own, using e.g. npm install -D lee-dohm/language-r and a bit of config.
> How easy is it for a user to add a custom grammar module?
Trivial, each one is just a serialisable JavaScript object, typically a JSON/CSON file.
The first-mate module has the code for a Grammar, and a GrammarRegistry class. The GrammarRegistry maintains the master list of Lexer grammars, and it's used to search through the grammars to find the best match for the file type being parsed. When the best-match is found, the Registry instantiates a Grammar object initialised with the patterns and rules from the JSON/CSON Grammar file loaded into it, ready to start parsing a file of that type.
First-mate doesn't ship with any Grammars, nor does it maintain any repositories, so one of the jobs the TextLex module performs is to bundle a set of grammars as sub-modules and register them in the GrammarRegistry. Right now it just scans an individual directory for extra grammars — the CLI version allows specifying a custom path to use — but it could easily be changed to search multiple paths, use a hand-curated list, or some combination of the two.
Something else that First-Mate has is Grammar injections, which seem to be Grammars within other Grammars. If I've correctly interpreted what they're meant to do, I think we might be able to get the Lexer to parse the source code and the Markdown comments all in one pass. Seems worth finding out more about how they work, so I'll do that when I've gotten these other jobs done.
All of the modules for the renderer have been finished and published. Now I can start incorporating them into Grock.
You asked earlier:
> How easy is it for a user to add a custom grammar module?
The jonruttan/textlex specs have an example of doing this (the code was inherited from atom/highlights) with language-erlang: https://github.com/jonruttan/textlex/blob/master/spec/textlex-spec.coffee#L37-52
And here's an example using language-r from the CLI:

```sh
npm install -g textlex
npm install -D lee-dohm/language-r
echo "cat('Hello, world\\\\n')" | textlex -i ../node_modules/language-r/grammars -f r
```
```json
[
  [
    {
      "value": "cat(",
      "scopes": [
        "source.r"
      ]
    },
    {
      "value": "'",
      "scopes": [
        "source.r",
        "string.quoted.single.r",
        "punctuation.definition.string.begin.r"
      ]
    },
    {
      "value": "Hello, world",
      "scopes": [
        "source.r",
        "string.quoted.single.r"
      ]
    },
    {
      "value": "\\n",
      "scopes": [
        "source.r",
        "string.quoted.single.r",
        "constant.character.escape.r"
      ]
    },
    {
      "value": "'",
      "scopes": [
        "source.r",
        "string.quoted.single.r",
        "punctuation.definition.string.end.r"
      ]
    },
    {
      "value": ")",
      "scopes": [
        "source.r"
      ]
    }
  ]
]
```
Piping the JSON tokens through my newest module jonruttan/delexe renders them, in this case to HTML:

```sh
npm install -g delexe
echo "cat('Hello, world\\\\n')" | textlex -i ../node_modules/language-r/grammars -f r | delexe -f html
```

Outputs:

```html
<pre class="editor editor-colors"><div class="line"><span class="source r">cat(<span class="string quoted single r"><span class="punctuation definition string begin r">'</span>Hello, world<span class="constant character escape r">\n</span><span class="punctuation definition string end r">'</span></span>)</span></div></pre>
```
Very nice, @jonruttan! I'm looking forward to reading your grock implementation! Let me know if you need any help! (In the meantime, I think I've also mastered Gitter's notification settings, so I'll react to mentions on https://gitter.im/killercup/grock :))
Atom's syntax highlighting stack has been open sourced. This post gives an overview.
Mainly, highlights is of relevance to grock, as it could replace highlight.js.
Important questions to answer: