Language tagging - Githubissues

GoogleCodeExporter commented 8 years ago

Creating this issue to spec out language tagging requirements.

The feature would allow some way of specifying the language of the block so
that it can be tokenized and highlighted appropriately.

CURRENTLY
=========
There are 2 lexers: one for C-style languages, and one for markup languages.

The C-style lexer does a decent job on some of the most commonly used
languages (incl. Python and Bash, but excl. Lisps and BASICs), and the
markup one handles XML, HTML, and various HTML-like templating languages.

The current lexing scheme allows descent into tokens with a different lexer.

REASONS FOR
===========
We do not handle other languages, notably VB, Perl, and OCaml, and cannot
without significant work.  Determining the language of a snippet is hard,
and getting it wrong would make the library less reliable and less useful
for the languages it currently supports.

The keyword list for C-style languages is a union of the keywords from all
the languages I've tested with.  It misidentifies some tokens, e.g.
"template", as keywords even though they are not keywords in many languages.

Some languages (Java) have consistently observed naming conventions that
distinguish types, fields, locals, and constants.  Those conflict with
common naming conventions in e.g. C++.

REASONS AGAINST
===============
(1) Bloats code, due to lists of keywords and code for languages not used.
Could be mitigated by some kind of inheritance of definitions, or by
splitting into files.

(2) Complexity of install.  Mitigating (1) by splitting into multiple files
would make it harder to install.  Currently there is only one file to deal
with.

(3) Complexity of use.  Currently the API is very simple.  Could mitigate
by falling back to the existing behavior if no language is specified.

GOAL
====
Provide optional language tagging without bloating code.  Preference is
given to simplicity of use, so we will retain the one-file-to-install
property.

DESIGN
======
The current scheme is complicated by the fact that we highlight around
tags, so that if the source includes links around class names, those are
preserved in the prettified output.

Instead of preserving those in stream as first-class tokens, we will
extract those out, keeping their position in the original stream so they
can be reinserted later.

This will let us eliminate the current state machines which take a lot of
code, in favor of regular expressions.
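The extract-and-reinsert idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `extractTags` and `reinsertTags` are hypothetical names, the tag pattern is deliberately simplistic, and the sketch only round-trips the text rather than merging tags into highlighted output.

```javascript
// Hypothetical helper: pull tags out of the source HTML, remembering where
// each one sat relative to the tag-free text.
function extractTags(html) {
  const tags = [];               // entries are [tag, positionInText]
  let text = '';
  let last = 0;
  const tagPattern = /<[^>]*>/g; // simplistic: ignores '>' inside attribute values
  let m;
  while ((m = tagPattern.exec(html)) !== null) {
    text += html.slice(last, m.index);
    tags.push([m[0], text.length]);
    last = m.index + m[0].length;
  }
  text += html.slice(last);
  return { text, tags };
}

// Hypothetical helper: put the tags back at their recorded positions.
function reinsertTags(text, tags) {
  let out = '';
  let last = 0;
  for (const [tag, pos] of tags) {
    out += text.slice(last, pos) + tag;
    last = pos;
  }
  return out + text.slice(last);
}

const sample = 'new <a href="Foo.html">Foo</a>();';
const extracted = extractTags(sample);
console.log(extracted.text);
console.log(reinsertTags(extracted.text, extracted.tags) === sample);
```

Because the lexer only ever sees `extracted.text`, it no longer needs to treat tags as tokens, which is what makes plain regular expressions sufficient.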

We can inherit keyword lists by using one keyword list as the prototype of
another.
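A minimal sketch of what prototype-based keyword inheritance might look like. The keyword lists here are made up for illustration (the real lists are not given in this issue), and `Object.create` is used for brevity; `isKeyword` is a hypothetical helper.

```javascript
// Hypothetical base keyword list shared by C-style languages.
const C_KEYWORDS = { 'if': true, 'else': true, 'for': true, 'while': true, 'return': true };

// Object.create makes C_KEYWORDS the prototype of the new object, so
// lookups that miss on CPP_KEYWORDS fall through to the C list.
const CPP_KEYWORDS = Object.create(C_KEYWORDS);
CPP_KEYWORDS['template'] = true;
CPP_KEYWORDS['class'] = true;

// Comparing against true avoids treating inherited Object.prototype
// members such as "constructor" as keywords.
function isKeyword(keywords, word) {
  return keywords[word] === true;
}

console.log(isKeyword(CPP_KEYWORDS, 'if'));
console.log(isKeyword(C_KEYWORDS, 'template'));
```

Only the words a language adds need to be listed, so "template" can be a keyword for C++ without being one for every C-style language.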

Original issue reported on code.google.com by mikesamuel@gmail.com on 15 Aug 2007 at 6:54

GoogleCodeExporter commented 8 years ago

Language tags should be easy to recognize and remember.

Since we use class="prettyprint" to identify regions to prettyprint, I suggest
the following convention:

class="prettyprint"  -- make a best guess as to language
class="prettyprint lang-java"  -- do java prettyprinting

The "lang-" prefix is followed by the filename extension commonly used for
source files in that language, to avoid problems with names such as C# not
being valid HTML identifiers.  We will use cc for C++ since it is a valid
identifier, and more commonly used than cpp or cxx.

Original comment by mikesamuel@gmail.com on 15 Aug 2007 at 6:57

GoogleCodeExporter commented 8 years ago

To flesh out the high-level design, the prettify loop will be changed to:
(1) Extract tags and store [tag, position-in-string]
(2) Use a regex based lexer to lex the string sans tags
(3) Run a classifier over tokens
(4) Merge tags back into token list and join tokens to produce html
replacing the current loop:
(1) Split into chunks of tags | text
(2) Split text chunks into tokens using a state machine over a character
iterator that unescapes entities lazily
(3) Join token list to produce html

This will cut out the hand-coded state machines that iterate over characters,
replacing them with the regex-based lexers from step (2) of the new loop.

We can then define a language handler as a { lexer, classifier } pair.
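As a rough sketch of the { lexer, classifier } pairing, assuming an ordered list of anchored regular expressions as the lexer definition: `makeRegexLexer`, `makeKeywordClassifier`, `cStyleHandler`, and `tokenize` are hypothetical names chosen for this example, not the library's API, and the rules are far smaller than a real C-style handler would need.

```javascript
// Hypothetical: build a lexer from an ordered list of [kind, RegExp] rules.
// Each RegExp is anchored with ^ so it only matches at the current position.
function makeRegexLexer(rules) {
  return function lex(source) {
    const tokens = [];
    let rest = source;
    while (rest.length > 0) {
      let kind = 'plain';
      let text = rest.charAt(0); // fallback: consume a single character
      for (const [ruleKind, pattern] of rules) {
        const m = pattern.exec(rest);
        if (m !== null && m[0].length > 0) {
          kind = ruleKind;
          text = m[0];
          break;
        }
      }
      tokens.push({ kind, text });
      rest = rest.slice(text.length);
    }
    return tokens;
  };
}

// Hypothetical classifier: relabel words that appear in a keyword set.
function makeKeywordClassifier(keywords) {
  return function classify(tokens) {
    return tokens.map((t) =>
      t.kind === 'word' && keywords.has(t.text) ? { kind: 'keyword', text: t.text } : t);
  };
}

// A language handler is just the pair of the two.
const cStyleHandler = {
  lexer: makeRegexLexer([
    ['comment', /^\/\/[^\n]*/],
    ['string', /^"(?:[^"\\\n]|\\.)*"/],
    ['word', /^[A-Za-z_]\w*/],
    ['space', /^\s+/],
  ]),
  classifier: makeKeywordClassifier(new Set(['if', 'else', 'return'])),
};

function tokenize(handler, source) {
  return handler.classifier(handler.lexer(source));
}

console.log(tokenize(cStyleHandler, 'return x; // done'));
```

Keeping the classifier separate from the lexer means two languages could share lexing rules and differ only in how tokens are labeled.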

Define a language handler for C-style langs and one for markup langs to stay
backwards compatible.

Modify the main prettify function to look for a lang-\w+ class, and, if present,
choose the appropriate lexer.
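One possible shape for the class lookup, as a sketch only: `languageFromClass` and `chooseHandler` are hypothetical names, the handlers are stand-in strings, and falling back to the best-guess handler for an unrecognized extension is an assumption rather than something decided in this issue.

```javascript
// Hypothetical: find a whole "lang-xxx" class name and return the extension.
function languageFromClass(className) {
  const m = /(?:^|\s)lang-(\w+)(?=\s|$)/.exec(className);
  return m ? m[1] : null;
}

// Hypothetical: pick a registered handler, or fall back to the best guess
// when no language is given or the extension is unknown (an assumption).
function chooseHandler(className, handlers, guessHandler) {
  const lang = languageFromClass(className);
  return lang !== null && handlers.has(lang) ? handlers.get(lang) : guessHandler;
}

// Stand-in handlers; real ones would be { lexer, classifier } pairs.
const handlers = new Map([
  ['java', 'c-style handler'],
  ['cc', 'c-style handler'],
  ['html', 'markup handler'],
]);

console.log(chooseHandler('prettyprint lang-java', handlers, 'best guess'));
console.log(chooseHandler('prettyprint', handlers, 'best guess'));
```

With no lang- class present the existing behavior is kept, which matches the goal of not complicating the API for current users.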

Implement a Lisp/Scheme lexer to demonstrate that new handlers can be added,
and document how to add them.

Implement other lexers as demanded.

Original comment by mikesamuel@gmail.com on 15 Aug 2007 at 8:43

GoogleCodeExporter commented 8 years ago

Finished rewriting the existing lexers to use PR_createSimpleLexer, which is
regexp based.

Original comment by mikesamuel@gmail.com on 31 Aug 2007 at 8:49

GoogleCodeExporter commented 8 years ago

I realize this would be an entirely different thing, but what about taking
advantage of a library of pre-written syntax highlighting rules, like Vim's?
The syntax-defining commands aren't that complicated. (Well, they don't seem
to be, but what do I know?)

Original comment by partda...@gmail.com on 7 Feb 2008 at 11:34

GoogleCodeExporter commented 8 years ago

@r38

Original comment by mikesamuel@gmail.com on 5 Jul 2008 at 4:04

Changed state: Fixed

nbell12 / google-code-prettify

Language tagging #17