pygments / pygments

Pygments is a generic syntax highlighter written in Python
http://pygments.org/
BSD 2-Clause "Simplified" License

Add examples to all lexers #2716

Open kdeldycke opened 1 month ago

kdeldycke commented 1 month ago

Illustrate all lexers with their available example files. Also remove all inline examples.

kdeldycke commented 1 month ago

For reference, I used this script and some manual updates to produce this PR:

import ast
import inspect
from pathlib import Path

from pygments.lexers import (
    _fn_matches,
    find_lexer_class_by_name,
    find_lexer_class_for_filename,
    get_all_lexers,
)

# Map aliases to their lexer class.
alias_map = {}
for name, aliases, extensions, mimetypes in get_all_lexers():
    for alias in aliases:
        alias_map[alias] = find_lexer_class_by_name(alias)

for example_dir in Path("./tests/examplefiles").iterdir():
    if not example_dir.is_dir() or example_dir.name == "__pycache__":
        continue

    alias = example_dir.name
    assert alias in alias_map, f"Alias {alias} not found in pygments"

    klass = alias_map[alias]

    # Select the right example file.
    example_files = []
    for f in example_dir.iterdir():
        if f.is_file() and f.suffix != ".output":
            example_files.append(f)

    assert example_files, f"No example files found for {alias} in {example_dir}"

    candidates = []

    # Let's try to find the right example file by matching the lexer class.
    for f in example_files:
        matching_class = find_lexer_class_for_filename(f.name)
        if matching_class == klass:
            candidates.append(f)

    # Try to find the right example file by matching the lexer's filename patterns.
    if not candidates:
        # The class defines no filename patterns, so any example files under the alias are candidates.
        if not klass.filenames:
            candidates = example_files

        # The class defines filename patterns, so we try to match them.
        for pattern in klass.filenames:
            for f in example_files:
                if _fn_matches(f.name, pattern):
                    candidates.append(f)

    # No proper candidates found, so accept any example file.
    if not candidates:
        candidates = example_files

    # If multiple example files are found, choose the biggest to showcase all the lexer's capabilities.
    if len(candidates) > 1:
        candidates.sort(key=lambda f: f.stat().st_size, reverse=True)

    example_path = f"{alias}/{candidates[0].name}"

    # Search for the lexer class definition
    source_file = Path(inspect.getfile(klass))
    source_code = source_file.read_bytes()
    tree = ast.parse(source_code)
    klass_nodes = [
        node
        for node in tree.body
        if isinstance(node, ast.ClassDef) and node.name == klass.__name__
    ]
    assert (
        len(klass_nodes) == 1
    ), f"Expected exactly one class definition for {klass.__name__} in {source_file}"
    klass_node = klass_nodes[0]

    # Search for the line numbers of the version_added and _example attributes.
    version_added_lineno = None
    example_lineno = None
    col_offset = 0
    for stmt in klass_node.body:
        # Ignore statements that are not assignments with a single target.
        if not isinstance(stmt, ast.Assign) or len(stmt.targets) != 1:
            continue
        var = stmt.targets[0]
        # Record the locations of the version_added and _example assignments.
        if isinstance(var, ast.Name):
            if var.id == "version_added":
                version_added_lineno = stmt.lineno
                col_offset = stmt.col_offset
            elif var.id == "_example":
                example_lineno = stmt.lineno
                col_offset = stmt.col_offset

    # By convention, we will place the _example attribute right after the version_added attribute.
    assert version_added_lineno, f"version_added attribute not found in {source_file}"

    example_statement = f"{col_offset * ' '}_example = '{example_path}'"

    new_source = []
    for index, line in enumerate(source_file.read_text().splitlines()):
        if example_lineno:
            if index + 1 == example_lineno:
                print(
                    f"Updating _example attribute in {source_file} at line {example_lineno}"
                )
                new_source.append(example_statement)
                continue
        else:
            if index + 1 == version_added_lineno:
                print(
                    f"Adding {example_statement!r} to {source_file} at line {version_added_lineno + 1}"
                )
                new_source.append(line)
                new_source.append(example_statement)
                continue
        new_source.append(line)

    source_file.write_text("\n".join(new_source) + "\n")
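
For illustration, the net effect on each lexer module is a single line added right after version_added. The class and path below are made up:

from pygments.lexer import RegexLexer


class ExampleLexer(RegexLexer):
    # Hypothetical lexer, for illustration only.
    name = 'Example'
    aliases = ['example']
    filenames = ['*.ex']
    version_added = '2.0'
    _example = 'example/demo.ex'  # line inserted by the script
    tokens = {'root': []}
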
Anteru commented 1 month ago

While in theory this seems like a good idea, all our lexers are on one page, and this produces an absolutely massive file as some example files are huge. lexers.html becomes 22 MiB with this change. I'd love to merge this, but I don't think that it's that useful as-is simply because we've got no way to show something that big in a sensible way. (Which is probably also a good thing to fix eventually - some example files are huge and we should have a small example for every lexer to quickly test it.) I do appreciate the work you've been doing here, but that's not going to fly as is.

Maybe if we could generate an example snippet per file and dynamically embed it via some JavaScript? Like, "show an example" and it shows an example? But then again, is there really that much value in there? Originally the examples were added for REPL environments where it's not always entirely clear what the input should look like. I'm open to ideas here.
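
If we went the snippet route, the generation side would be simple enough. A rough sketch, with a made-up function name and an arbitrary cut-off:

from pathlib import Path

SNIPPET_LINES = 30  # arbitrary cut-off, to be tuned


def make_snippet(example_file: Path) -> str:
    # Embed only a short, representative excerpt of the example file,
    # marking any truncation so readers know there is more.
    lines = example_file.read_text(encoding="utf-8", errors="replace").splitlines()
    snippet = lines[:SNIPPET_LINES]
    if len(lines) > SNIPPET_LINES:
        snippet.append("...")
    return "\n".join(snippet)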

Interestingly, I'm also getting a warning from "notmuch", which makes me a bit worried. This should also happen in tox, but I'm not seeing it there, and I'm wondering if we've been skipping a test file? If I had to guess, it's because notmuch_example contains a few funky bits which didn't survive the example-file handling.

kdeldycke commented 1 month ago

Ahah yes, no worries, I was also aware this approach was over-the-top. 😅 But I couldn't resist testing it anyway, hence my PR! 😝

A huge lexers.html is of course not a good idea. Now, what do you think of splitting lexers.html into sub-pages? Like one for each lexers/*.py file? This would break up the big file and maybe improve search engine ranking, while keeping lexers of the same family on the same page (great for reducing noise while still allowing comparison between implementation details of different flavors).
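
The module-to-lexer grouping such sub-pages would need is already in Pygments' (private) lexer mapping, so a page generator could start from a sketch like this:

from collections import defaultdict

from pygments.lexers._mapping import LEXERS

# Group lexer display names by the pygments.lexers.* module defining them;
# each group would become one documentation sub-page.
pages = defaultdict(list)
for module, name, aliases, filenames, mimetypes in LEXERS.values():
    pages[module].append(name)

for module, names in sorted(pages.items()):
    print(f"{module}: {len(names)} lexers")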

I explicitly chose the biggest file in my heuristics as a proxy for demonstrating all the arcane capabilities of each lexer. My rationale was that by showing the result of each lexer in the docs, we would increase the likelihood of contributions: a detail missed by a lexer, or a mistake, would be immediately spotted by a user fluent in that language.

Anteru commented 1 month ago

Yeah, if there's a way to split up lexers.html into individual sub-pages and have the lexers page act as the index, we might be on to something. But then I'd still argue that most example files aren't really good examples for the documentation. If we have individual pages, we could even put the examples at the bottom and just have all of them there? Would have to look at it before making a call :(