tree-sitter / java-tree-sitter

Java bindings to the Tree-sitter parsing library
https://tree-sitter.github.io/java-tree-sitter/
MIT License

Provide convenience method for loading language library #23

Closed. Marcono1234 closed this issue 2 months ago

Marcono1234 commented 2 months ago

(Please correct me if anything of the following is wrong)

If I understand it correctly, for all parser implementations there is always a tree_sitter_<lang> function, and it always has the same signature.

Currently jtreesitter only provides a Language(MemorySegment) constructor, so you have to write boilerplate code which looks up the tree_sitter_<lang> function and invokes it (as done in the test code). This can be an obstacle for new users of jtreesitter because they either have to be somewhat familiar with java.lang.foreign, or blindly copy code they don't understand.

It would be useful if Language provided a convenience method for this, for example:

public static Language loadLanguage(SymbolLookup parserLibrary, String languageName)

The user could then easily use SymbolLookup#libraryLookup to load the library and then use that Language#loadLanguage method.

If you want I can try to create a proof-of-concept PR for this.

ObserverOfTime commented 2 months ago

The plan is to eventually integrate those bindings into the parsers (see tree-sitter/tree-sitter-java#182).

Marcono1234 commented 2 months ago

But that is specifically for tree-sitter-java, right? That would certainly be useful, but I was thinking of a more general solution for all parsers, e.g. Python, JSON, ... since they all have a tree_sitter_<lang> function with the same signature (?).

ObserverOfTime commented 2 months ago

The CLI will generate bindings for all parsers like it does for other languages.

Marcono1234 commented 2 months ago

Ah, I think I misunderstood you. Is the plan to generate Java bindings for all parsers, and the tree-sitter-java one was just an example? That would be great then!

But would it make sense nonetheless to add a generic loadLanguage method here, for cases where a repository does not include a bindings/java/.../TreeSitter<lang>.java yet?

I was thinking of something like this:

import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;

public final class Language {
    /**
     * Loads a language using the given symbol lookup for the native library.
     * For example:
     * {@snippet lang=java :
     * Path pathToLibrary = Path.of("libtree-sitter-python.so");
     * SymbolLookup libraryLookup = SymbolLookup.libraryLookup(pathToLibrary, Arena.ofAuto());
     * Language language = Language.loadLanguage(libraryLookup, "python");
     * }
     * 
     * @throws IllegalArgumentException If the Tree-sitter language function cannot be found using the symbol lookup
     */
    public static Language loadLanguage(SymbolLookup symbolLookup, String languageName) throws IllegalArgumentException {
        String functionName = "tree_sitter_" + languageName;
        MemorySegment functionAddress = symbolLookup.find(functionName)
            .orElseThrow(() -> new IllegalArgumentException("Language function '%s' not found".formatted(functionName)));

        var voidPtr = ValueLayout.ADDRESS.withTargetLayout(MemoryLayout.sequenceLayout(Long.MAX_VALUE, ValueLayout.JAVA_BYTE));
        var funcDesc = FunctionDescriptor.of(voidPtr);
        var function = Linker.nativeLinker().downcallHandle(functionAddress, funcDesc);
        MemorySegment languagePointer;
        try {
            languagePointer = ((MemorySegment) function.invokeExact()).asReadOnly();
        } catch (Throwable t) {
            throw new RuntimeException("Failed to call language function", t);
        }

        return new Language(languagePointer);
    }

    /**
     * Creates a new instance from the given language pointer.
     *
     * <p>Normally you don't have to obtain the language pointer yourself. Instead, you can either use the
     * generated Java bindings for a parser, for example:
     * {@snippet lang=java :
     * var pointer = TreeSitterPython.language();
     * Language language = new Language(pointer);
     * }
     * Or you can use {@link #loadLanguage(SymbolLookup, String)} to obtain a {@code Language} instance.
     *
     * @implNote It is up to the caller to ensure that the pointer is valid.
     *
     * @throws IllegalArgumentException If the language version is incompatible.
     */
    public Language(MemorySegment address) {
        // ...
    }

    // ...
}

The Javadoc here intentionally refers to tree-sitter-python to reduce confusion and to indicate that it works with any parser; otherwise a user might confuse tree-sitter-java with java-tree-sitter / jtreesitter, or think that jtreesitter only works with the Java parser.
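
The lookup-and-call pattern from the proposal can be tried without any parser library installed. The following is a minimal sketch, assuming Java 22 or later (where java.lang.foreign is final). The class LoadLanguageSketch, the helper callLanguageFunction and the lookup demoLookup are hypothetical names used only for illustration; they are not part of jtreesitter. Because SymbolLookup is a functional interface, the sketch answers the made-up name tree_sitter_demo with the C standard library function localeconv, which has the same shape as a language function (no parameters, returns a pointer), and reports every other name as missing.

```java
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.util.Optional;

public class LoadLanguageSketch {
    // Hypothetical stand-in for the proposed Language.loadLanguage. It returns the raw
    // pointer instead of a Language so that the sketch does not depend on jtreesitter.
    static MemorySegment callLanguageFunction(SymbolLookup lookup, String languageName) {
        String functionName = "tree_sitter_" + languageName;
        MemorySegment address = lookup.find(functionName)
            .orElseThrow(() -> new IllegalArgumentException(
                "Language function '%s' not found".formatted(functionName)));

        // Describes a C function that takes no parameters and returns a pointer.
        MethodHandle handle = Linker.nativeLinker()
            .downcallHandle(address, FunctionDescriptor.of(ValueLayout.ADDRESS));
        try {
            return (MemorySegment) handle.invokeExact();
        } catch (Throwable t) {
            throw new RuntimeException("Failed to call language function", t);
        }
    }

    // Assumption for the demo only: map the made-up symbol "tree_sitter_demo" to libc's
    // localeconv, and report every other symbol as not found.
    static SymbolLookup demoLookup() {
        SymbolLookup libc = Linker.nativeLinker().defaultLookup();
        return name -> name.equals("tree_sitter_demo")
            ? libc.find("localeconv")
            : Optional.<MemorySegment>empty();
    }

    public static void main(String[] args) {
        // Success path: the symbol is found and the native function is invoked.
        MemorySegment pointer = callLanguageFunction(demoLookup(), "demo");
        System.out.println("Got a non-null pointer: " + (pointer.address() != 0L));

        // Failure path: the lookup has no symbol named tree_sitter_python.
        try {
            callLanguageFunction(demoLookup(), "python");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a real parser, the lookup would instead come from SymbolLookup.libraryLookup as in the Javadoc snippet above. Note that this sketch uses plain ValueLayout.ADDRESS, so the returned segment has zero length; the withTargetLayout variant in the proposal yields a segment whose contents can be read. Depending on the JDK version, the runtime may also print a warning about restricted native access unless --enable-native-access is set.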

ObserverOfTime commented 2 months ago

> But would it make sense nonetheless to add a generic loadLanguage method here

Only until the bindings are autogenerated, at which point it'll be deprecated.