Accessing the lexer - Githubissues

Vipitis commented 2 months ago

Hey,

Is there a way to access the lexer via the Python bindings? I am working on a code clone classification metric (lexical, syntactical, near miss, semantic). It sounds like the way to get lexer output is via the debug arg, but that is out of scope for the Python bindings. As a workaround, we currently walk the tree and collect all leaves, since those are supposed to be tokens.

amaanq commented 1 month ago

The lexer is not really accessible in any of the bindings. What exactly are you looking to do with the lexer? If you're writing a grammar and want to lex certain tokens yourself, you can use an external scanner, but I don't think this is exactly what you're asking. It would be nice if you could clarify or explain more.

Vipitis commented 1 month ago

This is for a code generation metric, and I want to implement some level of clone detection. The literature isn't entirely clear on this, but I settled on these types:

type-0. exact match (I am just comparing two strings)
type-1. lexical similarity: differences only in whitespace and comments.
type-2. syntactic similarity: differences only in identifier names (function names, variable names, etc.)
type-3. near miss: some added or removed statements (not doing this atm)
type-4. semantic similarity: exactly the same output, but possibly a completely different algorithm.
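
As a rough sketch of how types 0 to 2 could be told apart: the snippet below assumes each piece of code has already been reduced to a list of `(node type, text)` pairs. That pair representation and the names `normalize` and `clone_type` are hypothetical, chosen only for illustration; they are not part of tree-sitter or its Python bindings.

```python
from typing import List, Optional, Tuple

# Hypothetical token representation: (node type, source text).
Token = Tuple[str, str]


def normalize(tokens: List[Token], skip_comments: bool = False,
              rename_identifiers: bool = False) -> List[str]:
    """Reduce tokens to a comparable list of strings."""
    out = []
    for kind, text in tokens:
        if skip_comments and kind == "comment":
            continue  # comments do not count for type-1 and type-2
        if rename_identifiers and kind == "identifier":
            out.append("id")  # one shared placeholder for every identifier
        else:
            out.append(text)
    return out


def clone_type(ref_src: str, pred_src: str,
               ref_tokens: List[Token], pred_tokens: List[Token]) -> Optional[str]:
    """Return the strictest clone type that matches, or None."""
    if ref_src == pred_src:
        return "type-0"
    if normalize(ref_tokens, True) == normalize(pred_tokens, True):
        return "type-1"
    if normalize(ref_tokens, True, True) == normalize(pred_tokens, True, True):
        return "type-2"
    return None  # type-3 and type-4 need more than token comparison
```

The checks run from strictest to loosest, so a pair is labelled with the narrowest type it satisfies.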

For types 1 and 2 I decided to use tree-sitter for lexing (I am already using tree-sitter in the project for something else). Here is my WIP function. It just walks the tree recursively and returns a list of all tokens. I put in options to skip comments (for type-1) and replace identifiers (for type-2). Namespaces will be an issue for the latter, which I might be able to ignore since we only ever compare a single function. Elsewhere in the code I then simply check whether the lists are the same for reference and prediction.

# simple lexer?
import tree_sitter

def get_leaves(subtree: tree_sitter.Node, skip_comments: bool = False, rename_identifiers: bool = False) -> list[str]:
    # TODO: add like a wrapper function to give the root node initially...
    if subtree.child_count == 0:
        # leaf node: treated as a single token
        if subtree.type == "comment" and skip_comments:
            return []
        if subtree.type == "identifier" and rename_identifiers:
            # TODO: what about different name spaces - where do we hand this upwards?
            # do we need to like return our mapping to get different placeholders?
            return ["id"]
        # Node.text is bytes, so decode it to match the list[str] return type
        return [subtree.text.decode("utf-8")]
    # inner node: collect the tokens of all children in order
    tokens = []
    for child in subtree.children:
        tokens.extend(get_leaves(child, skip_comments, rename_identifiers))
    return tokens
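
One possible answer to the placeholder TODO above, as a minimal sketch: number identifiers in order of first appearance, so that two functions only compare equal when their identifiers are used in the same pattern. It assumes a single flat namespace (reasonable when only one function is compared) and works on a hypothetical list of `(node type, text)` pairs rather than on tree-sitter nodes; `rename_consistently` is an illustrative name, not an existing API.

```python
from typing import Dict, List, Tuple


def rename_consistently(tokens: List[Tuple[str, str]]) -> List[str]:
    """Replace each distinct identifier with id_0, id_1, ... by first appearance."""
    mapping: Dict[str, str] = {}  # original name -> placeholder
    out = []
    for kind, text in tokens:
        if kind == "identifier":
            # Assumption: one flat namespace, so equal names share a placeholder.
            if text not in mapping:
                mapping[text] = f"id_{len(mapping)}"
            out.append(mapping[text])
        else:
            out.append(text)
    return out
```

With a single shared `"id"` placeholder, `a + b` and `a + a` would look identical; with numbered placeholders they do not, while `a + b` and `x + y` still match.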

I hope this explains my use case and current implementation. I have a few more steps to optimize (like comparing just the relevant function definition node instead of the whole tree, and adding unique IDs to identifiers?), but this is already reasonably fast and I can use it as is. If you have any suggestions for a better implementation, let me know. Otherwise feel free to close this as not planned.

tree-sitter / py-tree-sitter

Accessing the lexer #260