jplag / JPlag

State-of-the-Art Software Plagiarism & Collusion Detection
https://jplag.github.io/JPlag/
GNU General Public License v3.0

Generic input for easier long-term support of languages #1734

Open wadoon opened 2 months ago

wadoon commented 2 months ago

I just stumbled across JPlag.

It is a pity that some languages are in a legacy state. I would like to suggest that a common input format for token streams (or ASTs) might be useful for supporting various languages. For example, writing a parser for Python is hard, but Python ships a reusable parser that works and creates ASTs. It is easy to get the token stream from it, store it as JSON, s-expressions, etc., and load this list of tokens into JPlag.

If you had a generic input model in which you can declare your tokens (or AST), you could use an existing parser and write a small adapter for the translation.
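As a rough illustration of such an adapter, here is a minimal sketch that uses Python's built-in `ast` module to turn a source string into a flat token list and prints it as JSON. The JSON layout (`type`, `line`, `column`) and the function name `python_tokens` are assumptions made up for this example; JPlag does not define such an input format.

```python
import ast
import json


def python_tokens(source: str) -> list[dict]:
    """Return one token per AST node that carries a source position."""
    tokens = []
    for node in ast.walk(ast.parse(source)):
        # Nodes such as Module or Load/Store contexts have no position; skip them.
        if hasattr(node, "lineno"):
            tokens.append({
                "type": type(node).__name__,
                "line": node.lineno,
                "column": node.col_offset + 1,  # 1-based column
            })
    # ast.walk does not guarantee source order, so sort by position.
    tokens.sort(key=lambda t: (t["line"], t["column"]))
    return tokens


if __name__ == "__main__":
    # Hypothetical output format: a JSON array of token objects.
    print(json.dumps(python_tokens("x = 1\nprint(x)\n"), indent=2))
```

Emitting every positioned node is deliberately naive; a real adapter would pick which nodes become tokens.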

This might get interesting if you look at tree-sitter. It is a parser framework with grammars for several hundred languages and is widely used for syntax highlighting, etc. Tree-sitter provides a uniform AST representation (s-expressions). Supporting this or a similar format could boost the reach tremendously.

Greetings from down the floor, Alexander

tsaglam commented 2 months ago

The term legacy is probably not clear enough. It just means the language module is mostly still in a state of the legacy version of JPlag (v2.x.x and earlier). Regarding the generic language module that supports a common input format: There was an implementation of this in the fork of @CodeGra-de. However, it might not be there anymore.

We have been thinking about integrating tree-sitter before. However, we would probably prefer direct integration via Java bindings to a more decoupled approach via a generic language module. Tree-sitter would provide us with more means to parse up-to-date language versions, but it would not make language modules obsolete, as a carefully designed tokenization strategy (some parse tree nodes are extracted as tokens, some are not; some nodes map to the same token type, others do not) is crucial for good detection quality.
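To make the point about tokenization strategy concrete, here is a small sketch using Python's `ast` module. The mapping table and the token names (`LOOP`, `ASSIGN`, and so on) are invented for illustration and are not JPlag's actual token set: several node kinds share one token type, and node kinds missing from the table yield no token.

```python
import ast

# Hypothetical mapping: several node kinds share one token type,
# and node kinds not listed here produce no token at all.
TOKEN_TYPES = {
    ast.For: "LOOP",
    ast.While: "LOOP",
    ast.If: "IF",
    ast.FunctionDef: "FUNCTION_DEF",
    ast.AsyncFunctionDef: "FUNCTION_DEF",
    ast.Assign: "ASSIGN",
    ast.AugAssign: "ASSIGN",
    ast.Call: "CALL",
    ast.Return: "RETURN",
}


class TokenExtractor(ast.NodeVisitor):
    """Collects token types in depth-first, pre-order traversal."""

    def __init__(self):
        self.tokens = []

    def generic_visit(self, node):
        token_type = TOKEN_TYPES.get(type(node))
        if token_type is not None:
            self.tokens.append(token_type)
        super().generic_visit(node)  # continue into child nodes


def tokenize(source: str) -> list[str]:
    extractor = TokenExtractor()
    extractor.visit(ast.parse(source))
    return extractor.tokens
```

With this table, a `for` loop rewritten as a `while` loop with renamed variables yields the same token sequence, which is the kind of robustness a plain dump of all parse tree nodes would not give.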

The quality and currency of some of the ANTLR grammars are pain points (almost all are from https://github.com/antlr/grammars-v4/), so I would not rule out a tree-sitter integration.

wadoon commented 2 months ago

The goal is not a tree-sitter integration. Rather, the goal would be to allow a generic program that can be triggered from JPlag and returns a JSON (or whatever format) object describing the list of tokens. For example:

jplag --language rust --use-preprocessor './rustAST2json {file}' *.rs

The given program ./rustAST2json reads the given file and returns the token information via stdout.
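The host side of this proposal could look roughly like the sketch below: substitute the file path for `{file}`, run the external program, and parse its stdout as JSON. Note that `--use-preprocessor` is only the flag proposed in this comment, not an existing JPlag option, and `run_preprocessor` is a hypothetical helper name.

```python
import json
import shlex
import subprocess


def run_preprocessor(command_template: str, path: str) -> list:
    """Run an external tokenizer on one file and return its parsed JSON output."""
    # Split first and substitute afterwards, so a path containing spaces
    # still ends up as a single argument.
    argv = [arg.replace("{file}", path) for arg in shlex.split(command_template)]
    # check=True raises CalledProcessError if the preprocessor exits non-zero.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

For the example above, this would be called as `run_preprocessor("./rustAST2json {file}", "main.rs")`, assuming such a program exists and prints a JSON array.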

tsaglam commented 2 months ago

This is roughly the generic language module that CodeGra-de made, but we do not currently plan to implement such a feature. Direct integration of tree-sitter, however, is something we may consider in the future.