mikemccand opened this issue 8 months ago
Do you really have to start from surface-text tokenization? Java already has the APIs to parse the source code into a fully annotated tree.
For example, I believe javadoc can generate annotated source when using the -linksource option, and that should be accessible via the Taglet API. There is also the annotation processing API, which builds on the java language model packages, I think.
I am not saying it will be an easier path, and it does not seem to be used by many people, but it could provide more value, e.g. giving higher weight to method names where they are defined as opposed to where they are called.
I guess this depends on what you envisage by "searching source".
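To make that concrete, here is a minimal, untested sketch of what the JDK Compiler Tree API (JavacTask + com.sun.source) gives us: it parses a single .java file and separates method declarations from method invocations, which is exactly the distinction you would need in order to boost definition sites. The class and variable names are made up for illustration.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodInvocationTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;

import javax.tools.JavaCompiler;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class MethodNameExtractor {
  public static void main(String[] args) throws Exception {
    JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    try (StandardJavaFileManager fm = compiler.getStandardFileManager(null, null, null)) {
      // Parse only -- no type attribution is needed just to pull out names.
      JavacTask task = (JavacTask) compiler.getTask(
          null, fm, null, null, null, fm.getJavaFileObjects(new File(args[0])));

      List<String> declared = new ArrayList<>();
      List<String> invoked = new ArrayList<>();

      for (CompilationUnitTree unit : task.parse()) {
        unit.accept(new TreeScanner<Void, Void>() {
          @Override
          public Void visitMethod(MethodTree node, Void p) {
            declared.add(node.getName().toString());        // definition site
            return super.visitMethod(node, p);
          }
          @Override
          public Void visitMethodInvocation(MethodInvocationTree node, Void p) {
            invoked.add(node.getMethodSelect().toString()); // call site
            return super.visitMethodInvocation(node, p);
          }
        }, null);
      }

      System.out.println("declared: " + declared);
      System.out.println("invoked:  " + invoked);
    }
  }
}
```

Indexing the two lists into separate fields would then let a query boost the declaration field over the invocation field.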
I like that idea @arafalov -- that would give us a nice initial tokenization, and the deep metadata (class name, method name, a class being subclassed, etc.) could enable awesome faceting / fielded search, e.g. searching specifically for classes matching the text, or only string literals, or maybe exceptions being thrown. I love it!
We would probably do some additional tokenizing on top of that, e.g. split_on_underscores or SplitOnCamelCase, so search would match fragments inside these longish variable names, or so synonyms could apply.
And we'd apply some of this to the query text too...
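For the splitting itself, Lucene's existing WordDelimiterGraphFilter already handles underscores and case changes, so a first cut might be an analyzer along these lines (a rough sketch; the flag choices are just a guess at sensible defaults):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;

/** Splits identifiers like SplitOnCamelCase and split_on_underscores into searchable parts. */
public class SourceCodeAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
              | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE
              | WordDelimiterGraphFilter.SPLIT_ON_NUMERICS
              | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;   // keep the full identifier too
    TokenStream result = new WordDelimiterGraphFilter(source, flags, null);
    result = new LowerCaseFilter(result);
    return new TokenStreamComponents(source, result);
  }
}
```

Passing the same analyzer to both IndexWriterConfig and the query parser would cover the query-text side automatically, and synonyms could be bolted on later with a SynonymGraphFilter in the same chain.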
Looks like the com.sun.source.tree package has all the juicy stuff.
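Once a tree visitor hands back class names, method names, string literals, etc., the fielded-search side is mostly a matter of choosing Lucene fields. A sketch of the document-building step; the field names and helper signature are invented for illustration:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class SourceDocBuilder {
  /** One Lucene document per compilation unit; field names are illustrative only. */
  static void addSourceFile(IndexWriter writer, String path, String className,
                            List<String> declaredMethods, List<String> stringLiterals,
                            String sourceText) throws IOException {
    Document doc = new Document();
    doc.add(new StoredField("path", path));
    doc.add(new TextField("class", className, Field.Store.YES));    // from ClassTree
    for (String m : declaredMethods) {
      doc.add(new TextField("method", m, Field.Store.NO));          // from MethodTree.getName()
    }
    for (String s : stringLiterals) {
      doc.add(new TextField("literal", s, Field.Store.NO));         // from String-valued LiteralTree nodes
    }
    doc.add(new TextField("code", sourceText, Field.Store.NO));     // whole file, catch-all field
    writer.addDocument(doc);
    // Fielded queries then come for free, e.g. method:flush, class:IndexWriter,
    // literal:"some message text".
  }
}
```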
I currently work at Sourcegraph on code search and would be happy to collaborate on this! For reference we use Zoekt (https://github.com/sourcegraph/zoekt), the leading open source code search engine that originally powered Google's internal code search. There are lots of nice ideas there like ngram indexing, symbol identification and boosting, and more.
If you want to go the tokenization route, we've had good luck with tree-sitter, which is a fast parser that works for most major languages. It's used all over the place, for example at GitHub for syntax highlighting and source code navigation.
@stefanvodita has a cool suggestion on the Lucene dev list thread announcing githubsearch:
Maybe we could index Lucene's source code too? How fun it'd be to search Lucene's own source code using Lucene!
GitHub recently migrated away from Elasticsearch to their own Rust-based search engine ... perhaps, if that is open source too, we could poach some ideas. Specifically, the programming language tokenization would be a fun problem to tackle. It looks like GitHub's search uses ngrams for this.
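Not sure exactly how GitHub's engine builds its ngram index, but Lucene's own NGramTokenizer already produces the trigram-style terms that Zoekt-like engines rely on, so it is easy to experiment. A tiny, purely illustrative demo:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TrigramDemo {
  public static void main(String[] args) throws IOException {
    // Trigram tokenization: the flavor of ngram indexing used to make
    // substring and regex matching cheap at query time.
    try (Tokenizer tok = new NGramTokenizer(3, 3)) {
      tok.setReader(new StringReader("IndexWriter.forceMerge"));
      CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
      tok.reset();
      while (tok.incrementToken()) {
        System.out.print(term + " ");   // Ind nde dex exW ...
      }
      tok.end();
    }
  }
}
```

Whether we want ngrams for substring/regex matching, identifier-aware tokens for "natural" search, or both in separate fields, seems like the core design question to settle first.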