google / zoekt

Fast trigram based code search

Add faceted search / custom filters / heterogenous monorepo indexation #62

Open roscopecoltran opened 6 years ago

roscopecoltran commented 6 years ago

Hi guys,

Hope you are all well !

I was wondering what the best way would be to add faceting to the search results shown by the zoekt web server, such as filtering by language or by custom user-defined filters like the examples below.

I just want to extend zoekt to filter a large heterogenous code monorepo (mainly all my local repositories, more than 500 repos). I am struggling to decide whether I should build a blevesearch index after zoekt has indexed the code, or whether it would be possible to add pre/post-processing plugins that run while zoekt indexes the code.

I found this cool tokenizer package from a Stack Overflow employee, https://github.com/clipperhouse/jargon, for recognizing canonical and synonymous dev/tech terms. I wanted to chain it in parallel, as a plugin, to post-process the topics extracted during code indexing.

Question: What would be the best approach to build a quick PoC with these external filtering bots/plugins?

Cheers, Rosco

Examples:

[
    {
        "language":"conan",
        "type":"BuildSystem",
        "fileNames":["conanfile.txt", "conanfile.py", "conanenv.txt"]
    },
    {
        "language":"scons",
        "type":"BuildSystem",
        "fileNames":["sconstruct"]
    },
    {
        "language":"premake",
        "type":"BuildSystem",
        "fileNames":["premake4.lua", "premake5.lua"]
    },
    {
        "language":"gulp",
        "type":"BuildSystem",
        "fileNames":["gulp.js"]
    },
    {
        "language":"zeus",
        "type":"BuildSystem",
        "fileNames":["zeusfile.yml"]
    },
    {
        "language":"bam",
        "type":"BuildSystem",
        "fileNames":["bam.lua"]
    },
    {
        "language":"meson",
        "type":"BuildSystem",
        "fileNames":["meson.build"]
    },
    {
        "language":"hunter",
        "type":"BuildSystem",
        "fileNames":["huntergate.cmake", "hunter.cmake"]
    },
    {
        "language":"cget",
        "type":"BuildSystem",
        "fileNames":["requirements.txt"]
    },
    {
        "language":"conda",
        "type":"BuildSystem",
        "fileNames":["meta.yaml"]
    },
    {
        "language":"shake",
        "type":"BuildSystem",
        "fileNames":["build.hs"]
    },
    {
        "language":"gemfile",
        "type":"BuildSystem",
        "fileNames":["gemfile"]
    },
    {
        "language":"npm",
        "type":"BuildSystem",
        "fileNames":["package.json"]
    },
    {
        "language":"webpack",
        "type":"BuildSystem",
        "fileNames":["webpack.config.js"]
    },
    {
        "language":"bower",
        "type":"BuildSystem",
        "fileNames":["bower.json"]
    },
    {
        "language":"maven",
        "type":"BuildSystem",
        "fileNames":["pom.xml"]
    },
    {
        "language":"cmake",
        "type":"BuildSystem",
        "fileNames":["cmakelists.txt"],
        "fileSuffixes": [".cmake"]
    },
    {
        "language":"makefile",
        "type":"BuildSystem",
        "fileNames":["makefile"],
        "fileSuffixes": [".make", ".mkfile", ".mak", ".mk"]
    },
    {
        "language":"qmake",
        "type":"BuildSystem",
        "fileSuffixes": [".pro", ".pri"]
    },
    {
        "language":"visual studio",
        "type":"BuildSystem",
        "fileSuffixes": [".sln", ".vcxproj", ".vcproj", ".props"]
    },
    {
        "language":"xcode",
        "type":"BuildSystem",
        "fileSuffixes": [".xcconfig", ".pbxproj", ".xcworkspacedata"]
    },
    {
        "language":"automake",
        "type":"BuildSystem",
        "fileSuffixes": [".am"]
    },
    {
        "language":"ninja",
        "type":"BuildSystem",
        "fileSuffixes": [".ninja"]
    },
    {
        "language":"vcpkg",
        "type":"BuildSystem",
        "fileSuffixes": [".vcpkg"]
    },
    {
        "language":"boost.jam",
        "type":"BuildSystem",
        "fileSuffixes": [".jam"]
    },
    {
        "language":"gradle",
        "type":"BuildSystem",
        "fileSuffixes": [".gradle"]
    },
    {
        "language":"bazel",
        "type":"BuildSystem",
        "fileSuffixes": [".bzl"]
    },
    {
        "language":"gyp",
        "type":"BuildSystem",
        "fileSuffixes": [".gyp", ".gypi"]
    },
    {
        "language":"eslint",
        "type":"EnvConfig",
        "filePrefixes": [".eslintrc."]
    },
    {
        "language":"travis",
        "type":"EnvConfig",
        "fileNames":[".travis.yml"]
    },
    {
        "language":"appveyor",
        "type":"EnvConfig",
        "fileNames":["appveyor.yml"]
    },
    {
        "language":"gitlab",
        "type":"EnvConfig",
        "fileNames":[".gitlab-ci.yml"]
    },
    {
        "language":"circleci",
        "type":"EnvConfig",
        "fileNames":["circle.yml"]
    },
    {
        "language":"clangformat",
        "type":"EnvConfig",
        "fileNames":[".clang-format"]
    },
    {
        "language":"clang_complete",
        "type":"EnvConfig",
        "fileNames":[".clang_complete"]
    },
    {
        "language":"editorconfig",
        "type":"EnvConfig",
        "fileNames":[".editorconfig"]
    },
    {
        "language":"gdbinit",
        "type":"EnvConfig",
        "fileNames":[".gdbinit"]
    },
    {
        "language":"yard",
        "type":"EnvConfig",
        "fileNames":[".yardopts"]
    },
    {
        "language":"codecov.io",
        "type":"EnvConfig",
        "fileNames":[".codecov.yml"]
    },
    {
        "language":"pylint",
        "type":"EnvConfig",
        "fileNames":[".pylintrc"]
    },
    {
        "language":"flake8",
        "type":"EnvConfig",
        "fileNames":[".flake8"]
    },
    {
        "language":"emacs.dir-locals",
        "type":"EnvConfig",
        "fileNames":[".dir-locals.el"]
    },
    {
        "language":"doxygen",
        "type":"EnvConfig",
        "fileNames":["doxygen.config"]
    },
    {
        "language":"apache-2.0",
        "type":"License",
        "fileNames":["apache-2.0.txt"]
    },
    {
        "language":"agpl-3.0",
        "type":"License",
        "fileNames":["gnu-agpl-3.0.txt"]
    },

    {
        "language":"flatbuffers",
        "type":"Generator",
        "fileSuffixes": [".fbs"]
    },
    {
        "language":"cap'n proto",
        "type":"Generator",
        "fileSuffixes": [".capnp"]
    },
    {
        "language":"lex",
        "type":"Generator",
        "fileSuffixes": [".l", ".lex", ".ll"]
    },
    {
        "language":"yacc",
        "type":"Generator",
        "fileSuffixes": [".y", ".yacc", ".yxx"]
    },
    {
        "language":"m4",
        "type":"Generator",
        "fileSuffixes": [".m4"]
    }
]
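A minimal sketch, in Go, of how a rule list like the one above could be applied to file paths. The `Rule` type, `Classify`, and `loadRules` are hypothetical names introduced for illustration and are not part of zoekt. Matching is done on the lowercased base name, which is an assumption based on the lowercase names used in the list.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
	"strings"
)

// Rule mirrors one entry of the JSON rule list (hypothetical type).
type Rule struct {
	Language     string   `json:"language"`
	Type         string   `json:"type"`
	FileNames    []string `json:"fileNames"`
	FileSuffixes []string `json:"fileSuffixes"`
	FilePrefixes []string `json:"filePrefixes"`
}

// Matches reports whether the base name of p satisfies the rule.
// The base name is lowercased first, since the rule names are lowercase.
func (r Rule) Matches(p string) bool {
	base := strings.ToLower(path.Base(p))
	for _, n := range r.FileNames {
		if base == n {
			return true
		}
	}
	for _, s := range r.FileSuffixes {
		if strings.HasSuffix(base, s) {
			return true
		}
	}
	for _, pre := range r.FilePrefixes {
		if strings.HasPrefix(base, pre) {
			return true
		}
	}
	return false
}

// Classify returns every rule that matches the path p.
func Classify(rules []Rule, p string) []Rule {
	var out []Rule
	for _, r := range rules {
		if r.Matches(p) {
			out = append(out, r)
		}
	}
	return out
}

// sample is a three-entry excerpt of the list, kept short for the demo.
const sample = `[
  {"language":"cmake","type":"BuildSystem","fileNames":["cmakelists.txt"],"fileSuffixes":[".cmake"]},
  {"language":"eslint","type":"EnvConfig","filePrefixes":[".eslintrc."]},
  {"language":"travis","type":"EnvConfig","fileNames":[".travis.yml"]}
]`

// loadRules decodes a JSON rule list.
func loadRules(data string) ([]Rule, error) {
	var rules []Rule
	err := json.Unmarshal([]byte(data), &rules)
	return rules, err
}

func main() {
	rules, err := loadRules(sample)
	if err != nil {
		panic(err)
	}
	for _, p := range []string{"src/CMakeLists.txt", "web/.eslintrc.json", "main.go"} {
		for _, r := range Classify(rules, p) {
			fmt.Printf("%s: %s (%s)\n", p, r.Language, r.Type)
		}
	}
}
```

The resulting `Language` and `Type` values could then serve as per-file facets, whether they are stored at index time or computed on the result list.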

or

{
    "brands": {
        "google":["google","angular","googlecloudplatform","googlechrome", "golang", "gwtproject", "zxing", "v8"],
        "twitter":["twbs", "twitter", "bower", "flightjs"],
        "facebook": ["facebook", "facebookarchive","boltsframework"],
        "github":["atom", "github"],
        "microsoft": ["microsoft", "dotnet", "aspnet", "exceptionless", "mono", "winjs"]
    },
    "keywords":{
        "node": ["node", "nodejs"],
        "jquery": ["jquery", "jq", "/^jq[\\-]?/"],
        "grunt": ["grunt", "gruntjs"],
        "angular": ["angular", "angularjs", "ng", "/^ng(?!inx)[\\-]?/"],
        "ember": ["emberjs", "ember"],
        "meteor": ["meteor", "meteorjs"],
        "gulp": ["gulp"],
        "express": ["express", "expressjs"],
        "d3": ["d3"],
        "polymer": ["polymer"],
        "ionic": ["ionic"],
        "seajs": ["seajs"],
        "yeoman": ["yeoman"],
        "browserify": ["browserify"],
        "requirejs": ["requirejs"],
        "underscore": ["underscore", "underscorejs"],
        "modernizr": ["modernizr"],
        "phantom": ["phantom", "phantomjs"],
        "metalsmith": ["metalsmith"],

        "bootstrap": ["bootstrap"],

        "django": ["django"],
        "bottle": ["bottlepy", "bottle"],
        "web2py": ["web2py"],
        "webpy": ["webpy"],
        "flask": ["flask"],
        "ipython": ["ipython"],
        "fabric": ["fabric"],
        "celery": ["celery"],

        "language/python": ["python", "/^py/"],
        "language/ruby": ["ruby"],
        "language/clojure": ["clojure"],
        "language/lisp": ["lisp"],
        "language/rust": ["rust"],
        "language/erlang": ["erlang"],
        "language/go": ["golang", "go"],
        "language/javascript": ["javascript", "js"],
        "language/coffeescript": ["coffeescript"],
        "language/php": ["php"],
        "language/perl": ["perl"],
        "language/swift": ["swift"],
        "language/css": ["css", "stylesheet"],

        "ios": ["ios"],
        "osx": ["osx"],
        "unix": ["unix"],
        "android": ["android"],
        "linux": ["linux"],
        "windows": ["windows"],

        "deprecated": ["deprecated"],
        "pdf": ["pdf"],
        "polyfill": ["polyfill"],
        "framework": ["framework"],
        "dropbox": ["dropbox"],
        "webkit": ["webkit"],
        "sql": ["sql"],
        "svg": ["svg"],
        "boilerplate": ["boilerplate", "seed"],
        "rails": ["rails", "rails3"],
        "vim": ["vim", "vi"],
        "git": ["git"],
        "backbone": ["backbone"],
        "docker": ["docker"],
        "emacs": ["emacs"],
        "redis": ["redis"],
        "chrome": ["chrome"],
        "sublime": ["sublime"],
        "vagrant": ["vagrant"],
        "wordpress": ["wordpress", "/^wp\\-/"],
        "youtube": ["youtube"],
        "apache": ["apache"],
        "jekyll": ["jekyll"],
        "puppet": ["puppet"],
        "sass": ["sass", "scss"],
        "nginx": ["nginx"],
        "markdown": ["markdown"],
        "elasticsearch": ["elasticsearch"],
        "chef": ["chef"],
        "mongodb": ["mongodb", "mongo"],
        "cordova": ["cordova"],
        "phonegap": ["phonegap"],
        "ansible": ["ansible"],
        "openshift": ["openshift"],
        "mysql": ["mysql"],
        "couchbase": ["couchbase"],
        "firebase": ["firebase"],
        "homebrew": ["homebrew"],
        "openstack": ["openstack"],
        "maven": ["maven"],
        "hadoop": ["hadoop"],
        "spark": ["spark"],
        "jasmine": ["jasmine"],
        "hubot": ["hubot"],
        "jruby": ["jruby"],
        "couchdb": ["couchdb"],
        "travis": ["travis"],
        "bash": ["bash"],
        "coreos": ["coreos"],
        "mustache": ["mustache"],
        "zsh": ["zsh"],
        "jenkins": ["jenkins"],
        "cassandra": ["cassandra"],
        "statsd": ["statsd"],
        "eclipse": ["eclipse"],
        "knockout": ["knockout"],
        "graphite": ["graphite"],
        "textmate": ["textmate"],
        "jed": ["jed"],
        "memcached": ["memcached"],
        "mesos": ["mesos"],
        "rabbitmq": ["rabbitmq"],
        "firefox": ["firefox", "ff"],
        "postgres": ["postgres", "postgresql"],
        "selenium": ["selenium"],
        "gems": ["gems", "rubygems"],
        "zeromq": ["zeromq", "zmq", "0mq"],
        "tmux": ["tmux"],
        "cyanogenmod": ["cyanogenmod"],
        "tornado": ["tornado"],
        "octopress": ["octopress"],
        "dokku": ["dokku"],
        "karma": ["karma"],
        "bitcoin": ["bitcoin"],
        "handlebars": ["handlebars"],
        "qt": ["qt"],
        "minecraft": ["minecraft"],
        "unity": ["unity"],
        "cocos2d": ["cocos2d"],
        "openssl": ["openssl"],
        "amqp": ["amqp"],
        "logstash": ["logstash"],
        "sqlite": ["sqlite"],
        "v8": ["v8"],
        "fuse": ["fuse"],
        "cocoa": ["cocoa"],
        "curl": ["curl"],
        "ffmpeg": ["ffmpeg"],
        "hhvm": ["hhvm"],
        "rake": ["rake"],
        "drupal": ["drupal"],
        "gevent": ["gevent"],
        "nagios": ["nagios"],
        "chromium": ["chromium"],
        "jenkinsci": ["jenkinsci"],
        "etcd": ["etcd"],
        "kubernetes": ["kubernetes"],
        "react": ["react", "reactjs"]
    }
}
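A minimal sketch, in Go, of how the "keywords" map above could be used to normalize a token to canonical tags. It assumes that entries wrapped in slashes are regular expressions and everything else is a literal alias; `compile` and `tagsFor` are hypothetical names. One caveat worth knowing: Go's `regexp` package (RE2 syntax) has no lookahead, so the angular pattern with `(?!inx)` does not compile and is reported as skipped here rather than silently dropped.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// matcher holds the literal aliases and compiled patterns of one canonical tag.
type matcher struct {
	literals map[string]bool
	patterns []*regexp.Regexp
}

// compile turns a keywords map into matchers. Aliases of the form "/.../"
// are compiled as regular expressions; patterns that Go cannot compile
// (for example lookaheads) are returned in skipped.
func compile(keywords map[string][]string) (map[string]*matcher, []string) {
	out := make(map[string]*matcher)
	var skipped []string
	for tag, aliases := range keywords {
		m := &matcher{literals: map[string]bool{}}
		for _, a := range aliases {
			if len(a) >= 2 && strings.HasPrefix(a, "/") && strings.HasSuffix(a, "/") {
				re, err := regexp.Compile(a[1 : len(a)-1])
				if err != nil {
					skipped = append(skipped, a)
					continue
				}
				m.patterns = append(m.patterns, re)
				continue
			}
			m.literals[a] = true
		}
		out[tag] = m
	}
	sort.Strings(skipped)
	return out, skipped
}

// tagsFor returns the sorted canonical tags whose aliases match token.
func tagsFor(ms map[string]*matcher, token string) []string {
	token = strings.ToLower(token)
	var tags []string
	for tag, m := range ms {
		if m.literals[token] {
			tags = append(tags, tag)
			continue
		}
		for _, re := range m.patterns {
			if re.MatchString(token) {
				tags = append(tags, tag)
				break
			}
		}
	}
	sort.Strings(tags)
	return tags
}

// sample is a small excerpt of the keywords map, after JSON unescaping.
var sample = map[string][]string{
	"jquery":          {"jquery", "jq", `/^jq[\-]?/`},
	"wordpress":       {"wordpress", `/^wp\-/`},
	"language/python": {"python", "/^py/"},
	"angular":         {"angular", "angularjs", "ng", `/^ng(?!inx)[\-]?/`},
}

func main() {
	ms, skipped := compile(sample)
	fmt.Println("skipped patterns:", skipped)
	for _, tok := range []string{"wp-cli", "pytest", "jq-ui", "ng-table", "ng"} {
		fmt.Printf("%s -> %v\n", tok, tagsFor(ms, tok))
	}
}
```

If the lookahead patterns matter, one option is to replace them with an explicit exclusion list per tag instead of switching regex engines.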

refs.

hanwen commented 6 years ago

What do you want to search for exactly? Per-repository data or per-file data? Let's call them tags.

Do you want full-text string search of the tags, regex search, or only exact matches?

Languages are already supported; see https://cs.bazel.build/search?q=lang%3Apython

roscopecoltran commented 6 years ago

Hi,

Thanks for the reply !

Both, in fact: I would like to index my $GOPATH/src directory, since I clone all kinds of repos into it (nodejs, Java, Python repositories, ...). That keeps my repositories organized by repo URI, so I just want some filters on the left side that let me filter by VCS provider, owner, and project name.
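A minimal sketch, in Go, of deriving those three facets from a checkout path, assuming the usual `$GOPATH/src/<provider>/<owner>/<project>` layout. `RepoFacets` and `facetsFromPath` are hypothetical names, and the example root path is made up. Paths that do not follow the three-level layout are rejected, and hosts such as golang.org/x/... would get "x" as owner, so a real implementation would probably need per-host rules.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// RepoFacets holds the facets derived from a checkout path (hypothetical type).
type RepoFacets struct {
	Provider string // e.g. github.com
	Owner    string // org or user
	Project  string // repository name
}

// facetsFromPath derives facets from dir, which must live under root
// (for example $GOPATH/src) at least three path elements deep.
func facetsFromPath(root, dir string) (RepoFacets, error) {
	rel, err := filepath.Rel(root, dir)
	if err != nil {
		return RepoFacets{}, err
	}
	parts := strings.Split(filepath.ToSlash(rel), "/")
	if len(parts) < 3 || parts[0] == ".." {
		return RepoFacets{}, fmt.Errorf("%q is not of the form provider/owner/project under %q", dir, root)
	}
	return RepoFacets{Provider: parts[0], Owner: parts[1], Project: parts[2]}, nil
}

func main() {
	root := "/home/me/go/src" // stand-in for $GOPATH/src
	f, err := facetsFromPath(root, "/home/me/go/src/github.com/google/zoekt/cmd")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", f)
}
```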

To make it simple, I just wanted to upgrade zoekt to a webui closer to searchcode-server (https://github.com/boyter/searchcode-server).

Example of left filtering blocks of matched files:

Filter by VCS provider:

Filter by namespace (org/user):

Filter by languages:

Filter by filetypes:

Filter by topics:
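A minimal sketch, in Go, of how the counts for such filter blocks could be computed from a list of matched files. `Hit`, `Bucket`, and `facetCounts` are hypothetical names and not zoekt types. A hit is modelled as a plain map from facet name to value, and the sample values are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a simplified, hypothetical search result: facet name -> value.
type Hit map[string]string

// Bucket is one row of a filter block: a facet value and its hit count.
type Bucket struct {
	Value string
	Count int
}

// facetCounts counts how many hits carry each value of the given facet.
// Buckets are ordered by descending count, then by value for stable output.
func facetCounts(hits []Hit, facet string) []Bucket {
	counts := map[string]int{}
	for _, h := range hits {
		if v, ok := h[facet]; ok && v != "" {
			counts[v]++
		}
	}
	buckets := make([]Bucket, 0, len(counts))
	for v, c := range counts {
		buckets = append(buckets, Bucket{Value: v, Count: c})
	}
	sort.Slice(buckets, func(i, j int) bool {
		if buckets[i].Count != buckets[j].Count {
			return buckets[i].Count > buckets[j].Count
		}
		return buckets[i].Value < buckets[j].Value
	})
	return buckets
}

func main() {
	// Made-up hits for the demo.
	hits := []Hit{
		{"provider": "github.com", "lang": "go"},
		{"provider": "github.com", "lang": "python"},
		{"provider": "gitlab.com", "lang": "go"},
	}
	for _, facet := range []string{"provider", "lang"} {
		fmt.Println(facet, facetCounts(hits, facet))
	}
}
```

Counting over the returned hits only reflects the current result page. Exact totals would need support from the search backend.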

I hope that makes my idea clearer. Thanks in advance for your time and reply.

Cheers, Rosco

hanwen commented 6 years ago

"just wanted to upgrade zoekt to a webui closer to searchcode-server"

I don't know enough about building web UIs to pull that off, but I'm happy to review changes.

I could add something to the individual results to add restrictions (this repo, this directory, this language, this branch). Would that help?
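A minimal sketch, in Go, of what such per-result restriction links could look like on the web UI side. It narrows a query by appending atoms and builds a `/search?q=` link in the same shape as the `lang%3Apython` URL quoted earlier in the thread. Only `lang:` is confirmed by this thread. The `repo:`, `file:`, and `branch:` atoms, and the use of `file:` for a directory, are assumptions about zoekt's query syntax, and values containing spaces or regex metacharacters would need quoting that is not handled here. `Restriction`, `narrow`, and `searchLink` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Restriction lists the optional narrowing options offered next to a result.
type Restriction struct {
	Repo   string
	Dir    string
	Lang   string
	Branch string
}

// narrow appends one atom per non-empty restriction to the query q.
func narrow(q string, r Restriction) string {
	var parts []string
	if q = strings.TrimSpace(q); q != "" {
		parts = append(parts, q)
	}
	add := func(atom, value string) {
		if value != "" {
			parts = append(parts, atom+":"+value)
		}
	}
	add("repo", r.Repo)     // assumed atom
	add("file", r.Dir)      // assumed atom, used here for a directory
	add("lang", r.Lang)     // shown in the thread
	add("branch", r.Branch) // assumed atom
	return strings.Join(parts, " ")
}

// searchLink returns a relative link for the web server's search page.
func searchLink(q string) string {
	return "/search?q=" + url.QueryEscape(q)
}

func main() {
	q := narrow("foo", Restriction{Repo: "zoekt", Lang: "go"})
	fmt.Println(q)
	fmt.Println(searchLink(q))
}
```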

roscopecoltran commented 6 years ago

Yes, that would... I can do the web UI. That's not a problem.

Summary: It would be awesome to be able to CRUD some metadata/tags, either global to a repository or specific to a file, on an already existing index or while creating a new one.

E.g. I could use the go-github package to fetch the topics defined for a repository and enrich the search-result restrictions with those owner-defined topics (ref. https://github.com/google/go-github/blob/master/github/repos.go#L58). Then I would do the disambiguation with the jargon package.

This pipeline could be queued and triggered separately, but the most important thing is to have some methods in zoekt to manage this extra data, for a repo or a file, in an already existing zoekt index file. I guess it would be costly to rebuild the index each time if you index more than 1000 repos...

If you do not mind, let me draft an example/PoC in my fork of zoekt ^^; I will send you a link in 1 to 2 hours... :-)

Thanks for your patience

hanwen commented 6 years ago

Please send me a change through Gerrit, as described here:

https://github.com/google/zoekt/blob/master/CONTRIBUTING

hanwen commented 6 years ago

For per-repository data, things are simple. There is already a pipeline for inserting metadata:

https://github.com/google/zoekt/blob/8e284ca7e96491aee468943656a3ec8c5389e1b1/api.go#L196

It only needs a query operator to expose it in searches. You also have to find a way to ingest this data from a given (git) repository. Currently, only git-config settings are imported as repo metadata.
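A minimal sketch, in Go, of what evaluating such a query operator against per-repository metadata could look like. Everything here is hypothetical: the `tag:key=pattern` syntax, `RepoMeta`, `TagQuery`, and `filterRepos` are not zoekt types or zoekt query syntax, and the sample repositories and tags are made up. The value is matched as an anchored regular expression, which covers both the exact-match and the regex case raised earlier in the thread.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// RepoMeta is a stand-in for per-repository metadata (hypothetical type).
type RepoMeta struct {
	Name string
	Tags map[string]string
}

// TagQuery is a parsed "tag:key=pattern" atom (hypothetical syntax).
type TagQuery struct {
	Key   string
	Value *regexp.Regexp
}

// parseTagQuery parses "tag:key=pattern". The pattern is anchored so that
// a plain word behaves as an exact match.
func parseTagQuery(s string) (TagQuery, error) {
	if !strings.HasPrefix(s, "tag:") {
		return TagQuery{}, fmt.Errorf("%q: missing tag: prefix", s)
	}
	kv := strings.SplitN(strings.TrimPrefix(s, "tag:"), "=", 2)
	if len(kv) != 2 || kv[0] == "" {
		return TagQuery{}, fmt.Errorf("%q: want tag:key=pattern", s)
	}
	re, err := regexp.Compile("^(?:" + kv[1] + ")$")
	if err != nil {
		return TagQuery{}, err
	}
	return TagQuery{Key: kv[0], Value: re}, nil
}

// Matches reports whether the repository carries the tag with a matching value.
func (q TagQuery) Matches(r RepoMeta) bool {
	v, ok := r.Tags[q.Key]
	return ok && q.Value.MatchString(v)
}

// filterRepos returns the names of the repositories that satisfy q.
func filterRepos(repos []RepoMeta, q TagQuery) []string {
	var names []string
	for _, r := range repos {
		if q.Matches(r) {
			names = append(names, r.Name)
		}
	}
	return names
}

func main() {
	// Made-up metadata for the demo.
	repos := []RepoMeta{
		{Name: "repo-a", Tags: map[string]string{"owner": "google"}},
		{Name: "repo-b", Tags: map[string]string{"owner": "golang"}},
		{Name: "repo-c", Tags: map[string]string{}},
	}
	for _, s := range []string{"tag:owner=google", "tag:owner=go.*"} {
		q, err := parseTagQuery(s)
		if err != nil {
			panic(err)
		}
		fmt.Println(s, "->", filterRepos(repos, q))
	}
}
```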