oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/

Distributed backend to make OpenGrok scale on large deployments #779

Open MetaGerCodeSearch opened 10 years ago

MetaGerCodeSearch commented 10 years ago

While working on the MetaGer CodeSearch deployment ( http://code.metager.de/source/ ), we seem to have hit some roadblocks for a single machine.

We cover over 3100 repos with about 500GB of sources total, all on a machine with 48GB RAM and 24 cores. Oftentimes the Java/Tomcat/OpenGrok combination will simply break down and hog all available CPUs and RAM, probably due to some bugs that haven't received much exposure yet. Another factor may be that, as far as I understand, index searchers are spawned for every repo. I understand this might be a deployment size not everyone is willing to reach (or test for). :)

A distributed/clustered approach spanning several Tomcats could prove beneficial to OpenGrok installations such as ours, and others with fewer repos, files, and less disk space might find it helpful as well.

Thanks, Chris, CodeSearch project

cdgz commented 10 years ago

Hi,

As you might know, the goal of the indexer is to process only fresh files. The fact that it takes all available CPU/RAM on every cron run is definitely a misconfiguration, unless activity on your projects is incredibly high and the nightly deltas are constantly big.
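
To illustrate the "fresh files only" behavior, here is a minimal Lucene sketch of how an incremental indexer can skip unchanged files and replace only updated documents. The mtime check and the `path` field are assumptions for illustration, not OpenGrok's actual code:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IncrementalIndexSketch {
    // Only re-analyze files modified since the last indexing run; unchanged
    // files are skipped entirely, so a quiet night should cost next to nothing.
    static void indexIfChanged(IndexWriter writer, Path file, long lastRunMillis)
            throws Exception {
        if (Files.getLastModifiedTime(file).toMillis() <= lastRunMillis) {
            return; // fresh-files-only: nothing to do for this file
        }
        Document doc = new Document();
        doc.add(new StringField("path", file.toString(), Field.Store.YES));
        // ... analyzers would add content/symbol fields here ...
        // updateDocument atomically replaces any older document with the same path
        writer.updateDocument(new Term("path", file.toString()), doc);
    }
}
```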

I have dealt with some big OpenGrok installations, similar to what you've mentioned (even a bit heavier). In my experience, with proper tuning a usual indexing run over fresh code took 10 to 20 minutes, with minimal resource consumption (RAM jumped a bit during indexing, that's all).

The only advice is to track the indexer logs (`${DATA_ROOT}/../log/opengrok.1.0.log`) while it works. The output is rather verbose and can give precious information on where it gets stuck or spends most of its time. Maybe some files/paths should be added to IGNORE_PATTERNS?

Note that in old versions, if you are not using Derby to store the history cache, it is regenerated from scratch every night. This can be painful if your repos have a long VCS history. I ran into this in 0.10 and am not sure whether it is fixed now (more details in this discussion). Also take into account Trond's notes about the flip-flop indexing pattern (keeping two copies of the index and switching between them only during reindex).
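
For reference, a minimal sketch of that flip-flop switch in Java, assuming two index copies (`index.a`/`index.b`) and a `current` symlink that the webapp reads; the paths and layout are assumptions, not OpenGrok's actual scheme:

```java
import java.nio.file.*;

public class FlipFlopIndex {
    public static void main(String[] args) throws Exception {
        Path dataRoot = Paths.get("/var/opengrok/data");
        Path current = dataRoot.resolve("index");      // symlink read by the webapp
        Path active = Files.readSymbolicLink(current); // e.g. points at index.a
        Path standby = active.endsWith("index.a")
                ? dataRoot.resolve("index.b")
                : dataRoot.resolve("index.a");

        // 1. reindex into the standby copy (indexer invocation left out here)
        // runIndexer(standby);

        // 2. swap: build the new link beside the old one, then rename over it.
        //    On POSIX filesystems rename() replaces the target atomically, so
        //    searchers never see a half-written index.
        Path tmp = dataRoot.resolve("index.tmp");
        Files.deleteIfExists(tmp);
        Files.createSymbolicLink(tmp, standby);
        Files.move(tmp, current, StandardCopyOption.ATOMIC_MOVE);
    }
}
```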

The last and maybe the least: what is your FS? OpenGrok keeps its Lucene indexes on disk and serves every query from them, which is why some combinations (Solaris/ZFS) are preferable to others (Linux/ext3) in terms of overall performance.
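
On the Lucene-on-disk point, the directory implementation matters too. Lucene's `FSDirectory.open()` picks the best implementation for the platform (memory-mapped I/O on most 64-bit JVMs), which leans on the OS page cache whatever the filesystem. A small sketch with an assumed index path:

```java
import java.nio.file.Paths;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;

public class DirectoryChoice {
    public static void main(String[] args) throws Exception {
        // Let Lucene pick the best implementation for this platform
        // (MMapDirectory on most 64-bit JVMs).
        Directory auto = FSDirectory.open(Paths.get("/var/opengrok/data/index"));
        // Or request memory-mapped I/O explicitly.
        Directory mmap = new MMapDirectory(Paths.get("/var/opengrok/data/index"));
        auto.close();
        mmap.close();
    }
}
```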

vladak commented 10 years ago

The file-based history index has been regenerated incrementally since #305, which is in 0.12; however, it needs a fix for #818, so 0.12 will need a respin.

vladak commented 7 years ago

The problem with not reusing indexsearchers will be addressed in 0.13 (#1186 and others).
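
For context, reusing searchers is exactly what Lucene's `SearcherManager` provides; here is a minimal sketch of the pattern (not necessarily how #1186 implements it), with an assumed index path:

```java
import java.nio.file.Paths;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.FSDirectory;

public class ReusableSearchers {
    public static void main(String[] args) throws Exception {
        // One SearcherManager per index, created once and shared by all
        // queries, instead of opening a fresh IndexSearcher per request.
        SearcherManager manager = new SearcherManager(
                FSDirectory.open(Paths.get("/var/opengrok/data/index/myproject")),
                null);

        IndexSearcher searcher = manager.acquire();
        try {
            // run queries with the shared, already-warmed searcher ...
        } finally {
            manager.release(searcher); // always release; the manager refcounts readers
        }

        manager.maybeRefresh(); // pick up index changes after a reindex
        manager.close();
    }
}
```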

vladak commented 7 years ago

It seems to me that in order to do this, the history cache would have to be moved into the Lucene index as well (otherwise some distributed file system would have to be used). After all, documents have a 1:1 mapping to source code files for both the index and the history cache, so why not unite the two? @tarzanek, any insights?
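
A rough sketch of that idea, with made-up field names and a toy serialization: since the per-file Lucene document already exists, it could simply grow extra stored fields carrying the file's history instead of keeping a separate on-disk cache.

```java
import java.nio.charset.StandardCharsets;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

public class HistoryInIndex {
    // Hypothetical: serialize a file's VCS history (revisions, authors,
    // messages) into bytes; the format here is an assumption for illustration.
    static byte[] serializeHistory(String... revisions) {
        return String.join("\n", revisions).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One document per source file, carrying both index data and history.
        Document doc = new Document();
        doc.add(new StringField("path", "/myproject/src/main.c", Field.Store.YES));
        doc.add(new StoredField("history",
                serializeHistory("rev2: fix bug", "rev1: initial import")));
    }
}
```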

tarzanek commented 4 years ago

The question is whether Lucene as a backend can be configured to properly support a distributed NoSQL-style setup. Solr and Elasticsearch already do this, so the path is well understood (or look at Scylla, Aerospike, CouchDB, HBase, Mongo, Redis, ...). Doing so means keeping a bigger replication factor (RF) for the data and doing proper sharding, so a distributed OpenGrok depends on distributing the data and either configuring Lucene accordingly or leveraging Solr (or the like).
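
To make the Solr route concrete, here is a hedged SolrJ sketch of a sharded collection with a replication factor above one; the ZooKeeper host, collection name, configset, and field names are all assumptions for illustration:

```java
import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.common.SolrInputDocument;

public class DistributedIndexSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a SolrCloud cluster via ZooKeeper (host is an assumption).
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                List.of("zk1:2181"), Optional.empty()).build()) {

            // "Proper sharding" and a "bigger RF": 4 shards, 2 replicas each.
            CollectionAdminRequest
                    .createCollection("opengrok", "_default", 4, 2)
                    .process(client);

            // One document per source file, routed to a shard by its id.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "/myproject/src/main.c");
            doc.addField("content_txt", "int main(void) { return 0; }");
            client.add("opengrok", doc);
            client.commit("opengrok");
        }
    }
}
```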

idodeclare commented 4 years ago

@tarzanek, Solr seems most tractable, but I would hope to see the internal APIs reworked so that either local Lucene or distributed Solr is a choice left to the user.

vladak commented 4 years ago

I have never operated a distributed backend; however, I quite like what is presented at https://github.blog/2019-03-05-vulcanizer-a-library-for-operating-elasticsearch/