oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/

OpenGrok index performance.. #802

Closed. martin83tmp closed this issue 5 years ago.

martin83tmp commented 10 years ago

Hi, OpenGrok experts.

I'm Martin. I manage five Android projects with OpenGrok at my company, and I run 'OpenGrok index src_folder' every day to pick up new source. Indexing speed does not seem sufficient for my Android projects. (I know Android projects are heavy to index.)

android_prj_1
android_prj_2
android_prj_3
android_prj_4
android_prj_5

Indexing takes too much time. Is there a way to run 'OpenGrok index' with multiple threads? Or could you help me deal with this? Any ideas?

MetaGerCodeSearch commented 10 years ago

Hello Martin,

I'm not an expert on OpenGrok, but will try to answer things.

It depends a lot on how large your projects are and what the contents are (lots of history and such). For smaller projects this kind of performance isn't expected, and adjusting the configuration is advisable.

First, the OpenGrok shell script usually starts a couple of 'Exuberant Ctags' tasks, so there's already multithreading in it. :)
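For what it's worth, you can confirm those parallel ctags tasks while a reindex is running. A minimal sketch, assuming the workers show up under the process name 'ctags' (on some distributions the binary is 'ctags-exuberant'):

```shell
# Count ctags worker processes during an index run; prints 0 when none
# are running. The '[c]tags' bracket pattern keeps grep from matching
# its own command line.
ps -e | grep -c '[c]tags' || true
```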

Second, check if and how you generate history information. With lots of history entries it might be necessary to use Apache Derby (JavaDB) for the history entries, instead of the file system.

Also, there's a line in the OpenGrok script where you can adjust the RAM usage of the indexer process. Feel free to experiment if you have enough memory available - see https://github.com/OpenGrok/OpenGrok/blob/master/OpenGrok#L296 for the line.
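To make the experiment concrete, here is a minimal sketch, assuming the wrapper script honors a JAVA_OPTS environment variable around the linked line (the variable name and default differ between releases, so check the script itself; the 4 GB heap is only an example value):

```shell
# Give the indexer JVM a larger heap for the next reindex.
export JAVA_OPTS="-Xmx4096m"

# Then reindex as usual; skipped gracefully here if the wrapper script
# is not on the PATH.
{ command -v OpenGrok >/dev/null && OpenGrok index /var/opengrok/src; } || true
```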

martin83tmp commented 10 years ago

Many thanks for your advice.

As you mentioned, indexing is already multithreaded, so I can drop that idea. :) I will experiment with the RAM usage of the indexer process. But one more thing: I do not know how to check whether I am using JavaDB or not. Could you please tell me how to check this?

MetaGerCodeSearch commented 10 years ago

Check if your installation sets OPENGROK_DERBY=True or something similar in the environment variables of the user running OpenGrok. Apache Derby / JavaDB also needs to be installed on your system. By default it runs on port 1527 (localhost) as a Java process. The process should show up as "/usr/local/bin/java -Dderby.system.home=..." or something like that; it depends on your operating system.
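The checks above can be condensed into a small heuristic. The process pattern and the message texts are mine; only the -Dderby.system.home marker and port 1527 come from the description above:

```shell
# Heuristic: look for a running Derby/JavaDB network server process.
# Derby is typically visible via its -Dderby.system.home JVM property
# and listens on localhost:1527 by default. The bracketed pattern
# avoids pgrep matching this script's own command line.
if pgrep -f 'derby[.]system[.]home' >/dev/null 2>&1; then
    echo "Derby process found - JDBC history cache may be in use"
else
    echo "no Derby process - history cache is probably file based"
fi
```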

If you don't remember installing and activating it, chances are you don't have Derby running at all. In that case the indexer will produce history entries in the filesystem, which can be a bottleneck.

If you have other questions, please describe your installation in detail including operating system and storage when asking. :)

vladak commented 10 years ago

Basically, indexing works in 2 phases:

1. History cache phase: the history of all changesets is generated/refreshed. This is done by running the log command of the particular SCM on the top level directory (for those SCMs which support it - I assume your repos are SVN?), parsing the output and storing it in the history cache (be it file based or JDBC based). This phase is completely parallelized.

2. Lucene index phase: Exuberant ctags is used to generate information about symbols in the source code, xrefs are generated, and the tokens and history are stored in the Lucene index. This step is also parallelized.

There are various tunables which can be used to increase the level of parallelism for both phases. They are based on the number of online CPUs in the system.
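To see the baseline those tunables start from, a quick one-liner (nproc is Linux coreutils; the getconf fallback is POSIX):

```shell
# Print the number of online CPUs, which the parallelism tunables
# are based on by default.
nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
```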

According to the above it seems you are not talking about an initial reindex but rather an incremental reindex - please confirm.

Also, what OpenGrok version are you using? What is the exact command you are using to do the reindexing?

Lastly, do you see performance degradation over time, or in other words, has it always been like that? What performance are you expecting and what performance are you getting?

Also, take a look at https://github.com/OpenGrok/OpenGrok/blob/master/README.txt#L831 (Tuning OpenGrok for large code bases)

vladak commented 10 years ago

If you are running 0.12-rc7+ you can see the elapsed time of the first indexing phase by doing (assuming OpenGrok lives in the standard location under /var/opengrok):

grep 'Done historycache for all' /var/opengrok/log/*.log

As for the second phase, I will make the indexer put something in the logs.

vladak commented 10 years ago

Also, making sure that the hardware can scale is a good thing. The workload generated by OpenGrok is very much mixed: in the history phase of the index the SCMs themselves consume lots of CPU cycles and I/O by querying their metadata, and then again lots of CPU power and I/O is needed to encode the history objects into XML and store them on disk (in the case of a file based history cache - for JDBC it would possibly be more CPU bound). Then in the Lucene index phase lots of read I/O is done to traverse all the repositories and feed the sources into ctags, which in turn consumes CPU cycles by parsing the files, and then of course Lucene takes some CPU cycles and I/O to construct the index and write it to disk.

Depending on the OS it is a good idea to have enough RAM for system buffers (especially the buffer cache for the file system). Some serious I/O backend is also handy. For our internal deployment we are using 2 separate disk arrays (one for source code, one for index data), each with 10k RPM disks organized into RAID5 equivalent groups, connected to a machine with 48GB of RAM and 24 CPU threads, each running at 3GHz. The code indexed is far larger than 5 repositories, though.

tarzanek commented 10 years ago

https://github.com/OpenGrok/OpenGrok/blob/master/README.txt#L831 might help as well if you're on 0.12RCs

naseer commented 10 years ago

We index several Android repositories every night as well. I have moved to the 0.12 RCs and seen significant performance improvements. A few more things we have done: eliminate history indexing for most repos except the main one, tune the settings as suggested by @tarzanek above, and get a faster machine and storage :)
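For the "eliminate history indexing" part, here is a sketch of what that can look like with the wrapper script of that era. OPENGROK_GENERATE_HISTORY and its value are an assumption on my side, so check your release's OpenGrok script for the exact variable and accepted values:

```shell
# Skip history cache generation for the nightly reindex (the variable
# name is an assumption - verify against your OpenGrok wrapper script).
export OPENGROK_GENERATE_HISTORY=off

# Reindex as usual; skipped gracefully if the wrapper is not on the PATH.
{ command -v OpenGrok >/dev/null && OpenGrok index /var/opengrok/src; } || true
```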

tarzanek commented 10 years ago

@naseer I was hoping history parsing would be faster than in 0.11, too ...

naseer commented 10 years ago

@tarzanek It is better, but it does consume a lot of CPU (with Java DB), especially on branches with a lot of history like the kernel or frameworks/base - we just chose to disable it for older branches that don't really need it.

tulinkry commented 5 years ago

No reaction from the author for years. The current latest version, 1.1.2, contains a different parallelization approach for indexing, making it faster. A candidate for closing.