Closed: ChristopheBordieu closed this issue 3 years ago.
The history cache update for Git is incremental, so the long history cache run has to be caused by either the sheer number of repositories and/or the processing overhead. I wonder if replacing the executions of the git binary in the GitRepository class with a pure Java implementation (say Eclipse JGit, #1113) would help things (making it hopefully more I/O bound).
Vlad, can you elaborate on "incremental", please? We have a huge GHE repo that has over 50k commits. At the top level, the OpenGrokDirHist.gz file is over 4 GB uncompressed. Yesterday just one file was updated in the top-level dir and about 8 files changed in various subdirs. The history cache generation for just the top-level dir took 1 hour and 40 minutes, and this happens every night because they update the file every night. Incremental to me would be to run git log and stop when I hit the sha1 saved in the OpenGroklatestRev file.
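For illustration only, the kind of incremental query being asked for here might look like the shell sketch below; the location of the saved-revision file is an assumption, not OpenGrok's actual layout, and the git log options are trimmed down:

```sh
# Hypothetical incremental history fetch: read the sha1 saved by the
# previous run, then ask Git only for the commits made since then.
# The path to the saved-revision file is illustrative.
latest=$(cat /var/opengrok/data/historycache/PROJ/repo/OpenGroklatestRev)
git log --name-only --pretty=fuller "${latest}..HEAD"
```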
When the history cache is being refreshed, the new history is appended to the existing history. In practice this means that each .gz file needs to be decompressed, deserialized (from the XML form) into a History object, the new history is appended there in the form of HistoryEntry objects, and then it is serialized into XML form and compressed again. This happens for each affected file, hence for the top-level history as well. What may be happening is that for the big top-level file that holds the complete history of the repository, the (de)serialization/(de)compression (combined with the I/O) is taking too long. Maybe there should be a tunable to disable the top-level history, or another, more efficient way to represent it.
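To make the cost concrete, here is a minimal, self-contained Java sketch of that cycle, assuming XML bean (de)serialization over GZIP streams as described above. It is not OpenGrok's actual FileHistoryCache code; the History and HistoryEntry classes are simplified stand-ins:

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of the per-file refresh cycle described above: every refresh pays
// for a full decompress + XML decode of the existing history, then a full
// XML encode + recompress, even if only one entry is appended.
public class HistoryCacheAppendSketch {

    // Simplified stand-in for OpenGrok's HistoryEntry.
    public static class HistoryEntry {
        private String revision;
        private String message;
        public String getRevision() { return revision; }
        public void setRevision(String r) { revision = r; }
        public String getMessage() { return message; }
        public void setMessage(String m) { message = m; }
    }

    // Simplified stand-in for OpenGrok's History.
    public static class History {
        private List<HistoryEntry> entries = new ArrayList<>();
        public List<HistoryEntry> getEntries() { return entries; }
        public void setEntries(List<HistoryEntry> e) { entries = e; }
    }

    static History load(Path gz) throws IOException {
        if (!Files.exists(gz)) {
            return new History();
        }
        // The whole existing history is decompressed and deserialized.
        try (XMLDecoder dec = new XMLDecoder(
                new GZIPInputStream(Files.newInputStream(gz)))) {
            return (History) dec.readObject();
        }
    }

    static void store(Path gz, History h) throws IOException {
        // ... and then serialized and compressed again in full.
        try (XMLEncoder enc = new XMLEncoder(
                new GZIPOutputStream(Files.newOutputStream(gz)))) {
            enc.writeObject(h);
        }
    }

    public static void main(String[] args) throws IOException {
        Path cache = Paths.get("file1.c.gz"); // illustrative cache file name
        History history = load(cache);

        HistoryEntry entry = new HistoryEntry(); // the one new changeset
        entry.setRevision("abc1234");
        entry.setMessage("example commit");
        history.getEntries().add(entry);

        store(cache, history);
    }
}
```

The point is that appending a single HistoryEntry still re-reads and re-writes the entire file, which is why a multi-GB top-level history file dominates the refresh time.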
I'd like to know what it is during the history cache refresh that takes so much time. PR #3438 provides logging (at the FINER log level) of the refresh times for individual files and the top-level directory.
@vladak Is this question for my runs? Because at the moment, no history indexing is done.
Yeah, I was wondering if there are any outliers in the history cache generation.
Ok. Will come back to this issue when the new one I have just opened is fixed on my side. Actually, even without history cache generation, I am not able to finish indexing my code.
Hi @vladak
I am coming back to this issue. I have started a new run (OG 1.6.3) allowing history (-P -S -G -H, -m 256 -Xmx20g -server).
The run is still ongoing. It is super long... Indexing seems stuck, but it is not: all files are processed, just very slowly. I have logs like:
...
2021-03-23 10:01:20.028+0000 FINER t243645 Statistics.logIt: Done storing history cache for /PROJ/gcc/.../file1.c (took 0:52:00)
2021-03-23 10:22:44.594+0000 FINER t243645 Statistics.logIt: Done storing history cache for /PROJ/.../file2.c (took 0:20:43)
2021-03-23 10:36:29.517+0000 FINER t243645 Statistics.logIt: Done storing history cache for /PROJ/.../file3.adb (took 0:13:13)
...
Some files take more than 1h40 to store their history cache. For each new file that is logged, this step takes at least ~10 minutes.
With htop, I see only the GC threads, running at around 100%. I don't know if that is normal. I attach the thread dump here.
Could you help me? Are my indexer options badly set? Is my Xmx too big? Too small? Is it an issue because of the size of the history of some repos? The number of Git tags?
I have no idea how to improve the runtime.
If the GC is eating most of the CPU cycles, it is time to raise the heap size, I'd say, even though I'd expect an OutOfMemoryError to be thrown. Does it take minutes to store history for all files or just for some? How big is the history for these files?
If you want to see what is happening with the heap during reindex, there is a way to do that via StatsD; see https://github.com/oracle/opengrok/wiki/Monitoring#indexer.
What about the system health metrics? Free memory? Is the system swapping, perhaps? The heap size should be accounted for in the system memory (RAM + swap), especially on systems that employ memory reservations.
No OOM exception. All files now take minutes to store history; at the very beginning it was quick. The history files, once created, are not big: for example, a text file of ~500k took ~1h to write a history .gz file of ~4k. Swap is disabled on the VM. Total 32G, free ~200M, used 22G, buffer/cache ~8G, available ~8G.
Looking at the stack dump, the times of the GC threads are indeed "interesting", e.g.:
"GC Thread#4" os_prio=0 cpu=53736794.31ms elapsed=61344.07s tid=0x00007f7b04005000 nid=0x15358b runnable
"GC Thread#5" os_prio=0 cpu=53767817.04ms elapsed=61344.07s tid=0x00007f7b04006800 nid=0x15358c runnable
How many CPUs does the VM have? Maybe you can also try lowering the number of indexer and history threads.
Also, you can get some basic heap statistics using jmap.
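For reference, two standard jmap invocations that work on JDK 11 (the <pid> placeholder stands for the indexer JVM's process id):

```sh
# Class histogram of live objects (forces a full GC first).
jmap -histo:live <pid>

# Full heap dump for offline analysis in MAT or YourKit.
jmap -dump:live,format=b,file=indexer-heap.hprof <pid>
```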
The VM has 10 CPUs. I generated a heap dump (33G!); a colleague will try to look at it. All the long times come from 3 repos: I deleted them and will see if it helps.
Good luck. We can recommend YourKit for heap analysis.
Thanks :-) What is the OpenGrokDirHist.gz file at the root level of each repo? Is it an aggregation of the history of all the files?
That's the complete history of the repository.
Ok, coming from a git log command, I guess. So, even if the history is very long, this file should be created quite quickly.
Yep. There is definitely some overhead associated with the processing of this file (I/O, XML encoding/decoding). I was contemplating whether to introduce a tunable to disable it.
When is it used? In the xref view at the root level, with the History button?
Yes. Also when generating the history during reindex I think.
Is this file supposed to have a size in the order of GB? For a repo with a huge history and a huge number of merge commits with long comments (diffs, because of the -m option used to generate it). I think this is the reason the heap memory is consumed, so the GC threads are always running, slowing down the indexing. The history is so huge that when I run the exact same command as OG, it ends with warnings: warning: inexact rename detection was skipped due to too many files. warning: you may want to set your diff.renameLimit variable to at least 15568 and retry the command.
Is it possible to limit the number of commits retrieved for any repo? I am interested in a tunable to disable or limit this file.
Why is the -m used?
I have maybe missed an option to disable rename detection while getting history. Could you please remind me what it is? --renamedHistory on|off Enable or disable generating history for renamed files. If set to on, makes history indexing slower for repositories with lots of renamed files. Default is off. So it is ok, renames should not be handled. So why the -m?
Is this file supposed to have a size in the order of GB? For a repo with a huge history and a huge number of merge commits with long comments (diffs, because of the -m option used to generate it).
Yes, for repositories with rich history. It is the complete history of the repository serialized as XML.
I think this is the reason the heap memory is consumed, so the GC threads are always running, slowing down the indexing.
It might be; better to perform heap analysis before drawing conclusions. However, the history still needs to be generated for the initial index.
The history is so huge that when I run the exact same command as OG, it ends with warnings: warning: inexact rename detection was skipped due to too many files. warning: you may want to set your diff.renameLimit variable to at least 15568 and retry the command.
These warnings come from Git I believe.
Is it possible to limit the number of commits retrieved for any repo?
I'd rather not introduce a tunable for that, as it could cause confusion.
I am interested in a tunable to disable or limit this file.
Why is the -m used?
This is a Lucene-specific tunable. It might give index writing a boost, I believe.
Nope, not this one. I mean the -m which is used when the indexer runs the git log command for the repository:
2021-03-23 17:19:59.280+0000 FINE t243626 Executor.exec: Executing command [/usr/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict -m -- /srv/opengrok-src/PROJ/repo]
If I understand correctly, this argument makes git log list all the files that were updated by a merge commit. In our case, that list can be long (hence the Git warning), so maybe it increases the amount of data to manage in memory for the XML generation.
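One way to check this hypothesis is to compare the output volume of the command from the log above with and without -m (run from inside the repository, as the indexer does); on a merge-heavy repository the -m variant can be much larger because every merge commit is diffed against each of its parents:

```sh
# Exact command from the indexer log, with -m:
git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller \
    --date=iso8601-strict -m -- /srv/opengrok-src/PROJ/repo | wc -c

# The same command without -m, for comparison:
git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller \
    --date=iso8601-strict -- /srv/opengrok-src/PROJ/repo | wc -c
```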
Yes, the git -m option causes all kinds of problems for repositories with large history - #3243 as well as #3367. In the short term I plan to introduce a tunable to disable it.
The problems described in #3243 are indeed exactly the same as mine.
Hi @vladak, I have a MAT session ongoing. What would you like me to look for?
The first thing to look for is how much heap space is used and how much is eligible for garbage collection. Then try to find the biggest chunks of the used space and identify which objects they correspond to.
FYI, in parallel I have removed all the huge repos (2, with their forks): the history has been generated and the indexing is close to finishing. My issue is that these removed repos still have to be indexed, so, very likely, I will go back to 1.3.16 for my final runs.
Note that 1.6.5 has a tunable to disable Git merge commits. This should help with memory consumption.
Ok. Will test this on my STG environment. Will let you know.
Hi @vladak
Many thanks for 1.6.5 and the new parameter to disable Git merge commits! I am now able to get the full history cache data generated :-) The MAT session showed that the biggest heap consumers were the threads processing the repos with huge history/merge commits/files.
I am closing this issue since my original problem is fixed by 1.6.5.
I still have some trouble (OOM killer) pushing the new configuration to the OpenGrok service at the end of the indexing job (sometimes it finishes, sometimes not). I will open a new issue if I cannot fix this myself.
Again, many thanks for your help!
Is your feature request related to a problem? Please describe.
Let's say yes. I am reindexing all my company's Git repos (30k) with the latest OG 1.5.11 / Tomcat 9 / Java 11 / RHEL 8.3. History cache generation is extremely long...
Describe the solution you'd like
When reindexing, I would like to:
1a. Run the indexer without the history cache.
1b. In parallel, run another indexer process with the history cache only, that is, skipping file indexing, since that is done in a short time by the process of step 1a.
2. Each night, run the indexer without the history cache.
3. When the history cache run started in step 1b finishes OK, then and only then, have the nightly job from step 2 add the history cache too.
A minimal sketch of this scheme is given after this list.
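Here is that sketch for steps 1a, 2, and 3, using only the indexer options quoted earlier in this thread. The invocation style (java -jar vs. the opengrok-indexer wrapper) and the source/data paths are site-specific and elided with "..."; step 1b would need a history-only mode that this sketch does not attempt to spell out:

```sh
# Steps 1a and 2: initial and nightly reindex without the history cache
# (same options as quoted above, minus -H).
java -Xmx20g -server -jar opengrok.jar -P -S -G -m 256 ...

# Step 3: once the separate history cache build (step 1b) has completed,
# add -H back to the nightly run.
java -Xmx20g -server -jar opengrok.jar -P -S -G -H -m 256 ...
```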
Describe alternatives you've considered
Optimizing the history cache generation? An issue will be opened depending on your answer to this one. But in any case, because of the size of our source code, the history cache process has always been very long.
Additional context
Some of our repos have a very long history, with many tags and also many files. This does not help generating the history quickly.