oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/

generate historycache for directories #1704

Open totembe opened 7 years ago

totembe commented 7 years ago

I have one particularly active project that has 9265 commits, which span 371 history pages. It takes about 15 seconds to load each page. The project's git repository is 890 MB.

I have another project with a total of 2717 commits, where each page loads in 1-2 seconds, which is far better and an acceptable duration. This project is about 130 MB.

I have a third project with a total of 6955 commits; it is 93 MB and each page loads in 1-2 seconds.

It seems that as the size of the git repository increases, the load time increases. Is it possible to optimize this?

I am running in VirtualBox on an SSD, with an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz and 64 GB of RAM on the host. The VM is allocated 4 cores with a 100% execution cap and 16 GB of RAM.

vladak commented 7 years ago

Do you use the history cache? If you run the git log commands by hand, do the times correspond to the times observed in OpenGrok?

totembe commented 7 years ago

git log returns instantly.

I don't know of any history cache setting; I am using OpenGrok with stock settings. There is a historycache directory under the /var/opengrok/data folder:

    1.7G    ./index
    788M    ./historycache
    1.6G    ./xref
    4.0G    .

vladak commented 7 years ago

That's strange. Could you inspect/instrument what the systems (both client and server) are doing during those 15 seconds? E.g. is it the client that is CPU loaded, or is the server performing some heavy I/O?

Also, what OpenGrok version are you running? Since #1049, even if the history for a given directory/file is very long, the output is paginated, so it should not take too long to display/render.

totembe commented 7 years ago

Both the requesting client and the OpenGrok server are located on my personal office machine. The machine is mostly idle. OpenGrok serves a small team of 10 people.

[screenshot: system activity during the slow request]

I marked the start of my request with S and the end of the page render with E.

I am using OpenGrok-1.1-rc8.

totembe commented 7 years ago

I found the culprit. As you pointed out, git is the source of the delay. When I dumped the git log to a text file, it took 9 seconds. I failed to spot this at first because git pipes the log directly to less, which displays output instantly; I had assumed the process finished first and displayed the result afterwards.

    time /opt/git/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict > /home/ethem/test.txt

    real    0m8.994s
    user    0m7.372s
    sys     0m1.588s

Instead of a full dump, the log retrieval can be optimized by using the --skip option and closing the pipe once the required data has been acquired.

E.g.:

    time /opt/git/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict --skip=10 | head -n 100 > /home/ethem/test.txt

    real    0m0.014s
    user    0m0.008s
    sys     0m0.004s

totembe commented 7 years ago

As I am new to git, I only now understand that this is a git problem.

Repacking the repository with git repack -a decreased the duration from 8 seconds to 2.5 seconds. That is still not responsive enough for navigating between history log pages. The log output itself is only a few megabytes (3.2 MB in my instance), so the git log output could be cached and OpenGrok could use that file.

vladak commented 7 years ago

Well, if the history cache is used, running git log when displaying the history view page is avoided, because the history is stored in a compressed XML file (basically representing a set of HistoryEntry objects) under the historycache directory.
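For illustration, a minimal sketch of what such an entry's serialization could look like, assuming a gzip-compressed java.beans.XMLEncoder stream; the Entry bean and store() helper are hypothetical simplifications, not OpenGrok's actual FileHistoryCache code:

    import java.beans.XMLEncoder;
    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    public class HistoryCacheSketch {
        // Hypothetical, simplified stand-in for OpenGrok's HistoryEntry bean.
        public static class Entry {
            private String revision;
            private String message;
            public Entry() { }
            public String getRevision() { return revision; }
            public void setRevision(String revision) { this.revision = revision; }
            public String getMessage() { return message; }
            public void setMessage(String message) { this.message = message; }
        }

        // Write one entry as a gzip-compressed XML file, mirroring the
        // "compressed XML under historycache" layout described above.
        public static void store(Entry entry, String path) throws IOException {
            try (XMLEncoder enc = new XMLEncoder(new GZIPOutputStream(
                    new BufferedOutputStream(new FileOutputStream(path))))) {
                enc.writeObject(entry);
            }
        }
    }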

Do you run the indexer with the -H option?

totembe commented 7 years ago

I tried with

sudo OPENGROK_GENERATE_HISTORY=on /opt/opengrok-1.1-rc8/bin/OpenGrok index /home/ethem/og/src

I couldn't find any compressed XML files under /var/opengrok/data/historycache.

Am I missing something?

vladak commented 7 years ago

So what are these 788M in the historycache directory, then?

If you have a project called foo that has a file bar.txt located directly in it (i.e. /var/opengrok/src/foo/bar.txt exists and /var/opengrok/src/foo is a Git/Mercurial/... repository), its historycache entry will be located in the file /var/opengrok/data/historycache/foo/bar.txt.gz. If that file is not there, the history view in the webapp has to resort to running git log directly, hence the delay you're seeing.
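For illustration, a minimal sketch of that source-path-to-cache-path mapping (the cachePathFor helper is hypothetical):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class CachePathCheck {
        // Map a source file to the cache entry location described above:
        // /var/opengrok/src/foo/bar.txt ->
        // /var/opengrok/data/historycache/foo/bar.txt.gz
        static Path cachePathFor(Path srcRoot, Path dataRoot, Path file) {
            return dataRoot.resolve("historycache")
                           .resolve(srcRoot.relativize(file) + ".gz");
        }

        public static void main(String[] args) {
            Path src = Paths.get("/var/opengrok/src");
            Path data = Paths.get("/var/opengrok/data");
            Path entry = cachePathFor(src, data, src.resolve("foo/bar.txt"));
            System.out.println(entry + (Files.exists(entry) ? " exists" : " is missing"));
        }
    }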

It seems that history cache generation failed for some reason. Do the indexer logs contain anything of interest?

totembe commented 7 years ago

There are .gz files for each code file in our projects. While browsing a single code file's history I didn't have any issue, so I think those files are per-file caches.

My problem is with browsing the history of the entire repository.

URL (takes time): http://opengrokhost/source/history/foo

URL (no problems at the moment): http://opengrokhost/source/history/foo/bar.txt

vladak commented 7 years ago

Aha! :-) The per-directory history is not cached, so git log is run every time. The reason lies in how the history cache is created: it uses a trick to convert per-repository history into per-file history by mapping changesets to the files changed in them and then inverting this map. I am not sure this trick can be used for creating a per-directory history cache. If it can, it will certainly demand more space for storing the historycache, and it will make reindexing longer too.
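For illustration, a tiny sketch of that map inversion, with changesets and files reduced to plain strings (the types are simplified and the invert helper is hypothetical):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class InvertHistory {
        // Given per-repository history as changeset -> files changed in it,
        // invert it into per-file history as file -> changesets touching it.
        static Map<String, List<String>> invert(Map<String, List<String>> byChangeset) {
            Map<String, List<String>> byFile = new HashMap<>();
            byChangeset.forEach((changeset, files) ->
                files.forEach(f ->
                    byFile.computeIfAbsent(f, k -> new ArrayList<>()).add(changeset)));
            return byFile;
        }
    }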

The other option is to create the history cache for directories on demand. That way, only the first display would take a long time; subsequent displays (leveraging the incremental history generation using the OpenGroklatestRev file) would be fast (as long as not too many history entries have been added).

vladak commented 7 years ago

Another idea would be to store the historycache at least for the top-level directory of a given repository, since its history is available anyway, i.e. change FileHistoryCache.java#store() to store the history parameter in a file. Filed #1716 to track this.

vladak commented 7 years ago

The reason why history is not cached for directories is given in FileHistoryCache.java#get():

            // Don't cache history-information for directories, since the
            // history information on the directory may change if a file in
            // a sub-directory change. This will cause us to present a stale
            // history log until a the current directory is updated and
            // invalidates the cache entry.

So if a directory cache is implemented, that would mean traversing the directory hierarchy all the way up from the changed file and invalidating all directory cache entries along the way. Or devising a better solution.
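A minimal sketch of what such upward invalidation could look like, assuming per-directory cache entries mirror the source tree under the cache root (the ".dir_history.gz" entry name and the helper are hypothetical):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class DirCacheInvalidation {
        // Walk from the changed file's parent directory up to the repository
        // root and delete each level's per-directory cache entry, so that a
        // stale history page is never served.
        static void invalidateUpwards(Path srcRoot, Path cacheRoot, Path changedFile)
                throws IOException {
            Path dir = changedFile.getParent();
            while (dir != null && dir.startsWith(srcRoot)) {
                Path rel = srcRoot.relativize(dir);
                Files.deleteIfExists(cacheRoot.resolve(rel).resolve(".dir_history.gz"));
                dir = dir.getParent();
            }
        }
    }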

totembe commented 7 years ago

The latest revision hash can be parsed from git log. After parsing the first record, the command's output pipe can be closed, since the rest of the records are not needed; that gives a performance boost. Once the latest revision hash is obtained, it can be compared with the revision hash associated with the history cache, and if they differ, the cache can be invalidated and regenerated. I tried this for subdirectories and git log works.

vladak commented 7 years ago

The latest changeset is easy to acquire via git log -n1 (plus some output templating); no need to close the pipe.
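For example, a minimal sketch of fetching that changeset from Java via a subprocess (the latestRevision helper is hypothetical; git log -n1 --pretty=format:%H prints just the hash):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class LatestRev {
        // Get the latest changeset hash of a repository by running
        // "git log -n1" with a minimal output template; the whole output
        // is a single line, so there is no pipe to cut.
        static String latestRevision(File repoRoot) throws IOException, InterruptedException {
            Process p = new ProcessBuilder("git", "log", "-n1", "--pretty=format:%H")
                    .directory(repoRoot)
                    .start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String rev = r.readLine();
                p.waitFor();
                return rev;
            }
        }
    }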

Anyhow, there are (at least) two different ways to approach this:

  1. Do not cache anything and just cut the number of log entries using the -n option of git log. If the first page is displayed, cfg.getSearchMaxItems() changesets will have to be fetched, for the second page double that, etc. This would only work if there is a cheap way to retrieve the number of changesets for a given directory, since that is needed to construct the slider in history.jsp:
        // We have a lots of results to show: create a slider for them
        request.setAttribute("history.jsp-slider", Util.createSlider(start, max, totalHits, request));
  2. Get the full history and cache it on the first request. Invalidate and refetch the history using the approach described above.

The first option has the advantage that it might be fast for the first couple of history pages, but it gets progressively worse (assuming the history is not cached for the session). Also, git log has to be called for each page.

The advantage of the second option is that once the cache is populated and valid, the history fetch will be quick. However, the first request will always be slow. Also, if the repository changes often and reindexing is done often too, the cache will be mostly invalid, saving no time.
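For illustration, a minimal sketch of the second option's control flow, combining the on-demand build with the revision check suggested above (all names here are hypothetical stand-ins, not OpenGrok's actual API):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class OnDemandDirHistory {
        static class Cached {
            final String revision;
            final List<String> entries;
            Cached(String revision, List<String> entries) {
                this.revision = revision;
                this.entries = entries;
            }
        }

        private final Map<String, Cached> cache = new HashMap<>();

        // Build the per-directory cache on the first request; invalidate it
        // whenever the repository's latest revision no longer matches the
        // one recorded with the cache.
        List<String> history(String dir) {
            String latest = fetchLatestRevision(dir);   // cheap: git log -n1
            Cached c = cache.get(dir);
            if (c == null || !c.revision.equals(latest)) {
                // Slow path: first request, or the repository changed.
                c = new Cached(latest, fetchFullHistory(dir));
                cache.put(dir, c);
            }
            return c.entries;                           // fast path thereafter
        }

        // Stubs; a real implementation would shell out to the SCM.
        String fetchLatestRevision(String dir) { return "deadbeef"; }
        List<String> fetchFullHistory(String dir) { return List.of(); }
    }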

totembe commented 7 years ago
Method 1 seems better.

The number of changesets for a given directory (the total needed for the slider) can be acquired with git rev-list --count --all $subdir.

Edit: for the repository root directory, omitting $subdir performs much better:

    $ time git rev-list --count --all
    9304

    real    0m0.041s
    user    0m0.036s
    sys     0m0.004s

    $ time git rev-list --count --all .
    9304

    real    0m0.686s
    user    0m0.612s
    sys     0m0.068s

Performance won't degrade with the --skip option:

    git log -n $history_per_page --skip $(( (page - 1) * history_per_page )) $subdir

(I tried 0 for the first page and it works.)

vladak commented 7 years ago

Well, it should work not only for git, but ideally for other SCMs that support per-directory history retrieval as well.