oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/

Can't index 19GB source files with OpenGrok1.0 #2407

Open Lokendra-Saini opened 5 years ago

Lokendra-Saini commented 5 years ago

Hello all,

I am trying to index 19 GB of source files with OpenGrok 1.0, but indexing fails every time. The relevant information:

Java - jdk1.8.0_51
Ant - apache-ant-1.9.0
Source files - 19 GB (single project)

My changes to the OpenGrok script:

OPENGROK_GENERATE_HISTORY=off
OPENGROK_SCAN_REPOS=false
OPENGROK_VERBOSE=true
OPENGROK_FLUSH_RAM_BUFFER_SIZE="-m 256"

export OPENGROK_GENERATE_HISTORY
export OPENGROK_SCAN_REPOS
export OPENGROK_VERBOSE
export OPENGROK_FLUSH_RAM_BUFFER_SIZE
JAVA_OPTS="${JAVA_OPTS:--Xmx32g -d64 -server}"

I am using a shared machine on which I am allowed to use up to 32 GB of RAM, but the indexing process gets killed because it reaches the memory usage limit. I also tried JAVA_OPTS="${JAVA_OPTS:--Xmx16g -d64 -server}", however that setting throws a "java.lang.OutOfMemoryError: Java heap space" error.

Please help me with setting up OpenGrok to index large sources.

Thanks in advance!!

vladak commented 5 years ago

Could you try with the latest 1.1 rc?

Lokendra-Saini commented 5 years ago

I need to use OpenGrok 1.0 for my use case. Could you please suggest any changes for that? I will also try OpenGrok 1.1 and update here.

vladak commented 5 years ago

If you really need to use 1.0, then run the indexer with -XX:+HeapDumpOnOutOfMemoryError (and -XX:HeapDumpPath= to specify where the file should be saved) and take a look at the dump with MAT (https://www.eclipse.org/mat/) to see what the biggest consumers are.
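For reference, a minimal sketch of what that invocation could look like (the opengrok.jar path, dump location, and heap size below are placeholders, not details from this thread):

java -Xmx16g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/opengrok-oom.hprof \
     -jar /opengrok/dist/lib/opengrok.jar -s /src -d /var/opengrok/data

When the indexer hits the heap limit, the JVM writes the .hprof file, which can then be opened in MAT.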

vladak commented 5 years ago

Any luck?

Shooter3k commented 5 years ago

Our repo is around 450 GB. With version 1.0 or older, I've found it's best to create a small index (with a subset of files) and/or remove the files that cause the memory errors until you can get a full index built, and then slowly take the ignore-files options back out.
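As a rough illustration of that incremental approach (the paths and ignore patterns here are hypothetical; only the -i ignore option itself comes from the indexer, as seen in the script later in this thread):

java -Xmx16g -jar opengrok.jar -s /src -d /var/opengrok/data -i '*generated*' -i '*.vdx'

Once that subset indexes cleanly, drop (or narrow) the -i patterns on the next run so the previously skipped files get picked up incrementally.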

vladak commented 5 years ago

What are these files?

Shooter3k commented 5 years ago

Corporate code files from multiple code repositories.

vladak commented 5 years ago

I mean - is there something special about them?

Shooter3k commented 5 years ago

I wouldn't consider them special at all. Just many many years and lines of code, plus random other files included in any corporate code repository.

I was just making the point that 19 GB of source files isn't very large in comparison, and OpenGrok can handle it (with only minor tweaks, if any). The newer versions (1.1+) are getting even better. In fact, I've indexed our company's network drives (for fun), which were multiple TBs, just to see if it would work (and it did).

My only question for this person would be: what are the sizes and types of the files he's indexing? I have noticed issues with large XML files and some other types (like vdx), but I feel like that's a different conversation.

Lokendra-Saini commented 5 years ago

@vladak - Sorry for replying late. The indexing completed with 1.0 after all. The setting was JAVA_OPTS="${JAVA_OPTS:--Xmx16384m -d64 -server}", and I was allowed to use 32 GB of RAM on the machine. So my mistake was to allow the Java heap to grow up to 32 GB when the machine only had 32 GB of RAM. One should set the max Java heap size to less than the system's RAM.
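In wrapper-script terms, the working setting amounts to something like this (a sketch based only on the values mentioned above; how much headroom is needed depends on what else runs on the machine):

# leave headroom below the machine's 32 GB of physical RAM
JAVA_OPTS="${JAVA_OPTS:--Xmx16384m -d64 -server}"
export JAVA_OPTS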

Lokendra-Saini commented 5 years ago

@jvaneck - I have many questions for you; hope you can find some time for me :-).

  1. What were your changes to index 450 GB repo?
  2. What was the configuration of the system?
  3. How long did it take to index the data?
Shooter3k commented 5 years ago

Which version do you want me to base my answers on?

Lokendra-Saini commented 5 years ago

It would be great if you could answer for version 1.0.

Shooter3k commented 5 years ago

Version 1.0 is better than 0.x versions but not as good as 1.1rc's.

My logs are showing that 1.1rc38 indexed roughly 3 TB in 2 days 6 hours, although I'm not 100% sure that is correct.

1.0 and especially 0.x versions usually took at least a month (usually multiple months) or more to index 250GB because of the number of times it would stop/fail and I would have to start it again.

Ironically, I've moved a lot of this to run 100% off the network so that I can have multiple machines working on the different processes. I realize this is a very poor implementation; it was mostly done as a stepping stone due to the amount of time I can spend on this (I do it in my spare time at work).

I have played with many different options but the thing I have found to work the best is to not try to index everything at once. Getting the index to create with a small subset and then adding to the index incrementally seems to work best. I also pay close attention to the logs and ignore files that cause it to error out.

Here is an example of one of my initial index-create scripts (for Windows). The specs of the Windows machine are 16 GB of RAM and an Intel Xeon E5-2690 @ 2.90 GHz (3 processors). I typically run absolutely nothing but one index for the initial "create". I then run this script multiple times until it completes, without adding more files. Then I add more files to the source path and repeat the process.

It's also worth pointing out that the corporate LAN I'm running this over can transfer up to 1 Gbps (when working well) and will often hit that when transferring large files.

REM Initial index-create pass: G:\ is the source root, Y:\grok-g holds the data,
REM config and logs, 16 indexer threads (-T), anything matching *test* is ignored (-i).
java -Djava.util.logging.config.file=Y:\grok-g\logging.properties -jar Y:\grok-g\bin\opengrok.jar ^
-S ^
-s G:\ ^
-d Y:\grok-g\data ^
-W Y:\grok-g\etc\configuration.xml ^
-U localhost:2430 ^
-G ^
-T 16 ^
-z 1 ^
-i *test* ^
-r on ^
-c Y:\grok-g\bin\ctags.exe ^
-P ^
-O on ^
-a on ^
-w /g
tarzanek commented 5 years ago

We tend to ignore problematic files (there are ignore options for the indexer, on file or directory regexps), but big files or files with long tokens used to be a problem in old OpenGrok versions; we tried to improve the analyzers to chew on anything. I think the XML analyzer is not included in https://github.com/oracle/opengrok/blob/master/opengrok-indexer/pom.xml#L262 and I am seriously tempted to include it there; then the OOM issues should stop, since we will limit the tokens to 32k. (I think I also had more language analyzers there; perhaps we can do it for all analyzers. The false positives hit by this limit will be a sacrifice we can live with; after all, Solr/Lucene do it by default anyway.)

vladak commented 5 years ago

+1 for including the XML analyzer

Shooter3k commented 5 years ago

+1 for doing anything to fix the OOM issues. Even if the indexer just did something like automatically adding files that cause it to fail to an 'ignore list', that would be nice.