Issue with indexing large .h file

Lokendra-Saini commented 5 years ago

Hello all,

I am trying to index a single .h file of around 1 GB with OpenGrok 1.0. Heap usage reaches up to 26 GB for that file alone.

The relevant environment is:

Java - jdk1.8.0_51
Ant - apache-ant-1.9.0
Ctags - ctags-5.8

My changes to the OpenGrok script:

OPENGROK_GENERATE_HISTORY=off
OPENGROK_SCAN_REPOS=false
OPENGROK_VERBOSE=true
OPENGROK_FLUSH_RAM_BUFFER_SIZE="-m 256"

export OPENGROK_GENERATE_HISTORY
export OPENGROK_SCAN_REPOS
export OPENGROK_VERBOSE
export OPENGROK_FLUSH_RAM_BUFFER_SIZE
JAVA_OPTS="${JAVA_OPTS:--Xmx<value mentioned below> -d64 -server}"

I tried several configurations, as shown in the attached image. The questions this raises are:

Why does a single 1 GB file require around 26 GB of heap space?
Why were .txt files indexed easily, but not the .h file?
Are increasing the RAM and heap size the only workarounds? (This will only help up to a certain point.)
Has anyone tried indexing a file of around 1 GB?
What should be done if someone has a source tree of around 1 TB that contains some large files (e.g. 500 MB, 1 GB)?

Thanks in advance!!

[Attached image: stats]

vladak commented 5 years ago

A 1 GB header file (assuming the contents are in the C language) is very strange indeed.

In general it would be interesting to find out how much the bulk of a source file (and this could mean many things: file size, number of terms, complexity of the expressions therein, etc.) contributes to heap size. Even more interesting would be to get that captured as a sequence with stack traces.

Could you elaborate on what is inside the file? If you run Universal ctags on the file, how many lines does the resulting tags file contain?
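One way to get that line count, as a minimal sketch: a tags file usually starts with a few pseudo-tag header lines beginning with `!_TAG_`, so counting only the remaining lines gives the number of actual tag entries. The helper name `count_tags` and the file names are made up for illustration.

```shell
# count_tags: hypothetical helper; prints the number of real tag entries
# in a ctags output file, skipping the "!_TAG_" pseudo-tag header lines.
count_tags() {
    grep -vc '^!_TAG_' "$1"
}

# Example usage (assumes ctags is installed; file names are placeholders):
#   ctags -f big.tags big.h
#   count_tags big.tags
```

Comparing this number with the number of #define lines in the header shows whether ctags emits roughly one tag per define.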

Also, one thing you can try is reducing the file somehow (halving it, removing portions that seem complex, etc.) to see if a particular section of the file contributes the majority of the heap growth.
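The halving step could be sketched roughly like this. The helper name `halve_file` and the `.part1`/`.part2` suffixes are assumptions for illustration, not anything OpenGrok provides. Note that a plain split by line count separates the #ifndef guard from its #endif, which may or may not matter for the experiment.

```shell
# halve_file: hypothetical helper; splits a file into two halves by line
# count, writing <file>.part1 and <file>.part2 next to the original.
halve_file() {
    src=$1
    total=$(wc -l < "$src")
    half=$(( (total + 1) / 2 ))                  # first half gets the extra line
    head -n "$half" "$src" > "$src.part1"
    tail -n +"$(( half + 1 ))" "$src" > "$src.part2"
}

# Example usage (file name is a placeholder):
#   halve_file big.h
```

Indexing each half separately and comparing peak heap usage, then repeating on the heavier half, narrows the search in a bisection-like fashion.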

vladak commented 5 years ago

Also, try the latest 1.1 RC; I am curious whether the table above changes with that.

Lokendra-Saini commented 5 years ago

@vladak - The actual source file is 500 MB and contains only #define statements (around 5,000,000). I created the 1 GB file to gauge resource requirements.

I ran ctags on that file; it generated a tags file of 5,000,007 lines, 380 MB in size.
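To put these numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes 1 MB = 10^6 bytes and 1 GB = 10^9 bytes, and that the 1 GB file holds about 10,000,000 defines (the generator script in this thread notes that this count yields a 1.1 GB file). The helper name `per_item` is made up.

```shell
# per_item: hypothetical helper; prints total_bytes / item_count
# rounded to one decimal place.
per_item() {
    LC_ALL=C awk -v b="$1" -v n="$2" 'BEGIN { printf "%.1f\n", b / n }'
}

# Tags file: 380 MB over 5,000,007 lines
per_item 380000000 5000007        # prints 76.0 (bytes per tag line)

# Heap: 26 GB over an assumed 10,000,000 defines
per_item 26000000000 10000000     # prints 2600.0 (bytes of heap per define)
```

Under those assumptions, each define costs roughly 2.6 KB of heap while its tag line is under 100 bytes.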

vladak commented 5 years ago

Okay, the memory growth seems rather disproportionate considering these are just defines. Out of curiosity, what do these defines look like? I've seen some long header files but this one is humongous. It would be nice if you could write a script that produces a file with similar contents so that we have an externally reproducible test case. Anyhow, it will certainly cause some hiccups when rendered in the browser.

Anyway, this should be analyzed with a memory profiler, and/or by getting a JVM heap dump when it runs out of heap and examining the structure of the heap with the Memory Analyzer tool from Eclipse.
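A minimal sketch of how such a dump could be requested, assuming the OpenGrok script picks up JAVA_OPTS from the environment as in the snippet earlier in this thread. The two -XX flags are standard HotSpot options; the heap size and dump path are placeholder values.

```shell
# Placeholder values: adjust -Xmx and the dump location to your machine.
# -XX:+HeapDumpOnOutOfMemoryError writes an .hprof file when the JVM
# throws OutOfMemoryError; -XX:HeapDumpPath sets where it is written.
JAVA_OPTS="-Xmx8g -server -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/opengrok-heap.hprof"
export JAVA_OPTS
```

The resulting .hprof file can then be opened in the Eclipse Memory Analyzer. The dump is about as large as the heap in use, so the target directory needs enough free space.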

Lokendra-Saini commented 5 years ago

The following Perl script can produce a similar file.

use strict;
use warnings;

#my $count = 10000000;    # creates 1.1GB file
my $count = 5000000;    # creates 524MB file

my $file = shift @ARGV;
die "supply a filename" if ( ! $file );

open(my $FH, '>', $file) or die "$!";

print $FH "#ifndef GENERATE_H\n";
print $FH "#define GENERATE_H\n";

while ( $count ) { 
    print $FH "#define RANDOM_VAR_NAME$count                              0xFFFFFFFF /* =4294967295 */\n";
    $count--;
}

print $FH "#endif // GENERATE_H\n";    # end the file with a newline
close $FH;

vladak commented 5 years ago

Nice. This could be an interesting exploration for someone.

edigaryev commented 5 years ago

Have you tried using the latest Universal Ctags?

See #2364.

oracle / opengrok

Issue with indexing large .h file #2531