oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/

~48 MB XHTML file causes XMLAnalyzer to explode #907

Open vladak opened 9 years ago

vladak commented 9 years ago

Running the indexer on the Solaris Userland consolidation ended prematurely with:

2015-03-09 12:15:39.655+0100 INFO t24 DefaultIndexChangedListener.fileRemove: Remove file:/userland-default-prepped/components/imagemagick/ImageMagick-6.8.3/magick/module.c
2015-03-09 12:15:39.666+0100 INFO t24 DefaultIndexChangedListener.fileRemove: Remove file:/userland-default-prepped/components/imagemagick/ImageMagick-6.8.3/magick/module.c
2015-03-09 12:15:39.676+0100 INFO t24 DefaultIndexChangedListener.fileRemove: Remove file:/userland-default-prepped/components/imagemagick/ImageMagick-6.8.3/magick/module.c
2015-03-09 12:15:39.738+0100 INFO t24 DefaultIndexChangedListener.fileAdd: Add: /userland-default-prepped/components/imagemagick/ImageMagick-6.8.3/www/api/colorspace.html (XMLAnalyzer)
2015-03-09 12:15:47.038+0100 SEVERE t24 Indexer$2.run: An error occured while updating index
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:2367)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
        at java.lang.StringBuilder.append(StringBuilder.java:143)
        at java.lang.StringBuilder.append(StringBuilder.java:183)
        at org.opensolaris.opengrok.web.Util.breadcrumbPath(Util.java:278)
        at org.opensolaris.opengrok.web.Util.breadcrumbPath(Util.java:220)
        at org.opensolaris.opengrok.analysis.plain.XMLXref.yylex(XMLXref.java:723)
        at org.opensolaris.opengrok.analysis.JFlexXref.write(JFlexXref.java:229)
        at org.opensolaris.opengrok.analysis.plain.XMLAnalyzer.writeXref(XMLAnalyzer.java:74)
        at org.opensolaris.opengrok.analysis.plain.XMLAnalyzer.analyze(XMLAnalyzer.java:60)
        at org.opensolaris.opengrok.analysis.AnalyzerGuru.populateDocument(AnalyzerGuru.java:307)
        at org.opensolaris.opengrok.index.IndexDatabase.addFile(IndexDatabase.java:606)
        at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:870)
        at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:835)
        at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:835)
        at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:835)
        at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:835)
        at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:835)
        at org.opensolaris.opengrok.index.IndexDatabase.update(IndexDatabase.java:383)
        at org.opensolaris.opengrok.index.Indexer$2.run(Indexer.java:846)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2015-03-09 12:15:47.039+0100 INFO t1 Statistics.report: Done indexing data of all repositories (took 0:02:06)
2015-03-09 12:15:47.039+0100 INFO t1 Statistics.report: Total time: 0:02:07
2015-03-09 12:15:47.330+0100 INFO t1 Statistics.report: Final Memory: 6M/5,400M

The file in question is a ~48 MB XHTML file which looks like this (octal character dump):

0000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
0000020   .   0   "       e   n   c   o   d   i   n   g   =   "   U   T
0000040   F   -   8   "   ?   >  \n   <   !   D   O   C   T   Y   P   E
0000060       h   t   m   l       P   U   B   L   I   C       "   -   /
0000100   /   W   3   C   /   /   D   T   D       X   H   T   M   L    
0000120   1   .   0       S   t   r   i   c   t   /   /   E   N   "    
0000140   "   h   t   t   p   :   /   /   w   w   w   .   w   3   .   o
0000160   r   g   /   T   R   /   x   h   t   m   l   1   /   D   T   D
0000200   /   x   h   t   m   l   1   -   s   t   r   i   c   t   .   d
0000220   t   d   "   >  \n   <   h   t   m   l       x   m   l   n   s
0000240   =   "   h   t   t   p   :   /   /   w   w   w   .   w   3   .
0000260   o   r   g   /   1   9   9   9   /   x   h   t   m   l   "    
0000300   x   m   l   :   l   a   n   g   =   "   e   n   "       l   a
0000320   n   g   =   "   e   n   "       d   i   r   =   "   l   t   r
0000340   "   >  \n   <   h   e   a   d   >  \n           <   s   t   y
0000360   l   e       t   y   p   e   =   "   t   e   x   t   /   c   s
0000400   s   "       m   e   d   i   a   =   "   s   c   r   e   e   n
0000420   ,   p   r   o   j   e   c   t   i   o   n   "   >   <   !   -
0000440   -  \n                   @   i   m   p   o   r   t       u   r
0000460   l   (   "   .   .   /   .   .   /   .   .   /   .   .   /   .
0000500   .   /   .   .   /   .   .   /   .   .   /   .   .   /   .   .
0000520   /   .   .   /   .   .   /   .   .   /   .   .   /   .   .   /

...

300021220   .   /   .   .   /   .   .   /   .   .   /   .   .   /   .   .
300021240   /   .   .   /   .   .   /   .   .   /   w   w   w   /   m   a
300021260   i   l   i   n   g   -   l   i   s   t   .   h   t   m   l   "
300021300   >   M   a   i   l   i   n   g       L   i   s   t   s   <   /
300021320   a   >       &   b   u   l   l   ;  \n                        
300021340   <   a       h   r   e   f   =   "   h   t   t   p   :   /   /
300021360   c   a   f   e   .   i   m   a   g   e   m   a   g   i   c   k
300021400   .   o   r   g   "       t   a   r   g   e   t   =   "   5   0
300021420   4   8   2   3   4   7   0   "   >   C   a   f   e   <   /   a
300021440   >       &   b   u   l   l   ;  \n                   <   a    
300021460   h   r   e   f   =   "   h   t   t   p   :   /   /   s   t   u
300021500   d   i   o   .   w   e   b   b   y   l   a   n   d   .   c   o
300021520   m   /   I   m   a   g   e   M   a   g   i   c   k   /   M   a
300021540   g   i   c   k   S   t   u   d   i   o   /   s   c   r   i   p
300021560   t   s   /   M   a   g   i   c   k   S   t   u   d   i   o   .
300021600   c   g   i   "       t   a   r   g   e   t   =   "   9   3   1
300021620   0   9   7   5   3   1   "   >   S   t   u   d   i   o   <   /
300021640   a   >  \n                   <   /   s   p   a   n   >  \n    
300021660       <   /   d   i   v   >  \n           <   d   i   v       i
300021700   d   =   "   f   o   o   t   e   r   "   >  \n                
300021720   <   s   p   a   n       i   d   =   "   f   o   o   t   e   r
300021740   -   w   e   s   t   "   >   &   c   o   p   y   ;       1   9
300021760   9   9   -   2   0   1   0       I   m   a   g   e   M   a   g
300022000   i   c   k       S   t   u   d   i   o       L   L   C   <   /
300022020   s   p   a   n   >  \n           <   /   d   i   v   >  \n    
300022040       <   d   i   v       s   t   y   l   e   =   "   c   l   e
300022060   a   r   :       b   o   t   h   ;       m   a   r   g   i   n
300022100   :       0   ;       w   i   d   t   h   :       1   0   0   %
300022120   ;       "   >   <   /   d   i   v   >  \n   <   /   b   o   d
300022140   y   >  \n   <   /   h   t   m   l   >  \n
vladak commented 9 years ago

According to @tarzanek the file itself is invalid (http://www.imagemagick.org/discourse-server/viewtopic.php?t=22851 reports "Links in www/api Documentation bloated with 16384 '../'"); however, the analyzer should not be that sensitive to bad input.

vladak commented 9 years ago

The workaround is to add this file to the ignored list (the -i option or IGNORE_PATTERNS environment variable used by the OpenGrok script).

vladak commented 9 years ago

For completeness, the indexer was running with these settings:

ncpus=`/sbin/psrinfo | grep on-line | wc -l`
nthr=`expr $ncpus \* 2`
# more efficient in case multiple projects wait on renamed files processing
# see https://github.com/OpenGrok/OpenGrok/pull/752 for details
tunables="-Dorg.opensolaris.opengrok.history.NumCacheRenamedThreads=$nthr"
tunables="$tunables -Dorg.opensolaris.opengrok.history.RenamedHandlingEnabled=1"
# for userland prepped repositories
tunables="$tunables -Dorg.opensolaris.opengrok.history.noFetchWhenNotInCache=1"

# https://github.com/OpenGrok/OpenGrok/issues/718
JAVA_OPTS="-d64 -XX:-UseGCOverheadLimit -Xmx8192m -server $tunables"

# speed up indexing by tuning Lucene memory buffer size
OPENGROK_FLUSH_RAM_BUFFER_SIZE="-m 256"
tarzanek commented 9 years ago

So this is actually a bug in the compact handling, and the reason we don't see it in other analyzers is that their path patterns differ. JAVA:

File = [a-zA-Z]{FNameChar}* "." ("java"|"properties"|"props"|"xml"|"conf"|"txt"|"htm"|"html"|"ini"|"jnlp"|"jad"|"diff"|"patch")
Path = "/"? [a-zA-Z]{FNameChar}* ("/" [a-zA-Z]{FNameChar}*[a-zA-Z0-9])+

XML:

File = {FNameChar}+ "." ([a-zA-Z]+) {FNameChar}*
Path = "/"? {FNameChar}+ ("/" {FNameChar}+)+[a-zA-Z0-9]

SH:

Path = "/"? [a-zA-Z]{FNameChar}* ("/" [a-zA-Z]{FNameChar}*)+[a-zA-Z0-9]

So all analyzers have broken path detection. If it worked, we could get better links where appropriate.

Test on paths such as:

../../../java.bah
../ffaa/foobar
../foobar.i
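
To try the three Path rules against these sample paths outside of JFlex, here is a rough sketch using java.util.regex. It rests on assumptions: FNameChar is approximated as [a-zA-Z0-9_.-] (the real macro definition is not quoted in this thread), the class name PathPatternDemo is made up, and a whole-string regex match only approximates JFlex's longest-match tokenization.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenGrok.
final class PathPatternDemo {
    // ASSUMPTION: stand-in for the FNameChar macro; the real definition may differ.
    static final String F = "[a-zA-Z0-9_.\\-]";

    // Regex translations of the Path rules quoted above.
    static final Pattern JAVA_PATH =
        Pattern.compile("/?[a-zA-Z]" + F + "*(/[a-zA-Z]" + F + "*[a-zA-Z0-9])+");
    static final Pattern XML_PATH =
        Pattern.compile("/?" + F + "+(/" + F + "+)+[a-zA-Z0-9]");
    static final Pattern SH_PATH =
        Pattern.compile("/?[a-zA-Z]" + F + "*(/[a-zA-Z]" + F + "*)+[a-zA-Z0-9]");

    public static void main(String[] args) {
        String[] samples = {"../../../java.bah", "../ffaa/foobar", "../foobar.i"};
        for (String s : samples) {
            // matches() requires the whole string to be a single Path token.
            System.out.printf("%-20s java=%b xml=%b sh=%b%n", s,
                JAVA_PATH.matcher(s).matches(),
                XML_PATH.matcher(s).matches(),
                SH_PATH.matcher(s).matches());
        }
    }
}
```

Under that assumption only the XML rule accepts a leading "..", because the JAVA and SH rules require every component to start with a letter. That would explain why only XMLAnalyzer hands the huge "../" chain to breadcrumbPath.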

Funnily enough, we would then be able to see why #806 was not fully fixed: that code never worked, and when compact is true the output is eaten if the path is prefixed by "../".

So breadcrumbPath needs some serious fixes and tests with deeper paths (regardless of whether compact is true or false).
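
As a starting point for such deep-path tests, here is a minimal sketch of why the output can blow up. naiveBreadcrumb is a hypothetical stand-in written for this comment, NOT OpenGrok's Util.breadcrumbPath; it assumes a breadcrumb emits one link per path component and that each link repeats the whole prefix up to that component. The "/xref/" prefix and the x.html file name are made up too.

```java
// Hypothetical demo, not OpenGrok code.
final class BreadcrumbGrowthDemo {
    // Emits <a href="PREFIX+path-so-far">component</a> for every component.
    static String naiveBreadcrumb(String urlPrefix, String path) {
        String[] parts = path.split("/");
        StringBuilder out = new StringBuilder();
        StringBuilder soFar = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                soFar.append('/');
                out.append('/');
            }
            soFar.append(parts[i]);
            // The href repeats everything seen so far, so link i costs O(i) chars.
            out.append("<a href=\"").append(urlPrefix).append(soFar).append("\">")
               .append(parts[i]).append("</a>");
        }
        return out.toString();
    }

    // Builds "../" repeated depth times followed by a file name.
    static String dotDotPath(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("../");
        }
        return sb.append("x.html").toString();
    }

    public static void main(String[] args) {
        for (int depth : new int[] {10, 100, 1000}) {
            String path = dotDotPath(depth);
            System.out.printf("depth=%d input=%d chars, output=%d chars%n",
                depth, path.length(), naiveBreadcrumb("/xref/", path).length());
        }
    }
}
```

With this scheme the output grows roughly with the square of the depth, so a test suite should include paths far deeper than anything a real tree contains.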

tarzanek commented 9 years ago

Also, just to be on the safe side: Integer.MAX_VALUE - 5 is roughly the maximum size of an array (the exact limit is VM-specific), so we should probably check that the input data won't exceed it (though once breadcrumbPath(Util.java:278) is fixed I doubt we will ever hit the limit, at least until OSes support LOOOONG paths with zillions of subdirs).
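
A rough sketch of what such an input check could look like. Everything here is an assumption for illustration: the class and method names are made up, the cap of Integer.MAX_VALUE - 8 is just a conservative guess below the VM-specific limit, and the bound assumes each generated link repeats at most the full path plus a fixed amount of markup.

```java
// Hypothetical guard, not OpenGrok code.
final class BreadcrumbSizeGuard {
    // ASSUMPTION: conservative cap; the real array limit depends on the VM.
    static final long MAX_CHARS = Integer.MAX_VALUE - 8L;

    // Upper bound: one link per component, each at most path + markup long.
    // long arithmetic with exact ops avoids silent int overflow.
    static long worstCaseChars(long pathLength, long components, long perLinkOverhead) {
        return Math.multiplyExact(components, Math.addExact(pathLength, perLinkOverhead));
    }

    // True if a breadcrumb for this path is guaranteed to fit under the cap.
    static boolean isSafe(String path, int perLinkOverhead) {
        long components = path.chars().filter(c -> c == '/').count() + 1;
        try {
            return worstCaseChars(path.length(), components, perLinkOverhead) <= MAX_CHARS;
        } catch (ArithmeticException overflow) {
            return false; // bound does not even fit in a long
        }
    }

    public static void main(String[] args) {
        StringBuilder deep = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            deep.append("../");
        }
        deep.append("x.html");
        System.out.println("short path safe: " + isSafe("a/b/c.html", 32));
        System.out.println("deep path safe:  " + isSafe(deep.toString(), 32));
    }
}
```

A caller could skip link generation, or fall back to emitting the path as plain text, whenever isSafe returns false. The bound is deliberately pessimistic, so it may reject some paths that would in fact fit.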

tarzanek commented 7 years ago

I guess such files will have to be ignored for now.