oracle/opengrok: OpenGrok is a fast and usable source code search and cross reference engine, written in Java.
http://oracle.github.io/opengrok/

Need a way to limit the size of files processed by the indexer (Bugzilla #19176) #534

Open · vladak opened this issue 11 years ago

vladak commented 11 years ago

Status: NEW · Severity: enhancement · Component: indexer
Reported in version: unspecified · Platform: ANY/Generic
Assigned to: Trond Norbye

On 2012-02-15 13:52:01 +0000, Vladimir Kotal wrote:

Recent reindexing with 0.11 revealed that the indexer cannot cope with large files and just blows up (JAVA_OPTS is at its default, set to 2 GB):

2012-02-15 14:30:53.572+0100 INFO t15 DefaultIndexChangedListener.fileAdd: Add: /foo.cpio (PlainAnalyzer)
2012-02-15 14:31:43.178+0100 SEVERE t15 IndexDatabase$1.run: Problem updating lucene index database:
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at org.opensolaris.opengrok.analysis.plain.PlainAnalyzer.analyze(PlainAnalyzer.java:77)
    at org.opensolaris.opengrok.analysis.TextAnalyzer.analyze(TextAnalyzer.java:60)
    at org.opensolaris.opengrok.analysis.AnalyzerGuru.getDocument(AnalyzerGuru.java:262)
    at org.opensolaris.opengrok.index.IndexDatabase.addFile(IndexDatabase.java:584)
    at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:814)
    at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:787)
    at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:787)
    at org.opensolaris.opengrok.index.IndexDatabase.update(IndexDatabase.java:354)
    at org.opensolaris.opengrok.index.IndexDatabase$1.run(IndexDatabase.java:158)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

2012-02-15 14:31:43.194+0100 INFO t10 Indexer.sendToConfigHost: Send configuration to: localhost:2424
2012-02-15 14:31:44.488+0100 INFO t10 Indexer.sendToConfigHost: Configuration update routine done, check log output for errors.

$ du -sh /foo.cpio
311M    /foo.cpio

There should be an option that lets us tell the indexer to ignore files larger than a given number of bytes (similar to the -i option for file names).

On 2012-02-15 13:54:37 +0000, Vladimir Kotal wrote:

Maybe there should even be some sane default, like 100 MB.
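
A minimal sketch of what such a check could look like, assuming a single global byte limit applied during the index walk; the class, the accept() hook, and the 100 MB default are illustrations drawn from the comments above, not OpenGrok's actual API:

```java
import java.io.File;

// Hypothetical size filter; OpenGrok would call something like this
// before handing a file to an analyzer.
class SizeFilter {
    // Proposed tunable (analogous to -i for file names); the 100 MB
    // default is the "sane default" suggested above, an assumption.
    static final long MAX_FILE_SIZE = 100L * 1024 * 1024;

    static boolean accept(File file) {
        // Skip oversized files instead of letting an analyzer buffer
        // them whole and exhaust the heap.
        return !(file.isFile() && file.length() > MAX_FILE_SIZE);
    }
}
```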

On 2012-02-16 12:26:12 +0000, Knut Anders Hatlen wrote:

The analyzers don't really need to read the entire file into memory; they could also operate on streams. The reason they do read the file into memory, I think, is to avoid reading every file twice (once to add it to the Lucene indexes, and once to build the xref). I'm not sure how important this optimization is (we should run some experiments to see).
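
For illustration, a rough sketch of the streaming approach described here: instead of growing one big in-memory buffer (the Arrays.copyOf in the stack trace above), the analyzer consumes a Reader through a fixed-size buffer, so memory use stays bounded regardless of file size. The class and method names are hypothetical, not OpenGrok's analyzer API:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

class StreamingAnalyzerSketch {
    // Process the input incrementally; heap usage is bounded by the
    // buffer size, not by the size of the file.
    void analyze(Reader in) throws IOException {
        char[] buf = new char[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            // tokenize buf[0..n) incrementally here
        }
    }

    void index(String path) throws IOException {
        try (Reader r = new BufferedReader(new FileReader(path))) {
            analyze(r);
        }
    }
}
```

The trade-off mentioned above still applies: with a plain stream, the file would be read once per consumer (the Lucene pass and the xref pass) unless the two passes share the data.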

vladak commented 9 years ago

Even a 100 MB limit is not low enough in some cases; e.g. a 48 MB XHTML file can cause the indexer to run out of heap (issue #907).

vladak commented 9 years ago

Thinking about this some more, maybe the limits should be smarter, since some analyzers are more susceptible to big files than others; i.e. allow limits based on file type (if possible).
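
A sketch of how per-file-type limits might look, assuming a simple type-to-limit map with a global fallback; the type keys and byte values are made-up examples (the lower markup cap is motivated by the XHTML case from issue #907):

```java
import java.util.Map;

class PerTypeLimits {
    // Global fallback, an assumed default.
    static final long DEFAULT_LIMIT = 100L * 1024 * 1024;

    // Hypothetical per-type caps; markup analyzers tend to blow up on
    // smaller inputs than plain text, so they get a lower limit.
    static final Map<String, Long> LIMITS = Map.of(
            "xml",   16L * 1024 * 1024,
            "plain", 64L * 1024 * 1024);

    static long limitFor(String fileType) {
        return LIMITS.getOrDefault(fileType, DEFAULT_LIMIT);
    }
}
```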

vladak commented 7 years ago

What to do with files that were indexed and then grew above the threshold? Or when the threshold (assuming it will be tunable) is lowered so that previously indexed files are no longer eligible? It seems to me that the correct solution would be to delete their information from the index.
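
As a sketch of that cleanup, assuming Lucene's IndexWriter and a unique-path field (the field name "u" is an assumption about the index schema, not confirmed here):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class StaleEntryCleanup {
    // If a previously indexed file now exceeds the (possibly lowered)
    // threshold, drop its documents from the Lucene index so stale
    // data does not linger.
    static void removeIfOversize(IndexWriter writer, String path,
                                 long size, long limit) throws IOException {
        if (size > limit) {
            writer.deleteDocuments(new Term("u", path));
        }
    }
}
```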