oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/

OpenGrok Indexer. FileNotFoundException while indexing directories #4528

Closed tarangchikhalia closed 4 months ago

tarangchikhalia commented 5 months ago

The OpenGrok indexer is throwing FileNotFoundException on some directories while indexing.

Jan 10, 2024 9:05:41 AM org.opengrok.indexer.index.IndexDatabase lambda$indexParallel$4
WARNING: ERROR addFile(): '/var/opt/opengrok/<dir_path>'
java.io.FileNotFoundException: /var/opt/opengrok/<dir_path> (Is a directory)
        at java.base/java.io.FileInputStream.open0(Native Method)
        at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
        at org.opengrok.indexer.index.IndexDatabase.getAnalyzerFor(IndexDatabase.java:1217)
        at org.opengrok.indexer.index.IndexDatabase.addFile(IndexDatabase.java:1129)
        at org.opengrok.indexer.index.IndexDatabase.lambda$indexParallel$4(IndexDatabase.java:1781)
        at java.base/java.util.stream.Collectors.lambda$groupingByConcurrent$59(Collectors.java:1304)
        at java.base/java.util.stream.ReferencePipeline.lambda$collect$1(ReferencePipeline.java:575)
        at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
        at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
        at java.base/java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
        at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

OpenGrok version: 1.12.12 Tomcat: 10.1.x JDK: 11 OS: Oracle Linux 8.8
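For readers unfamiliar with this message: on Linux, the JDK reports an attempt to open a directory with FileInputStream as a FileNotFoundException whose message ends in "(Is a directory)", which is what the top three frames of the trace show. The following is a minimal sketch that reproduces the symptom outside OpenGrok. The class and method names are made up for illustration, and the exact message text is platform dependent.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical demo class, not part of OpenGrok.
final class DirectoryOpenDemo {

    /**
     * Tries to open the given path for reading.
     * Returns null if it opens, or the exception message if the JDK refuses.
     */
    static String tryOpen(File path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            return null;
        } catch (FileNotFoundException e) {
            // For a directory the JDK throws FileNotFoundException even though the path exists.
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("opengrok-demo").toFile();
        dir.deleteOnExit();
        // Prints the platform-specific message for opening a directory as a file.
        System.out.println(tryOpen(dir));
    }
}
```

So the exception itself only says that a directory path reached code that expects a regular file. It does not say how the path got there.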

vladak commented 5 months ago

How is the indexer run? Was this an initial or an incremental reindex? Is the directory in question part of a repository?

tarangchikhalia commented 5 months ago

This is an incremental reindex. The directory is part of a repository that is copied from the remote server to the OpenGrok server (no SCM), but I have also seen this error in many Git repositories.

vladak commented 5 months ago

Can you raise the indexer log level to FINER or higher and post the logs around the entries such as the one that starts with "Starting file collection", for a case that hits the directory problem? That line and any subsequent lines that contain DefaultIndexChangedListener would help.
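The level names here (FINER, FINEST) are java.util.logging levels. In practice the indexer's logging is usually adjusted through a logging properties file passed to the JVM. The sketch below only shows the programmatic equivalent, to illustrate that both the logger and its handler have to be lowered before FINER records appear. The class name and the choice of a ConsoleHandler are assumptions for illustration. The logger name is taken from the package seen in the log lines above.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of OpenGrok.
final class LogLevelDemo {

    /** Lowers the threshold of the named logger and attaches a console handler at the same level. */
    static Logger raise(String loggerName, Level level) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setLevel(level);

        // A handler has its own threshold (INFO by default for ConsoleHandler),
        // so it must be lowered too or the finer records are dropped.
        Handler handler = new ConsoleHandler();
        handler.setLevel(level);
        logger.addHandler(handler);

        // Avoid duplicate output through the root logger's handlers.
        logger.setUseParentHandlers(false);
        return logger;
    }

    public static void main(String[] args) {
        Logger log = raise("org.opengrok.indexer.index", Level.FINER);
        log.finer("this record is emitted");
        log.finest("this record is still filtered out");
    }
}
```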

tarangchikhalia commented 5 months ago

Here are the logs with the level set to FINEST.

Jan 11, 2024 3:21:46 PM org.opengrok.indexer.index.IndexDatabase logIgnoredUid
FINEST: ignoring deleted document for '/<project>/version.json' at 20240106111117766                                                                                                                          
Jan 11, 2024 3:21:46 PM org.opengrok.indexer.index.DefaultIndexChangedListener fileRemove                                                                                                                                           
FINE: Remove: '/<project>/version.json'                                                                                                                                                                       
Jan 11, 2024 3:21:46 PM org.opengrok.indexer.index.DefaultIndexChangedListener fileRemoved                                                                                                                                          
FINER: Removed: '/<project>/version.json'                                                                                                                                                                     
Jan 11, 2024 3:21:46 PM org.opengrok.indexer.util.Statistics logIt                                                                                                                                                                  
INFO: Done file collection for directory '/<project>' (took 15 ms)                                                                                                                                            
Jan 11, 2024 3:21:46 PM org.opengrok.indexer.index.IndexDatabase update                                                                                                                                                             
INFO: Starting indexing of directory '/<project>'                                                                                                                                                             
Jan 11, 2024 3:21:46 PM org.opengrok.indexer.index.IndexDatabase lambda$indexParallel$4                                                                                                                                             
WARNING: ERROR addFile(): '/var/opt/opengrok/<dir_path>'                                                                                                                               
java.io.FileNotFoundException: /var/opt/opengrok/<dir_path> (Is a directory)                                                                                                           
        at java.base/java.io.FileInputStream.open0(Native Method)                                                                                                                                                                   
        at java.base/java.io.FileInputStream.open(FileInputStream.java:219)                                                                                                                                                         
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)                                                                                                                                                       
        at org.opengrok.indexer.index.IndexDatabase.getAnalyzerFor(IndexDatabase.java:1217)                                                                                                                                         
        at org.opengrok.indexer.index.IndexDatabase.addFile(IndexDatabase.java:1129)                                                                                                                                                
        at org.opengrok.indexer.index.IndexDatabase.lambda$indexParallel$4(IndexDatabase.java:1781)                                                                                                                                 
        at java.base/java.util.stream.Collectors.lambda$groupingByConcurrent$59(Collectors.java:1304)                                                                                                                               
        at java.base/java.util.stream.ReferencePipeline.lambda$collect$1(ReferencePipeline.java:575)                                                                                                                                
        at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)                                                                                                                                        
        at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)                                                                                                                                 
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)                                                                                                                                          
        at java.base/java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)                                                                                                                                           
        at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)                                                                                                                                          
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)                                                                                                                                                
        at java.base/java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:408)                                                                                                                                              
        at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:736)                                                                                                                                                
        at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)                                                                                                                                    
        at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)                                                                                                                              
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
        at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
        at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:661)
        at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:575)
        at org.opengrok.indexer.index.IndexDatabase.lambda$indexParallel$5(IndexDatabase.java:1770)
        at java.base/java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1448)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

Jan 11, 2024 3:21:46 PM org.opengrok.indexer.index.IndexDatabase lambda$indexParallel$4
vladak commented 5 months ago

Can you also provide the line that contains "Starting file collection"?

vladak commented 5 months ago

I went through the related code in IndexDatabase, and for the initial reindex I don't see how an entry in IndexDownArgs could correspond to a directory. The recursive indexDown() function, which is executed when reindexing from scratch (or when history-based reindex is off for some reason), traverses the directory tree like this: https://github.com/oracle/opengrok/blob/b2383942c7ea3e938f62f66521a19ce61293b0a5/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java#L1629-L1641

The accept() call detects any allowed symlinks. isDirectory() follows symlinks, so even if the file is a forbidden symlink, it will still be processed in the else branch as a directory, i.e. indexDown() will recursively descend into that directory. Within this code path, IndexDownArgs is modified only in the processFile() method, and that method is always called for non-directory entries.
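The shape of that argument can be sketched as follows. This is a simplified stand-in, not the linked OpenGrok code: the class, the collect() method and the plain list used in place of IndexDownArgs are hypothetical, and the accept() filtering is left out entirely. It only shows why a recursive walk of this form never records a directory.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification of a recursive directory walk, not OpenGrok code.
final class IndexDownSketch {

    /** Adds every non-directory entry below dir to out; directories are only descended into. */
    static void collect(File dir, List<File> out) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or an I/O error occurred
        }
        for (File entry : entries) {
            // File.isDirectory() follows symlinks, so a symlink to a directory
            // takes the else branch as well.
            if (!entry.isDirectory()) {
                out.add(entry);
            } else {
                collect(entry, out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("walk-demo");
        Path sub = Files.createDirectory(root.resolve("sub"));
        Files.createFile(root.resolve("a.txt"));
        Files.createFile(sub.resolve("b.txt"));

        List<File> collected = new ArrayList<>();
        collect(root.toFile(), collected);
        // Only a.txt and sub/b.txt are listed; "sub" itself never is.
        collected.forEach(System.out::println);
    }
}
```

Because this sketch has no symlink filtering, a symlink cycle would make it recurse without end. That is acceptable for an illustration only.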

IndexDownArgs is further modified in processTrailingTerms(), called from within update(); however, that only happens for pre-existing index documents.

The history-based reindex (which is always non-initial), done in indexDownUsingHistory(), is a different story. There, the accept() call that identifies allowed symlinks is not used, so processFileIncremental(), the workhorse of this indexing mode, could actually add an IndexDownArgs entry that is a directory. For Git specifically, I don't think the Git file tree traversal can contain directories (since in Git a directory can be added to the Git index only if it is non-empty); however, if the entry is a symlink pointing to a directory, that is possible.
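To make the symlink case concrete, here is a small sketch of a flat list of paths, such as one taken from a changeset, in which one entry is a symlink to a directory. The class, the onlyRegularFiles() guard and the file names are all hypothetical. The guard is shown only to illustrate the distinction and is not a description of how OpenGrok handles or fixes this.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo, not part of OpenGrok.
final class SymlinkToDirectoryDemo {

    /** Keeps only paths that resolve to regular files; symlinks are followed. */
    static List<Path> onlyRegularFiles(List<Path> candidates) {
        List<Path> result = new ArrayList<>();
        for (Path p : candidates) {
            if (Files.isRegularFile(p)) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("symlink-demo");
        Path target = Files.createDirectory(root.resolve("target"));
        Path file = Files.createFile(root.resolve("file.txt"));
        // Creating symlinks may require extra privileges on some platforms.
        Path link = Files.createSymbolicLink(root.resolve("link"), target);

        // The entry is a symlink, yet it also answers "yes" to isDirectory().
        System.out.println(Files.isSymbolicLink(link));
        System.out.println(link.toFile().isDirectory());

        // Without a check like this, "link" would be handed on as if it were a file.
        System.out.println(onlyRegularFiles(Arrays.asList(file, link)));
    }
}
```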

That's why I asked about the "Starting file collection" log entry: so that I can see in which indexing mode this happens.

tarangchikhalia commented 5 months ago

Sorry for the delay. The project that was encountering this issue isn't showing it now. I am trying to reproduce it in a test environment.

vladak commented 5 months ago

> Sorry for the delay. The project that was encountering this issue isn't showing it now. I am trying to reproduce it in a test environment.

It definitely depends on the changes made since the last reindex. For history-based reindex, that would be the file trees in the newly added changesets.