oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/
Other
4.29k stars 739 forks source link

cannot remove suggester data for extended period of time due to pending rebuilds #4538

Closed vladak closed 4 months ago

vladak commented 5 months ago

When a project is being deleted (with 1.12.28) via API call and there is bunch of suggester rebuilds queued and in progress, resulting in the project delete operation taking very long time to complete:

"OpenGrok-api_task-projects-37725" #37725 prio=5 os_prio=64 cpu=703154.41ms elapsed=3178188.44s tid=0x000000000269b000 nid=0x9857 waiting for monitor entry  [0x00007ff312629000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.opengrok.suggest.Suggester.remove(Suggester.java:391)
        - waiting to lock <0x00007ff609bd7db8> (a java.lang.Object)
        at org.opengrok.web.api.v1.suggester.provider.service.impl.SuggesterServiceImpl.delete(SuggesterServiceImpl.java:233)
        at org.opengrok.web.api.v1.controller.ProjectsController.deleteProjectDataWorkHorse(ProjectsController.java:267)
        at org.opengrok.web.api.v1.controller.ProjectsController.deleteProjectWorkHorse(ProjectsController.java:197)
        at org.opengrok.web.api.v1.controller.ProjectsController.lambda$deleteProject$3(ProjectsController.java:189)
        at org.opengrok.web.api.v1.controller.ProjectsController$$Lambda$11076/0x00007feb1383b4b0.call(Unknown Source)
        at org.opengrok.web.api.ApiTask.lambda$getCallable$0(ApiTask.java:202)
        at org.opengrok.web.api.ApiTask$$Lambda$4215/0x00007ff297154458.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(java.base@11.0.7-internal/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7-internal/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7-internal/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.7-internal/Thread.java:834)

the thread holding the lock is probably this one:

"ForkJoinPool-13877-worker-19" #372875 daemon prio=5 os_prio=64 cpu=91891.56ms elapsed=92.65s tid=0x0000000010777000 nid=0x5c1f6 runnable  [0x00007ff26eefc000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.getFrame(SegmentTermsEnum.java:213)
        at org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.pushFrame(SegmentTermsEnum.java:241)
        at org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum.java:494)
        at org.apache.lucene.index.LeafReader.docFreq(LeafReader.java:81)
        at org.apache.lucene.index.BaseCompositeReader.docFreq(BaseCompositeReader.java:196)
        at org.opengrok.suggest.SuggesterUtils.computeNormalizedDocumentFrequency(SuggesterUtils.java:108)
        at org.opengrok.suggest.SuggesterUtils.computeScore(SuggesterUtils.java:97)
        at org.opengrok.suggest.SuggesterProjectData$WFSTInputIterator.weight(SuggesterProjectData.java:625)
        at org.apache.lucene.search.suggest.SortedInputIterator.sort(SortedInputIterator.java:190)
        at org.apache.lucene.search.suggest.SortedInputIterator.<init>(SortedInputIterator.java:80)
        at org.apache.lucene.search.suggest.SortedInputIterator.<init>(SortedInputIterator.java:64)
        at org.apache.lucene.search.suggest.fst.WFSTCompletionLookup$WFSTInputIterator.<init>(WFSTCompletionLookup.java:269)
        at org.apache.lucene.search.suggest.fst.WFSTCompletionLookup.build(WFSTCompletionLookup.java:107)
        at org.opengrok.suggest.SuggesterProjectData.build(SuggesterProjectData.java:268)
        at org.opengrok.suggest.SuggesterProjectData.build(SuggesterProjectData.java:255)
        at org.opengrok.suggest.SuggesterProjectData.rebuild(SuggesterProjectData.java:240)
        at org.opengrok.suggest.Suggester.lambda$getRebuildRunnable$2(Suggester.java:370)
        at org.opengrok.suggest.Suggester$$Lambda$4234/0x00007fff878fc500.run(Unknown Source)
        at java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(java.base@11.0.7-internal/ForkJoinTask.java:1407)
        at java.util.concurrent.ForkJoinTask.doExec(java.base@11.0.7-internal/ForkJoinTask.java:290)
        at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@11.0.7-internal/ForkJoinPool.java:1020)
        at java.util.concurrent.ForkJoinPool.scan(java.base@11.0.7-internal/ForkJoinPool.java:1656)
        at java.util.concurrent.ForkJoinPool.runWorker(java.base@11.0.7-internal/ForkJoinPool.java:1594)
        at java.util.concurrent.ForkJoinWorkerThread.run(java.base@11.0.7-internal/ForkJoinWorkerThread.java:177)

Also, there is bunch of threads waiting with stacks like this:

"ForkJoinPool.commonPool-worker-39" #69874 daemon prio=5 os_prio=64 cpu=364.29ms elapsed=2889951.75s tid=0x000000000a95e800 nid=0x118f4 waiting for monitor entry  [0x00007ff27c292000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.opengrok.suggest.Suggester.rebuild(Suggester.java:307)
        - waiting to lock <0x00007ff609bd7db8> (a java.lang.Object)
        at org.opengrok.web.api.v1.suggester.provider.service.impl.SuggesterServiceImpl.rebuild(SuggesterServiceImpl.java:213)
        at org.opengrok.web.api.v1.controller.ProjectsController.lambda$markIndexed$7(ProjectsController.java:370)
        at org.opengrok.web.api.v1.controller.ProjectsController$$Lambda$4233/0x00007fff878ff0b0.run(Unknown Source)
        at java.util.concurrent.CompletableFuture$AsyncRun.run(java.base@11.0.7-internal/CompletableFuture.java:1736)
        at java.util.concurrent.CompletableFuture$AsyncRun.exec(java.base@11.0.7-internal/CompletableFuture.java:1728)
        at java.util.concurrent.ForkJoinTask.doExec(java.base@11.0.7-internal/ForkJoinTask.java:290)
        at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@11.0.7-internal/ForkJoinPool.java:1020)
        at java.util.concurrent.ForkJoinPool.scan(java.base@11.0.7-internal/ForkJoinPool.java:1656)
        at java.util.concurrent.ForkJoinPool.runWorker(java.base@11.0.7-internal/ForkJoinPool.java:1594)
        at java.util.concurrent.ForkJoinWorkerThread.run(java.base@11.0.7-internal/ForkJoinWorkerThread.java:177)

Source code wise the waiting thread is trying to acquire the per-class lock in Suggester#remove(): https://github.com/oracle/opengrok/blob/505f4300fc8be9540017e1c71e42daaf475c5f4e/suggester/src/main/java/org/opengrok/suggest/Suggester.java#L390 while it is being held by a thread in Suggester#rebuild(): https://github.com/oracle/opengrok/blob/505f4300fc8be9540017e1c71e42daaf475c5f4e/suggester/src/main/java/org/opengrok/suggest/Suggester.java#L307

It is not clear to me why the global lock is needed considering the projectData is ConcurrentHashMap. Also, each per project suggester data class SuggesterProjectData has its own lock that is used to enclose the bodies of rebuild() and remove().

Lastly, it seems that the scheduler allows the queued threads to grab the lock in the order they attempted to acquire it, so the ProjectsController thread that came late to the party has to wait for them all to complete, (even though its operation could be quicker than the rebuilds) leading to the observed behavior. In my case there was circa 30 scheduled rebuilds waiting, each takes between 5-8 minutes to complete, so the deletion request has to wait for more than 2 hours (one of the reasons for #4536).