Closed: ghost closed this issue 5 years ago
I think I noticed this too after redeploy. Still not sure whether this is caused by CPU or I/O load (or both).
Assuming the workload created by the suggester rebuild is indeed CPU bound, we can try setting the thread priorities to minimum (https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html#MIN_PRIORITY) by overriding the thread factories used for the suggester's thread pools. It uses two executor services (http://demo.opengrok.org/xref/opengrok/suggester/src/main/java/org/opengrok/suggest/Suggester.java?r=f96b2fc6):
```java
// do NOT use fork join thread pool (work stealing thread pool) because it does not send interrupts upon cancellation
private final ExecutorService executorService = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
...
ExecutorService executor = Executors.newWorkStealingPool();
```
Also, I wonder whether lowering the priority alone would help, given https://stackoverflow.com/a/1617989 ; i.e. it might be necessary to yield explicitly if the workload is purely CPU bound.
Another idea would be to limit the pool size to significantly below availableProcessors(), and maybe make it a tunable in the suggester configuration.
Also, using a dedicated thread factory would have the added benefit of more descriptive thread names.
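Such a factory could look like the sketch below. This is not OpenGrok code; the thread-name prefix and class name are made up for illustration. It names the pool threads and drops them to minimum priority (which is only a hint to the OS scheduler, hence the yielding question above):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a dedicated factory for the suggester pools: descriptive names
// plus minimum priority so rebuild work yields to search request threads.
public class SuggesterThreadFactory implements ThreadFactory {
    private final AtomicInteger counter = new AtomicInteger(1);

    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r, "suggester-rebuild-" + counter.getAndIncrement());
        t.setPriority(Thread.MIN_PRIORITY); // scheduling hint only, not a guarantee
        t.setDaemon(true);
        return t;
    }

    public static void main(String[] args) {
        // Drop-in replacement for the plain newFixedThreadPool() call above.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors(), new SuggesterThreadFactory());
        pool.submit(() -> System.out.println(Thread.currentThread().getName()
                + " prio=" + Thread.currentThread().getPriority()));
        pool.shutdown();
    }
}
```

The same factory instance could be passed to both executor services, though newWorkStealingPool() would have to become an explicitly constructed pool to accept it.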
I added a new API endpoint to SuggesterController
to be able to trigger a rebuild on demand, and ran a Python script that performs search queries sequentially and records the times from the responses. Once the rebuild starts running, not a single search request goes through until the rebuild is done. During the rebuild, neither CPU nor I/O is saturated, so something else must be blocking the queries.
Marking as a stopper since this is quite a serious bug.
The Suggester itself holds a read-write lock in write mode while rebuilding. Something in the search path must perform a lookup that contends on that lock.
The trouble is that search reaches into the suggester, trying to grab a lock:
```
"http-nio-8080-exec-10@8218" daemon prio=5 tid=0x1883 nid=NA waiting
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Unsafe.java:-1)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at org.opengrok.suggest.SuggesterProjectData.incrementSearchCount(SuggesterProjectData.java:410)
at org.opengrok.suggest.SuggesterProjectData.incrementSearchCount(SuggesterProjectData.java:397)
at org.opengrok.suggest.Suggester.onSearch(Suggester.java:415)
at org.opengrok.web.api.v1.suggester.provider.service.impl.SuggesterServiceImpl.onSearch(SuggesterServiceImpl.java:246)
at org.opengrok.web.api.v1.controller.SearchController.search(SearchController.java:81)
at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source:-1)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
```
The meat of SearchController#search() is:
```java
81        suggester.onSearch(projects, engine.getQuery());
82
83        Map<String, List<SearchHit>> hits = engine.search(req, projects, startDocIndex, maxResults)
84                .stream()
85                .collect(Collectors.groupingBy(Hit::getPath,
86                        Collectors.mapping(h -> new SearchHit(h.getLine(), h.getLineno()), Collectors.toList())));
```
onSearch()
then tries to grab the RW lock for reading in SuggesterProjectData#incrementSearchCount();
however, the lock is held for writing in SuggesterProjectData#rebuild(),
so search requests stall until the rebuild completes.
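The stall can be reproduced with a minimal, self-contained sketch (class and variable names are illustrative, not OpenGrok code): once one thread holds the write lock, readers cannot acquire the read lock and a plain lock() call parks, exactly as the thread dump shows.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal illustration: while one thread holds the write lock (the rebuild),
// another thread (a search request) cannot take the read lock. tryLock()
// makes the failure visible without blocking; lock() would park instead.
public class ReadBlockedByWriteDemo {
    public static void main(String[] args) throws Exception {
        ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

        rwLock.writeLock().lock(); // "rebuild" takes the lock in write mode
        try {
            Thread reader = new Thread(() -> {
                boolean acquired = rwLock.readLock().tryLock();
                System.out.println("read lock acquired: " + acquired); // false
                if (acquired) {
                    rwLock.readLock().unlock();
                }
            });
            reader.start();
            reader.join();
        } finally {
            rwLock.writeLock().unlock();
        }
    }
}
```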
Full thread dump is available here: threads_report.txt
The fix might be to soften the locking performed by onSearch()
to utilize SuggesterProjectData#tryLock().
The downside of this approach, of course, is that search counts for terms searched during a rebuild will not be recorded. I think we can live with that, at least as a temporary measure. Otherwise, the requests would have to be queued and processed after the rebuild, basically converting the increment operation to be asynchronous.
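A sketch of that softening, with illustrative names rather than the actual SuggesterProjectData code: the update is simply skipped whenever the read lock cannot be taken immediately, so a search request never blocks behind a rebuild.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the proposed fix (class and method names are made up): use
// tryLock() so the search-count update is dropped, not blocked, when the
// rebuild holds the lock in write mode.
public class SearchCountSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final AtomicLong searchCount = new AtomicLong();

    /** Returns true if the count was recorded, false if a rebuild was in progress. */
    public boolean incrementSearchCount() {
        if (!lock.readLock().tryLock()) {
            return false; // rebuild in progress; drop this update instead of waiting
        }
        try {
            // A real implementation would update the persisted per-term counter here.
            searchCount.incrementAndGet();
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }

    public long getSearchCount() {
        return searchCount.get();
    }
}
```

An AtomicLong is used because several search threads can hold the read lock concurrently; the read lock only excludes the rebuild, not other readers.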
> The downside of this approach of course would be that search count for terms searched during rebuild will not be recorded.

Yes, I think it's not a big deal.

> Otherwise, the requests would have to be queued and processed after rebuild - basically convert the increment operation to be asynchronous.

Ideally.
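The queued variant could be sketched like this (all names are illustrative): during a rebuild, searched terms are parked in a concurrent queue and replayed once the rebuild finishes, so no increments are lost and no search thread blocks.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.LongAdder;

// Sketch of the asynchronous alternative (names are made up): instead of
// dropping increments during a rebuild, buffer the searched terms and
// replay them after the rebuild releases the write lock.
public class DeferredSearchCounts {
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();
    private final Queue<String> deferred = new ConcurrentLinkedQueue<>();
    private volatile boolean rebuilding;

    public void onSearch(String term) {
        if (rebuilding) {
            deferred.add(term); // replayed later; the search thread never blocks
        } else {
            counts.computeIfAbsent(term, t -> new LongAdder()).increment();
        }
    }

    public void rebuildStarted() {
        rebuilding = true;
    }

    public void rebuildFinished() {
        rebuilding = false;
        // Drain everything queued while the rebuild was running. A real
        // implementation would also need to handle searches racing with
        // the flag flip; this sketch ignores that edge case.
        String term;
        while ((term = deferred.poll()) != null) {
            counts.computeIfAbsent(term, t -> new LongAdder()).increment();
        }
    }

    public long count(String term) {
        LongAdder a = counts.get(term);
        return a == null ? 0 : a.sum();
    }
}
```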
I confirmed that this unresponsiveness problem was fixed in 1.1.2. Thank you very much!
Tomcat runs the suggester rebuild starting at midnight (0:00 am) for each web app, which makes OpenGrok's search function unresponsive for about 15 seconds per web app. We have 120+ branches on one Tomcat, and each branch is one web app (a .war file) under /webapps, so there are 120+ war files.
The suggester rebuilds start at exactly 0:00 am and end around 0:30 am (15 sec x 120 / 60 = 30 min). During this period (about 30 minutes, so far), Tomcat's CPU usage gets extremely high and we cannot use the search function at all. (We can still see directory listings and browse file contents, though.)
Here is an example from catalina.log:
Environment: Tomcat 9.0.13, Java 1.8.0_192, OpenGrok 1.1 (for most of the .war files; some are 1.1-rc58)
We provide the OpenGrok service 24 hours a day to worldwide users in our company, so a 30-minute gap is not acceptable, I think.
Please provide any solutions or suggestions to avoid this situation. Thank you very much!