Suggester rebuild by Tomcat gets search function irresponsive until the rebuild is completed

ghost commented 5 years ago

Tomcat does suggester rebuild starting at midnight (0:00am) for each web apps and it gets OpenGrok's search function irresponsive for about 15 seconds for each web apps. We have 120+ branches on one Tomcat and each branch is one web app (.war files) under /webapps. (There are 120+ war files.)
This suggester rebuild starts from exactly 0:00am and ends around 0:30am. (15sec x 120 / 60 = 30min) During these periods (about 30 min, so far) Tomcat's CPU usage gets extremely high and we can not use the search function at all. (We can still see directory lists and browse file contents though.)

Here is a example in a catalina.log

pisces@sjc-opengrok-1>head -n 387 catalina.2018-12-17.log | grep "pool-475"
17-Dec-2018 00:00:00.069 INFO [pool-475-thread-1] org.opengrok.suggest.Suggester.rebuild Rebuilding following suggesters: []
17-Dec-2018 00:30:01.549 INFO [pool-475-thread-1] org.opengrok.suggest.Suggester.shutdownAndAwaitTermination Suggesters for [] were successfully rebuilt
17-Dec-2018 00:30:01.552 INFO [pool-475-thread-1] org.opengrok.web.api.v1.suggester.provider.service.impl.SuggesterServiceImpl.scheduleRebuild Scheduling suggester rebuild in PT23H29M58.449S

Environment: Tomcat 9.0.13 Java 1.8.0_192 OpenGrok 1.1 (for the most of .war files, some are 1.1-rc58)

We provide OpenGrok service 24 hours for world wide users in our company. 30 minutes gap is not acceptable, I think.

Please provide any solutions or suggestions to avoid this situation. Thank you very much!

vladak commented 5 years ago

I think I noticed this too after redeploy. Still not sure whether this is caused by CPU or I/O load (or both).

vladak commented 5 years ago

Assuming the workload created by the suggester rebuild is indeed CPU bound, we can try setting thread priorities to minimum (https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html#MIN_PRIORITY) by overriding thread factories used for suggester's thread pools. It is using 2 executor services (http://demo.opengrok.org/xref/opengrok/suggester/src/main/java/org/opengrok/suggest/Suggester.java?r=f96b2fc6):

87      // do NOT use fork join thread pool (work stealing thread pool) because it does not send interrupts upon cancellation
88      private final ExecutorService executorService = Executors.newFixedThreadPool(
89              Runtime.getRuntime().availableProcessors());
...

227              ExecutorService executor = Executors.newWorkStealingPool();

Also, I wonder if this could help given https://stackoverflow.com/a/1617989 , i.e. it might be necessary to yield if the workload is purely CPU bound.

vladak commented 5 years ago

Another idea would be to limit the size of the pool significantly below availableProcessors() - maybe make it a tunable in suggester configuration ?

vladak commented 5 years ago

Also, using dedicated thread factory will have the benefit of using more descriptive thread names.

vladak commented 5 years ago

I added new API endpoint to SuggesterController to be able to trigger rebuild on demand and started a Python script to perform search query sequentially and record the times from the response. Once the rebuild starts running, not even one search request goes through until the rebuild is done. During the rebuild neither CPU or I/O are saturated. There has to be something else which blocks the queries.

vladak commented 5 years ago

Marking as stopper since this is quite serious bug.

tulinkry commented 5 years ago

Suggester itself holds a readwrite lock in write mode while rebuilding. There must be something in the search queries which does a lookup or something and fights with the readwrite lock.

vladak commented 5 years ago

The trouble is that search reaches to the suggester, trying to grab a lock:

"http-nio-8080-exec-10@8218" daemon prio=5 tid=0x1883 nid=NA waiting
  java.lang.Thread.State: WAITING
      at sun.misc.Unsafe.park(Unsafe.java:-1)
      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
      at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
      at org.opengrok.suggest.SuggesterProjectData.incrementSearchCount(SuggesterProjectData.java:410)
      at org.opengrok.suggest.SuggesterProjectData.incrementSearchCount(SuggesterProjectData.java:397)
      at org.opengrok.suggest.Suggester.onSearch(Suggester.java:415)
      at org.opengrok.web.api.v1.suggester.provider.service.impl.SuggesterServiceImpl.onSearch(SuggesterServiceImpl.java:246)
      at org.opengrok.web.api.v1.controller.SearchController.search(SearchController.java:81)
      at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source:-1)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)

The meat of SearchController#search() is:


81              suggester.onSearch(projects, engine.getQuery());
82  
83              Map<String, List<SearchHit>> hits = engine.search(req, projects, startDocIndex, maxResults)
84                      .stream()
85                      .collect(Collectors.groupingBy(Hit::getPath,
86                              Collectors.mapping(h -> new SearchHit(h.getLine(), h.getLineno()), Collectors.toList())));

and onSearch() tries to grab the RW lock for reading in SuggesterProjectData#incrementSearchCount() however the lock is locked for writing in SuggesterProjectData#rebuild() hence the search requests are stalled until the rebuild is complete.

Full thread dump is available here: threads_report.txt

vladak commented 5 years ago

The fix might be to soften the locking performed by onSearch() to utilize SuggesterProjectData#tryLock().

vladak commented 5 years ago

The downside of this approach of course would be that search count for terms searched during rebuild will not be recorded. I think we can live with that, at least as a temporary measure. Otherwise, the requests would have to be queued and processed after rebuild - basically convert the increment operation to be asynchronous.

tulinkry commented 5 years ago

The downside of this approach of course would be that search count for terms searched during rebuild will not be recorded.

Yes, I think it's not a big deal.

Otherwise, the requests would have to be queued and processed after rebuild - basically convert the increment operation to be asynchronous.

Ideally.

ghost commented 5 years ago

I confirmed that this irresponsive problem was fixed on 1.1.2. Thank you very much!

oracle / opengrok

Suggester rebuild by Tomcat gets search function irresponsive until the rebuild is completed #2593