google-cloudsearch / norconex-committer-plugin

Google Cloud Search Norconex HTTP Collector Indexer Plugin
Apache License 2.0

Norconex Collector or Cloud Search Committer process not exiting #17

Open MathewLeung opened 4 years ago

MathewLeung commented 4 years ago

Hi

We have observed an intermittent problem where the Norconex collector process does not exit even though the log indicates that the crawl has completed. We are running committer plugin v1-0.0.5 with Norconex collector v2.9.1 in a Docker container, based on the instructions found here. Our start script echoes a success or error message when the collector process returns, but occasionally the process never returns and neither message appears. Please see the log entries below.

A 2020-07-11T16:53:11.095961101Z INFO  [GoogleCloudSearchCommitter] Document (text/html) indexed (213 KB / 1288ms): https://a.domain.com/index.html
A 2020-07-11T16:53:11.096007563Z INFO  [GoogleCloudSearchCommitter] Indexing Service release reference count: 1
A 2020-07-11T16:53:11.096015720Z INFO  [GoogleCloudSearchCommitter] Stopping indexingService: 0 
A 2020-07-11T16:53:11.097438429Z Jul 11, 2020 4:53:11 PM com.google.enterprise.cloudsearch.sdk.BatchRequestService shutDown 
A 2020-07-11T16:53:11.097483701Z INFO: Shutting down batching service. flush on shutdown: true 
A 2020-07-11T16:53:14.808163240Z INFO  [GoogleCloudSearchCommitter] Shutting down (took: 3712ms)! 
A 2020-07-11T16:53:14.808202056Z INFO  [GoogleCloudSearchCommitter] Indexing Service reference count: 0 
A 2020-07-11T16:53:14.946960495Z INFO  [AbstractCrawler] Crawler A: 4310 reference(s) processed. 
A 2020-07-11T16:53:14.947056707Z INFO  [CrawlerEventManager]          CRAWLER_FINISHED 
A 2020-07-11T16:53:14.947143098Z INFO  [AbstractCrawler] Crawler A: Crawler completed. 
A 2020-07-11T16:53:14.948546999Z INFO  [AbstractCrawler] Crawler A: Crawler executed in 5 hours 1 minute 18 seconds. 
A 2020-07-11T16:53:14.948568748Z INFO  [SitemapStore] Crawler A: Closing sitemap store... 
A 2020-07-11T16:53:14.953454126Z INFO  [JobSuite] Running Crawler A: END (Sat Jul 11 11:51:56 UTC 2020) 
<!-- Occasionally the log entries stop at the line above, and the VM process stalls while consuming minimal CPU. We expect the log entries below to appear after the index items are committed and the crawl completes successfully. -->
2020-07-15Txx:xx:xxZ INFO  [JobSuite] Running Cloud Search HTTP Collector: END 
2020-07-15Txx:xx:xxZ crawl process exited successfully   <--- echoed by start.sh after the command 'collector-http.sh -a start -c ...' returns; this point is not always reached

start.sh

#!/bin/bash
#set -x
#set -e

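# Run the collector and report whether it returned success or failure.
# Paths are quoted so a CRAWLER_HOME containing spaces does not break the call.
if "${CRAWLER_HOME}/collector-http.sh" -a start -c "${CRAWLER_HOME}/config/crawler-config.xml"
then
    echo "$(date) crawl process exited successfully"
else
    echo "$(date) - Error occurred running crawler."
fi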

Is there a good explanation for why this happens? Could it be an issue with the committer, or does the log indicate an issue with the collector? I am looking for ideas on how to troubleshoot this, and I wonder whether anyone has seen it before or could point me in the right direction. I appreciate your time on this.
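One way to gather more data would be to wrap the collector call in a watchdog, so that a stalled run is at least reported instead of hanging silently. The following is only a rough sketch: `run_with_watchdog`, its time limit, and the exit code 124 are hypothetical choices for illustration and are not part of the collector or the committer. Depending on how `collector-http.sh` launches the JVM, the signals may reach only the wrapper shell, so this is a starting point for diagnosis rather than a fix.

```shell
#!/bin/bash
# Hypothetical helper (not part of Norconex or the committer plugin).
# Usage: run_with_watchdog <max_seconds> <command> [args...]
# Returns the command's own exit status, or 124 if the time limit was hit.
run_with_watchdog() {
    local max_seconds="$1"; shift
    local waited=0

    "$@" &                      # start the command in the background
    local pid=$!

    # Poll once per second while the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$max_seconds" ]; then
            echo "$(date) - process $pid still running after ${max_seconds}s" >&2
            # Assumption: if a JDK is available in the image, a thread dump
            # of the JVM (for example with jstack) could be captured here
            # before the process is stopped.
            kill -TERM "$pid" 2>/dev/null   # ask it to stop
            sleep 1                         # short grace period
            kill -KILL "$pid" 2>/dev/null   # force it if still alive
            wait "$pid" 2>/dev/null
            return 124
        fi
        sleep 1
        waited=$((waited + 1))
    done

    # The process ended on its own: pass its exit status through.
    wait "$pid"
}
```

In start.sh this could replace the direct call, for example `run_with_watchdog 28800 "${CRAWLER_HOME}/collector-http.sh" -a start -c "${CRAWLER_HOME}/config/crawler-config.xml"`, where 28800 seconds is an arbitrary limit chosen to be longer than the roughly five-hour crawl shown in the log.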

essiembre commented 4 years ago

For reference, this relates to https://github.com/Norconex/collector-http/issues/708