Committer is leading to lock issues with the mvstore

SaschaHeyer commented 5 years ago

Hi Google Team,

we found a critical issues when running Norconex in a scheduled manner (short crawl interval of 1 minute). When Norconex is indexing documents and is done with indexing the norconex process is marked as completed even though the committer plugin is still committing files.

This leads to lock errors when the next scheduled crawl run is starting and the crawler fails.

The file is locked: nio:./obfuscated-output/crawlstore/mvstore/obfuscated/mvstore [1.4.199/7]

The issues is reproducible with the following steps:

Start a crawl run every minute (using crontab for example)
Index a large file or multiple files) You can use our test file 100MB https://storage.googleapis.com/norconex-issue-reproduction/test.pdf
The first crawl run is working as expected
The crawl runs after this are failing with lock error

Please let me know if you need further information

Best regards Sascha

srinicodebytes commented 5 years ago

Hi Sascha,

Can you enable SDK logging and let us know if you find any issues.

Srinivas.

SaschaHeyer commented 5 years ago

Hi @srinicodebytes ,

Can you tell us how we can enable the SDK logging? https://developers.google.com/cloud-search/docs/reference/connector-configuration, seems not to be documented.

Like that?

log4j.logger.com.norconex.committer.googlecloudsearch=DEBUG

We will then enable the SDK logging and update this issue accordingly.

With the described steps you are able to reproduce the behavior also on your environment. This way we don't have to ping pong log messages and you can reproduce and test the issue directly on your machine.

Best regards Sascha

srinicodebytes commented 5 years ago

Hi Sascha,

You can add "-Djava.util.logging.config.file=logging.properties" to collector-http.sh as the first config option to java.

How long does it take to process 100mb pdf file in your environment?.

We just want to see if there are any other errors in sdk logging. I was able to reproduce the error. However, this seems to be due to the second Norconex Committer instance trying to open crawlstore while the first committer is still processing and has lock on it.

https://github.com/Norconex/collector-http/issues/615

Srinivas.

srinicodebytes commented 5 years ago

https://github.com/Norconex/collector-http/issues/615

SaschaHeyer commented 5 years ago

Hi Srinivas,

thank you for your feedback.

To be exact is does not have to be a single file with 100mb, this issue is also reproducible with multiple smaller files. I just took the larger file to make the reproduction easier.

In our customer environments we seeing this behavior also without large files. Its caused because the queue is still being processed while another crawl run is already starting even though the crawler is not marked as completed.

Exactly the Norconex Committer instance is trying to open the crawlstore, while the first committer is still processing.

I added two additional log files

The first crawl run, as you can see the crawler is not marked as completed. first.log
The second run immediately fails second.log

And it seems as soon as the next crawlrun is starting the committer is not able to finish the process. Which means some documents are might missing in the index. (We are currently investigating this behavior as well, further details soon).

Best regards Sascha

srinicodebytes commented 5 years ago

Hi Sascha,

Yes, this can be reproduced with multiple smaller files.

In my testing, the first committer did finish indexing all the files. The second one ran into the file locked error. After the first committer finished indexing the files, all the subsequent committers have "REJECTED_UNMODIFIED" message for all the files. I didn't see any files missed to index.

Wonder why yours didn't finish. Did you check the backup logs? As soon as the second committer is launched, the first committer's log gets renamed and gets moved to backup folder, while it is still processing/indexing files.

Srinivas.

SaschaHeyer commented 5 years ago

Hi @srinicodebytes

even if the first committer is processing we should not get the lock error message.

Here is some feedback from the Norconex developer https://github.com/Norconex/collector-http/issues/615:

"Normally you do not have to do anything since your new process would throw an exception and you should have in your logs something that starts with "JOB SUITE ALREADY RUNNING". In other words, it should detect a job is already running and abort (letting the current job finish normally)."

Seems this detection is not working properly when using the Google Cloud Search Committer.

Best regards Sascha

srinicodebytes commented 5 years ago

Hi Sascha,

Did your testing miss indexing any files? Are these modified while the first Norconex is still running? Or the new files that weren't processed before? Just want to make sure no files are missed indexing. If some files are missed indexing, please provide additional information. I could not reproduce it.

Yes, the tests log "JOB SUITE ALREADY RUNNING" message and can be seen if the crontab console output is redirected. Or by manually invoking another Norconex instance in a terminal window, soon after the first one is launched.

My observation is that any Norconex instance launched after the committer starts, the "The file is locked" error is seen. For some reason, while committer is processing files, Norconex doesn't detect that the job is already running, and hence runs into file lock error.

It can be seen from logs, committer starts from messages " INFO - Starting up! INFO - Created indexingService: 0" and ends at "INFO - Stopping indexingService: 0 INFO - Shutting down (took: xxxx ms)! ". Any Norconex instance invoked during this time runs into file lock error. And any Norconex invoked before this detects earlier Norconex and aborts with "JOB SUITE ALREADY RUNNING" message.

Google Cloud Search Norconex committer plugin doesn't lock or release Norconex crawlstore file. It's the Norconex caller that takes care of locking and releasing crawlstore file.

I have tested Norconex SQL Committer as well using crontab invoking Norconex every 1 minute, and have seen this same behavior.

Regards, Srinivas.

SaschaHeyer commented 5 years ago

Hi Srinivas,

Please let us solve one issue after another (let's ignore the files which might are missing).

First we need a solution for the file is locked issue. As you also recognized for some reasons Norconex doesn't detect that the job is running and runs into lock issues. Since this issues does not exist when not using the Google Cloud Search Committer I assume there is a correlation.

Are you in contact with the Norconex team in order to investigate this behavior?

Best regards Sascha

srinicodebytes commented 5 years ago

Hi Sascha,

I looked at the other issue you opened with Norconex (issue#634). Looks like Norconex was unable to reproduce the issue. Were you able to reproduce the issue with SQL Committer?

As mentioned, the issue, "The file is locked" error, happens when another Norconex instance is started while the committer is still processing the files. The trick is to let the committer process for long enough time ( ex: 10 sec or more), so that another Norconex instance can be started. This can be done by either of two ways: 1. Have large number of files for processing depending on the test environment. 2. Introducing a delay in committer for this testing. For example introduce a delay of 10 sec right above https://github.com/Norconex/committer-sql/blob/master/norconex-committer-sql/src/main/java/com/norconex/committer/sql/SQLCommitter.java#L465 for this testing.

Hope this helps in reproducing the issue in SQL Committer.

Best Regards, Srinivas.

SaschaHeyer commented 5 years ago

Hi Srinivas,

I assume Norconex did not tested it with the Google Cloud Search Committer and therefore was not able to reproduce the behavior.

Why are you talking about the SQL Committer, we don't use it and this issue is not related to it.

I know that "The file is locked" error, happens when another Norconex instance is started while the committer is still processing the files. That's why we created this issue.

The only questions related to this issue is "how can we solve this lock error"?

Best regards Sascha

srinicodebytes commented 5 years ago

Hi Sascha,

The reason to test with SQL Committer is to isolate and root cause the issue. With this test, it is clear that Google Cloud Search Committer is not causing the issue.

As mentioned earlier, Google Cloud Search Norconex Committer plugin doesn't lock or release Norconex crawlstore file. It's the Norconex caller that takes care of locking and releasing crawlstore file.

As this issue is caused by a bug in Norconex side, would like to close this issue. Please let us know if you need further help on this.

Best Regards, Srinivas.

essiembre commented 5 years ago

FYI, a fix has been provided in the latest snapshot release: https://github.com/Norconex/collector-http/issues/634

google-cloudsearch / norconex-committer-plugin

Committer is leading to lock issues with the mvstore #15