googlegsa / manager.v3

Google Search Appliance Connector Manager
Apache License 2.0

Snapshot corruption stops indexing forever #233

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run the connector manager.
2. Add an instance of the file-system connector that points to a repository
with a few hundred thousand documents.
3. Configure this instance to index 1000 documents per minute.
4. Let the connector manager run and index overnight.
5. Shut down and restart the connector manager a few times.

What is the expected output? What do you see instead?
The file-system connector should continue indexing gracefully no matter how
many times the connector manager is restarted, since restarts can happen for
many reasons: a Tomcat shutdown, server maintenance, a power outage, etc.
However, when the connector manager restarts, the following exception is
sometimes thrown and the file-system connector stops indexing forever. The
failure is intermittent: sometimes indexing continues without the exception;
other times the exception is thrown and the only thing I can do is re-index
the whole repository for that instance.

Feb 16, 2012 9:45:27 AM com.google.enterprise.connector.traversal.QueryTraverser runBatch
WARNING: resumeTraversal threw exception: 
com.google.enterprise.connector.spi.RepositoryException: Snapshot recovery failed.: failed to open snapshot: 1
    at com.google.enterprise.connector.util.diffing.DocumentSnapshotRepositoryMonitorManagerImpl.start(DocumentSnapshotRepositoryMonitorManagerImpl.java:175)
    at com.google.enterprise.connector.util.diffing.DiffingConnectorTraversalManager.resumeTraversal(DiffingConnectorTraversalManager.java:126)
    at com.google.enterprise.connector.traversal.QueryTraverser.runBatch(QueryTraverser.java:140)
    at com.google.enterprise.connector.instantiator.CancelableBatch.run(CancelableBatch.java:74)
    at com.google.enterprise.connector.instantiator.ThreadPool$LazyThreadPool$CancelTimeoutRunnable.run(ThreadPool.java:298)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: com.google.enterprise.connector.util.diffing.SnapshotStoreException: failed to open snapshot: 1
    at com.google.enterprise.connector.util.diffing.SnapshotStore.openSnapshot(SnapshotStore.java:234)
    at com.google.enterprise.connector.util.diffing.SnapshotStore.stitch(SnapshotStore.java:291)
    at com.google.enterprise.connector.util.diffing.DocumentSnapshotRepositoryMonitorManagerImpl.recoverSnapshotStores(DocumentSnapshotRepositoryMonitorManagerImpl.java:143)
    at com.google.enterprise.connector.util.diffing.DocumentSnapshotRepositoryMonitorManagerImpl.start(DocumentSnapshotRepositoryMonitorManagerImpl.java:172)
    ... 16 more

What version of the product are you using? On what operating system?
Using connector-manager 2.8 on Windows 7

Please provide any additional information below.
This issue is related to the way the file-system connector works. However, the
code that needs to change is inside the connector manager, so I am reporting
it here as a connector manager issue.

Original issue reported on code.google.com by pnguyen1...@gmail.com on 17 Feb 2012 at 7:35

GoogleCodeExporter commented 9 years ago
Duplicate of Google bug #6019938, which is fixed and will be part of the 2.8.4 
release.

Original comment by jla...@google.com on 23 Feb 2012 at 9:55

GoogleCodeExporter commented 9 years ago
How can I get the new code that fixes the issue?

Original comment by pnguyen1...@gmail.com on 28 Feb 2012 at 8:48

GoogleCodeExporter commented 9 years ago
I have checked the latest code and I believe this bug is a different one. It
needs to be fixed in the RecoveryFile class. The cause is the way the code
decides which of the two recovery files is older. System.nanoTime can only be
used to measure elapsed time relative to some fixed but arbitrary origin. Its
values cannot be compared to decide which file is older. The number returned
could be smaller than the earlier one, in which case the recovery file is not
updated. The next time the connector manager reads the recovery file, it gets
stale information, tries to access a snapshot that has already been deleted,
and throws that exception.
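
To illustrate the point about System.nanoTime, here is a minimal sketch. It is
not the actual RecoveryFile code: the Record class, its fields, the file names,
and the stamp values are all hypothetical, and the persisted sequence counter
is only one possible alternative. The Javadoc for System.nanoTime says the
origin is fixed only within one JVM instance and that other JVM instances are
likely to use a different one, so a stamp written before a restart can be
larger than one written after it.

```java
public class RecoveryOrdering {
    /** Hypothetical stand-in for the metadata of one recovery file. */
    static final class Record {
        final String name;
        final long nanoStamp; // value of System.nanoTime() when the file was written
        final long sequence;  // assumed counter persisted with the file, +1 per write

        Record(String name, long nanoStamp, long sequence) {
            this.name = name;
            this.nanoStamp = nanoStamp;
            this.sequence = sequence;
        }
    }

    /** Fragile: the two stamps may come from different JVM runs with different origins. */
    static Record newerByNanoTime(Record a, Record b) {
        return a.nanoStamp >= b.nanoStamp ? a : b;
    }

    /** Restart-safe: compares a value that is stored on disk and only ever grows. */
    static Record newerBySequence(Record a, Record b) {
        return a.sequence >= b.sequence ? a : b;
    }

    public static void main(String[] args) {
        // Simulated stamps: the run before the restart happened to have a large
        // nanoTime reading; the run after the restart has a smaller one.
        Record beforeRestart = new Record("recovery.1", 9_000_000_000L, 41);
        Record afterRestart = new Record("recovery.2", 1_000_000_000L, 42);

        // Picks the stale file, because its nanoTime stamp is numerically larger.
        System.out.println(newerByNanoTime(beforeRestart, afterRestart).name); // recovery.1
        // Picks the file that was actually written last.
        System.out.println(newerBySequence(beforeRestart, afterRestart).name); // recovery.2
    }
}
```

A wall-clock value such as System.currentTimeMillis would also survive a
restart, but it can go backwards if the system clock is adjusted, which is why
the sketch uses a stored counter instead.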

Hope this helps with fixing the bug.

Original comment by pnguyen1...@gmail.com on 28 Feb 2012 at 9:05