opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Segment Replication] [BUG] No such file exception due to missing index file in get_checkpoint_info #4310

Closed dreamer-89 closed 2 years ago

dreamer-89 commented 2 years ago

Describe the bug

On the primary, building the CopyState object for the get_checkpoint_info transport response fails with a NoSuchFileException because an index file is missing from the local store. This reproduces on main, which already contains the fix from https://github.com/opensearch-project/OpenSearch/pull/4288.

It looks like the indexShard object built from IndexService is outdated: it still references _9.cfe, even though the latest in-memory SegmentInfos fetched using getLatestSegmentInfos() in InternalEngine does not contain that file. By the time the metadata is read, the file is long gone from the shard store. The logs below (which print the files in SegmentInfos, both in memory and on disk) show that the in-memory copy held _9.cfe at one point, but a later operation (index refresh/commit?) removed all of these files and built a new set. It would be useful to know why the indexShard (built from indexService) is not up to date with the in-memory state of SegmentInfos.
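To make the suspected mismatch concrete, here is a minimal, stdlib-only sketch that compares a (possibly stale) list of referenced segment files against what is actually on disk. It is not OpenSearch or Lucene code: the class name `StaleSegmentFileCheck`, the method `missingFiles`, and the temporary directory are all hypothetical and used only for illustration; the file names are borrowed from the log traces in this report.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of OpenSearch: reports which files a snapshot
// of segment file names still references after they were removed from disk.
public class StaleSegmentFileCheck {

    /** Returns the referenced file names that are absent from indexDir, sorted by name. */
    static Set<String> missingFiles(Path indexDir, Collection<String> referencedFiles) throws IOException {
        Set<String> onDisk;
        try (Stream<Path> listing = Files.list(indexDir)) {
            onDisk = listing.map(p -> p.getFileName().toString()).collect(Collectors.toSet());
        }
        Set<String> missing = new TreeSet<>(referencedFiles);
        missing.removeAll(onDisk);
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the shard's index directory after old segments were removed.
        Path indexDir = Files.createTempDirectory("index");
        for (String name : new String[] { "segments_2", "_2.si", "_2.cfe", "_2.cfs" }) {
            Files.createFile(indexDir.resolve(name));
        }
        // Stand-in for an outdated view that still lists the _9.* files.
        Set<String> staleView = Set.of("segments_2", "_2.si", "_2.cfe", "_2.cfs", "_9.si", "_9.cfe", "_9.cfs");
        System.out.println("referenced but missing on disk: " + missingFiles(indexDir, staleView));
    }
}
```

Logging such a diff right before the metadata snapshot is built would show whether the file list being read is older than the directory contents.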

Log traces

[2022-08-26T15:47:00,116][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos -------------------------
[2022-08-26T15:47:00,116][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,116][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _1.cfs
[2022-08-26T15:47:00,116][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _1.cfe
[2022-08-26T15:47:00,116][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _1.si
[2022-08-26T15:47:00,116][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _0.cfe
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _0.si
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _0.cfs
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _3.si
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _3.cfs
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _3.cfe
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.si
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfe
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfs
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _4.cfe
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _4.cfs
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _4.si
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _5.cfs
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _5.cfe
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _5.si
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _6.cfe
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _6.cfs
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _6.si
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _7.cfs
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _7.cfe
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _7.si
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _9.cfe
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _9.cfs
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _9.si
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _8.cfe
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _8.cfs
[2022-08-26T15:47:00,119][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _8.si
[2022-08-26T15:47:00,129][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos On-disk -------------------------
[2022-08-26T15:47:00,129][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,135][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos -------------------------
[2022-08-26T15:47:00,135][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,135][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.nvd
[2022-08-26T15:47:00,135][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.si
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.pos
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tmd
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.nvm
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.dvd
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fnm
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdi
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdm
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.dvm
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.doc
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdd
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdx
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tim
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tip
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdt
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdm
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.si
[2022-08-26T15:47:00,137][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfe
[2022-08-26T15:47:00,137][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfs
[2022-08-26T15:47:00,140][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos On-disk -------------------------
[2022-08-26T15:47:00,140][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,142][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos -------------------------
[2022-08-26T15:47:00,142][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,142][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.nvd
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.si
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.pos
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tmd
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.nvm
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.dvd
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fnm
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdi
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdm
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.dvm
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.doc
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdd
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdx
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tim
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tip
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdt
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdm
[2022-08-26T15:47:00,144][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.si
[2022-08-26T15:47:00,144][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfe
[2022-08-26T15:47:00,144][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfs
[2022-08-26T15:47:00,148][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos -------------------------
[2022-08-26T15:47:00,148][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos -------------------------
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.nvd
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.si
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.pos
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tmd
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.nvm
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.dvd
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fnm
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdi
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdm
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.dvm
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.doc
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdd
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdx
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tim
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tip
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdt
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdm
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.si
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfe
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfs
[2022-08-26T15:47:00,160][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t6] replication failure
org.opensearch.OpenSearchException: Segment Replication failed
    at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:251) [main/:?]
    at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [main/:?]
    at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [main/:?]
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
    at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
    at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
    at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
    at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
    at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
    at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [main/:?]
    at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [main/:?]
    at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [main/:?]
    at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [main/:?]
    at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [main/:?]
    at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [main/:?]
    at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:201) [main/:?]
    at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:193) [main/:?]
    at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [main/:?]
    at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1379) [main/:?]
    at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) [main/:?]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [main/:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [node_t1][127.0.0.1:57189][internal:index/shard/replication/get_checkpoint_info]
Caused by: java.nio.file.NoSuchFileException: /Users/singhnjb/OpenSearch/server/build/testrun/internalClusterTest/temp/org.opensearch.indices.replication.SegmentReplicationIT_15BF646F38F371F3-001/tempDir-009/node_t1-shared/mkDuAoQiFK/0/wWPbk5ETT5-ik2ZIwqmjLg/0/index/_9.cfe
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
    at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181) ~[?:?]
    at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newFileChannel(FilterFileSystemProvider.java:204) ~[lucene-test-framework-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
    at org.apache.lucene.tests.mockfile.DisableFsyncFS.newFileChannel(DisableFsyncFS.java:44) ~[lucene-test-framework-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
    at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newFileChannel(FilterFileSystemProvider.java:204) ~[lucene-test-framework-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
    at org.apache.lucene.tests.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:171) ~[lucene-test-framework-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
    at org.apache.lucene.tests.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:171) ~[lucene-test-framework-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
    at java.nio.channels.FileChannel.open(FileChannel.java:298) ~[?:?]
    at java.nio.channels.FileChannel.open(FileChannel.java:357) ~[?:?]
    at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78) ~[lucene-core-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
    at org.opensearch.index.store.FsDirectoryFactory$HybridDirectory.openInput(FsDirectoryFactory.java:166) ~[main/:?]
    at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
    at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
    at org.opensearch.index.store.Store$MetadataSnapshot.checksumFromLuceneFile(Store.java:1092) ~[main/:?]
    at org.opensearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:1064) ~[main/:?]
    at org.opensearch.index.store.Store$MetadataSnapshot.<init>(Store.java:941) ~[main/:?]
    at org.opensearch.index.store.Store.getMetadata(Store.java:334) ~[main/:?]
    at org.opensearch.indices.replication.common.CopyState.<init>(CopyState.java:52) ~[main/:?]
    at org.opensearch.indices.replication.OngoingSegmentReplications.getCachedCopyState(OngoingSegmentReplications.java:81) ~[main/:?]
    at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:140) ~[main/:?]
    at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:107) ~[main/:?]
    at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:88) ~[main/:?]
    at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[main/:?]
    at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[main/:?]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[main/:?]
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[main/:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]
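The innermost cause in the trace above is a plain `java.nio` failure: a file name was taken from a list, and the file was gone by the time it was opened. The following stdlib-only sketch reproduces that failure mode in isolation. It is an illustration under stated assumptions, not the `Store.MetadataSnapshot` code: the class `DeletedFileReadSketch` and the method `firstUnreadable` are hypothetical names.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical sketch, not OpenSearch code: opening files from a name list
// that was captured before one of the files was deleted.
public class DeletedFileReadSketch {

    /** Opens each listed file for reading; returns the first name that no longer exists, or null. */
    static String firstUnreadable(Path dir, List<String> names) throws IOException {
        for (String name : names) {
            try (FileChannel channel = FileChannel.open(dir.resolve(name), StandardOpenOption.READ)) {
                channel.size(); // touch the file so the open is not optimized away by the reader's eye
            } catch (NoSuchFileException e) {
                return name; // same exception type as in the trace above
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index");
        List<String> snapshot = List.of("_9.cfs", "_9.cfe", "_9.si");
        for (String name : snapshot) {
            Files.createFile(dir.resolve(name));
        }
        // The file list was captured above; now one file disappears before it is read.
        Files.delete(dir.resolve("_9.cfe"));
        System.out.println("first unreadable file: " + firstUnreadable(dir, snapshot));
    }
}
```

Any code path that lists files first and reads them later has this window unless the files are protected from deletion for the duration of the read.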

Reproduce

Run the test below repeatedly until the failure occurs:

    public void testDropPrimaryDuringReplication() throws Exception {
        final Settings settings = Settings.builder()
            .put(indexSettings())
            .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 6)
            .put(IndexMetadata.SETTING_REPLICATION_TYPE, ReplicationType.SEGMENT)
            .build();
        final String clusterManagerNode = internalCluster().startClusterManagerOnlyNode();
        final String primaryNode = internalCluster().startDataOnlyNode(Settings.EMPTY);
        createIndex(INDEX_NAME, settings);
        internalCluster().startDataOnlyNodes(6);

        int initialDocCount = scaledRandomIntBetween(100, 200);
        try (
            BackgroundIndexer indexer = new BackgroundIndexer(
                INDEX_NAME,
                "_doc",
                client(),
                -1,
                RandomizedTest.scaledRandomIntBetween(2, 5),
                false,
                random()
            )
        ) {
            indexer.start(initialDocCount);
            waitForDocs(initialDocCount, indexer);
            refresh(INDEX_NAME);
            // don't wait for replication to complete, stop the primary immediately.
            internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primaryNode));
            ensureYellow(INDEX_NAME);

            // start another replica.
            internalCluster().startDataOnlyNode();
            ensureGreen(INDEX_NAME);

            // index another doc and refresh - without this the new replica won't catch up.
            client().prepareIndex(INDEX_NAME).setId("1").setSource("foo", "bar").get();

            flushAndRefresh(INDEX_NAME);
            waitForReplicaUpdate();
            assertSegmentStats(6);
        }
    }

Note: This is different from https://github.com/opensearch-project/OpenSearch/issues/4178, where the FileNotFoundException happens due to a missing segments_N file.

mch2 commented 2 years ago

Closing - this was fixed with https://github.com/opensearch-project/OpenSearch/pull/4366