opensearch-project / opensearch-migrations

All things migrations and upgrades for OpenSearch
Apache License 2.0
34 stars 27 forks source link

[BUG]Replayer thread hanging while replay from inputStream #536

Open ParvelAWS opened 6 months ago

ParvelAWS commented 6 months ago

What is the bug?

Replayer is correctly consuming an input file stream consists of protobuf messages. Given parameters: --insecure --sigv4-auth-header-service-region "es,us-east-1" -i /tmp/test.pb https://some-os-domain

After played half of the messages in file, the thread hang with: 2024-03-19 21:31:40,280: 200, 200, -19158, 23, 72757efffe6bc65f-00000007-0000fe53-dd7acce1a35b4bf4-2c14bc37.2 [INFO ] 2024-03-19 21:31:40,280 [targetConnectionPool-1-2] TransactionSummaryLogger - 200, 200, -19158, 23, 72757efffe6bc65f-00000007-0000fe53-dd7acce1a35b4bf4-2c14bc37.2 [WARN ] 2024-03-19 21:31:40,289 [targetConnectionPool-1-2] RequestSenderOrchestrator - Scheduled future did not successfully run IndexedChannelInteraction(channelKey=1cc9f27e-0261-44ec-ad0b-437004851862.72757efffe6bc65f-00000007-0000fe53-dd7acce1a35b4bf4-2c14bc37.1, index=2) java.lang.IllegalStateException: stream has already been operated upon or closed at java.base/java.util.stream.AbstractPipeline.spliterator(AbstractPipeline.java:346) ~[?:?] at java.base/java.util.stream.ReferencePipeline.iterator(ReferencePipeline.java:143) ~[?:?] at org.opensearch.migrations.replay.RequestSenderOrchestrator.lambda$scheduleSendOnConnectionReplaySession$40(RequestSenderOrchestrator.java:220) ~[trafficReplayer.jar:?] at org.opensearch.migrations.replay.RequestSenderOrchestrator.lambda$scheduleSendOnConnectionReplaySession$41(RequestSenderOrchestrator.java:228) ~[trafficReplayer.jar:?] at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566) [netty-transport-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.100.Final.jar:4.1.100.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.100.Final.jar:4.1.100.Final] at java.base/java.lang.Thread.run(Thread.java:840) [?:?]

How can one reproduce the bug?

Replay from input file attached. test.pb.zip

What is the expected behavior?

The replayer finished playing all messages in file and exit.

What is your host/environment?

EKS Pod, JDK 17 with AWS Linux 2023

Do you have any screenshots?

No

Do you have any additional context?

Add any other context about the problem.

gregschohn commented 6 months ago

Could you please let me know what release/commit you were running when you experienced this?

gregschohn commented 6 months ago

While I don't see the same stream re-use exception on mainline right now, I do notice that there's pending outstanding work for 72757efffe6bc65f-00000007-0000fe53-dd7acce1a35b4bf4-2c14bc37. The timestamps on the observations of that stream aren't chronological, which causes an exception to fire if -ea is being used and will cause other strange issues when it isn't. Here's the summary of the TrafficStream, notice that these observations are all from one record. My guess is that the clock on the computer may have been adjusted backward, creating the anomaly.

TrafficStream Summary: {72757efffe6bc65f-00000007-0000fe53-dd7acce1a35b4bf4-2c14bc37 (#1)[
2024-03-19T15:49:24.003514406Z: WRITE HTT, 
2024-03-19T15:49:53.919982284Z: READ POS, 
2024-03-19T15:49:53.919982284Z: EOM, 
2024-03-19T15:49:56.407839481Z: READ POS, 
2024-03-19T15:49:56.407839481Z: EOM, 
2024-03-19T15:49:37.248713992Z: WRITE HTT, 
2024-03-19T15:49:37.248951099Z: WRITE 0, 
2024-03-19T15:49:57.257858646Z: READ POS, 
2024-03-19T15:49:57.257858646Z: EOM, 
2024-03-19T15:48:58.492549519Z: READ GET, 
2024-03-19T15:48:58.492549519Z: EOM, 
2024-03-19T15:48:58.795272298Z: WRITE HTT, 
2024-03-19T15:48:58.795332286Z: WRITE 0, 
2024-03-19T15:49:58.797475961Z: CLOSE]}
ParvelAWS commented 6 months ago

Sorry the commit i was using is a bit out-of-sync. commit 63ccf853ed2d16f251c78d9a8897c0907477eaca

I cannot reproduce this anymore. It may be related to the target domain availability as well.

Could you please let me know what release/commit you were running when you experienced this?

AndreKurait commented 6 months ago

I've encountered this

[WARN ] 2024-04-01 20:29:40,756 [targetConnectionPool-3-1] RequestSenderOrchestrator - Scheduled future did not successfully run IndexedChannelInteraction(channelKey=d9f9d3ac-a02c-46aa-b842-f67320505052.027e57fffee80927-0000000a-00001185-649b0a3562694922-98a320e2.1143\|partition=0\|offset=295150, index=7423)
--
java.lang.IllegalStateException: stream has already been operated upon or closed
at java.base/java.util.stream.AbstractPipeline.spliterator(Unknown Source) ~[?:?]
at java.base/java.util.stream.ReferencePipeline.iterator(Unknown Source) ~[?:?]
at org.opensearch.migrations.replay.RequestSenderOrchestrator.lambda$scheduleSendRequestOnConnectionReplaySession$17(RequestSenderOrchestrator.java:149) ~[trafficReplayer.jar:?]
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.100.Final.jar:4.1.100.Final]
at java.base/java.lang.Thread.run(Unknown Source) [?:?]