MonoVertex pods often unhealthy in e2e test

juliev0 commented 1 month ago

Describe the bug

I'm not sure that this is an issue on our side, but it would be worth investigating. Ultimately, it could be something to hand over to the Numaflow team after some analysis on our side.

I was seeing that the MonoVertex pod was in a crash loop at the very end of the e2e test. I'm not sure if it's consistent or not, but I've seen it more than once. (Perhaps it's okay and it eventually fixes itself?)

This is the CI log from the test I ran locally: ci.log.txt

These are the outputs from the tests/e2e/outputs directory: output.zip

If you look at outputs/resources/monovertexrollouts/pods you can see many Pods in there, which seems to indicate that the Pods restarted a lot.

To Reproduce

Steps to reproduce the behavior:

DATA_LOSS_PREVENTION=true make start
DATA_LOSS_PREVENTION=true make test-e2e

I assume this also happens for DATA_LOSS_PREVENTION=false, but I didn't try it.

Message from the maintainers:

Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.

juliev0 commented 1 month ago

Hey @chandankumar4 - I unassigned this from you. Instead, I'll try running it again and since Sidhant has now run our e2e himself locally, he could be the one to look at it if it's occurring.

juliev0 commented 1 month ago

Just re-ran this locally. It's after we update the MonoVertexRollout that the MonoVertex goes into a crash loop with this error:

jvogelman@macos-VF3V14X2QJ controller % k logs test-monovertex-rollout-mv-0-x4p8j  
2024-10-15T16:03:04.621585Z  INFO monovertex::server_info: Server info file: ServerInfo { protocol: "uds", language: "java", minimum_numaflow_version: "", version: "0.6.0", metadata: Some({}) }
2024-10-15T16:03:04.623577Z  INFO monovertex::server_info: Version_info: VersionInfo { version: "latest+unknown", build_date: "1970-01-01T00:00:00Z", git_commit: "", git_tag: "", git_tree_state: "", go_version: "unknown", compiler: "", platform: "linux/x86_64" }
2024-10-15T16:03:04.623761Z  WARN monovertex::server_info: Failed to get the minimum numaflow version, skipping numaflow version compatibility check
2024-10-15T16:03:04.625997Z  WARN monovertex::startup: Error waiting for source server info file: ServerInfoError("SDK version 0.6.0 must be upgraded to at least 0.8.0, in order to work with the current numaflow version")
2024-10-15T16:03:04.626288Z ERROR monovertex: Application error: ForwarderError("Error waiting for server info file")
2024-10-15T16:03:04.626458Z  INFO monovertex: Gracefully Exiting...
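
For context, the failure in the log looks like a minimum-SDK-version gate: the server info reports language "java" at version "0.6.0", and the error says at least 0.8.0 is required. Below is a minimal sketch of that kind of check, not Numaflow's actual code. The names `MIN_SDK_VERSIONS`, `parse_version`, and `check_sdk_version` are hypothetical, and the only minimum listed is the one quoted in the error message.

```python
# Hypothetical sketch of a minimum-SDK-version check; not Numaflow's code.
# The single entry below is taken from the error message in the log above.
MIN_SDK_VERSIONS = {"java": "0.8.0"}


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a plain 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def check_sdk_version(language: str, sdk_version: str) -> None:
    """Raise ValueError if sdk_version is below the minimum for language."""
    minimum = MIN_SDK_VERSIONS.get(language)
    if minimum is None:
        return  # no known constraint for this language in this sketch
    if parse_version(sdk_version) < parse_version(minimum):
        raise ValueError(
            f"SDK version {sdk_version} must be upgraded to at least {minimum}"
        )


# Values from the ServerInfo line in the log: language "java", version "0.6.0".
try:
    check_sdk_version("java", "0.6.0")
except ValueError as err:
    print(f"startup would fail: {err}")
```

If the check works roughly like this, every restart of the pod hits the same error, which would explain the crash loop: the container image itself carries the old SDK, so restarting cannot fix it.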

juliev0 commented 1 month ago

Hey @dpadhiar - not super high priority, but would be good to fix the e2e test so that after updating MonoVertexRollout, the MonoVertex Pod is not in a crash loop (see log above)

dpadhiar commented 1 month ago

> Hey @dpadhiar - not super high priority, but would be good to fix the e2e test so that after updating MonoVertexRollout, the MonoVertex Pod is not in a crash loop (see log above)

I see. It looks like the version I switch to in the update (from stable to 0.6.0) causes the issue. I'll change that soon.

numaproj / numaplane

MonoVertex pods often unhealthy in e2e test #314