numaproj / numaplane

Control Plane for Numaproj
Apache License 2.0

MonoVertex pods often unhealthy in e2e test #314

Closed: juliev0 closed this issue 3 days ago

juliev0 commented 2 weeks ago

Describe the bug

I'm not sure this is an issue on our side, but it would be worth investigating. Ultimately, it could be something to hand over to the Numaflow team after some analysis on our side.

I saw the MonoVertex Pod in a crash loop at the very end of the e2e test. I'm not sure whether it's consistent, but I've seen it more than once. (Perhaps it's okay and it eventually recovers on its own?)

This is the CI log from the test I ran locally: ci.log.txt
These are the outputs from the tests/e2e/outputs directory: output.zip

If you look at outputs/resources/monovertexrollouts/pods you can see many Pods in there, which seems to indicate that the Pods restarted a lot.

To Reproduce

Steps to reproduce the behavior:

  1. DATA_LOSS_PREVENTION=true make start
  2. DATA_LOSS_PREVENTION=true make test-e2e

I assume this also happens for DATA_LOSS_PREVENTION=false, but I didn't try it.

juliev0 commented 3 days ago

Hey @chandankumar4 - I unassigned this from you. Instead, I'll try running it again, and since Sidhant has now run our e2e tests locally himself, he could look into it if it's still occurring.

juliev0 commented 3 days ago

Just re-ran this locally. After we update the MonoVertexRollout, the MonoVertex is in a crash loop with this error:

jvogelman@macos-VF3V14X2QJ controller % k logs test-monovertex-rollout-mv-0-x4p8j  
2024-10-15T16:03:04.621585Z  INFO monovertex::server_info: Server info file: ServerInfo { protocol: "uds", language: "java", minimum_numaflow_version: "", version: "0.6.0", metadata: Some({}) }
2024-10-15T16:03:04.623577Z  INFO monovertex::server_info: Version_info: VersionInfo { version: "latest+unknown", build_date: "1970-01-01T00:00:00Z", git_commit: "", git_tag: "", git_tree_state: "", go_version: "unknown", compiler: "", platform: "linux/x86_64" }
2024-10-15T16:03:04.623761Z  WARN monovertex::server_info: Failed to get the minimum numaflow version, skipping numaflow version compatibility check
2024-10-15T16:03:04.625997Z  WARN monovertex::startup: Error waiting for source server info file: ServerInfoError("SDK version 0.6.0 must be upgraded to at least 0.8.0, in order to work with the current numaflow version")
2024-10-15T16:03:04.626288Z ERROR monovertex: Application error: ForwarderError("Error waiting for server info file")
2024-10-15T16:03:04.626458Z  INFO monovertex: Gracefully Exiting...
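
The key line in the log above is the WARN from monovertex::startup: the Java source SDK reports version 0.6.0, while the numaflow build in use requires at least 0.8.0, so the MonoVertex exits and the Pod restarts. (The separate numaflow-version check is skipped because minimum_numaflow_version is empty.) Below is a minimal sketch of this kind of minimum-version gate, for illustration only. The function names are hypothetical and this is not numaflow's actual implementation; it also assumes plain MAJOR.MINOR.PATCH strings and ignores pre-release or build suffixes.

```python
from typing import Optional, Tuple


def parse_version(v: str) -> Tuple[int, int, int]:
    """Parse a plain 'MAJOR.MINOR.PATCH' string into an int tuple.

    Hypothetical helper: pre-release/build suffixes are not handled.
    """
    major, minor, patch = (int(p) for p in v.strip().lstrip("v").split(".")[:3])
    return major, minor, patch


def check_sdk_version(sdk_version: str, min_sdk_version: Optional[str]) -> Optional[str]:
    """Return an error message if the SDK is too old, else None.

    Hypothetical sketch: if no minimum is known, the check is skipped,
    similar in spirit to the 'skipping ... compatibility check' warning.
    """
    if not min_sdk_version:
        return None
    # Compare as integer tuples, not strings, so "0.10.0" > "0.8.0" holds.
    if parse_version(sdk_version) < parse_version(min_sdk_version):
        return f"SDK version {sdk_version} must be upgraded to at least {min_sdk_version}"
    return None


if __name__ == "__main__":
    print(check_sdk_version("0.6.0", "0.8.0"))  # SDK version 0.6.0 must be upgraded to at least 0.8.0
    print(check_sdk_version("0.8.0", "0.8.0"))  # None
```

Comparing parsed integer tuples rather than raw strings matters here: a string comparison would wrongly rank "0.10.0" below "0.8.0".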

juliev0 commented 3 days ago

Hey @dpadhiar - not super high priority, but it would be good to fix the e2e test so that after the MonoVertexRollout is updated, the MonoVertex Pod is not in a crash loop (see log above).

dpadhiar commented 3 days ago

I see. It looks like the version I changed the upgrade to (from stable to 0.6.0) causes the issue. I'll change that soon.