infinyon / fluvio

Lean and mean distributed stream processing system written in rust and web assembly. Alternative to Kafka + Flink in one.
https://www.fluvio.io/
Apache License 2.0
3.79k stars 503 forks source link

[Bug]: `fluvio cluster resume` stuck in `trying to connect to SC` #3989

Closed ajhunyady closed 1 month ago

ajhunyady commented 4 months ago

I shutdown the machine without closing down the cluster. After restart, I just fluvio cluster resume as instructed by the message in fluvio cluster start, but it is stuck in a loop:

 % fluvio cluster start
πŸ“ Running pre-flight checks
    βœ… Local Fluvio is not installed
    ❌ Check Clean Fluvio Local Installation failed Local Fluvio cluster wasn't deleted. Use 'resume' to resume created cluster or 'delete' before startingπŸ’” Some pre-flight check failed!
Preflight check failed

 % fluvio cluster resume
βœ… Local Fluvio is not installed
πŸŽ‰ All checks passed!
βœ… Local Cluster initialized
πŸ‘€ Profile set
πŸ–₯️  Trying to connect to SC: 127.0.0.1:9003 94 seconds elapsed 

To remediate this, I needed to delete the cluster and create a new one. I also got an error during delete:

% fluvio cluster delete
SPU monitoring socket, can't be removed: No such file or directory (os error 2)
Uninstalled fluvio local components

Environment: Using Mac with Intel CPU.

Note: The message flashes continuously while the spinner updates.

ajhunyady commented 4 months ago

@avikam can you please take a look at this one?

avikam commented 4 months ago

@ajhunyady can you reproduce it with RUST_LOG=debug? specifically, I'd like to know if it happened because of a version discrepancy.

For example, if you had fluvio running, and then restarted your machine and cargo built master before resuming, you might see this:

2024-05-07T21:00:15.505156Z  WARN ... fluvio_cluster::start::common: Current Version 0.11.8-dev is not same as expected: 0.11.7-dev

edit - more context: this may happen because the platform version is loaded from the local-config file, as it's a value that is deduced from the CLI. The start/resume commands verify that the version is as expected. When it's not, the SPU verification fails despite a successful connection, and loops infinitely. I think it's worth addressing it, that is, allow you to be platform version agnostic in the local case (and have a proper warning, instead of an infinite loop) - but I do want to understand first if that is the thing that happened to you

sehz commented 4 months ago

restart logic should check version number and warn user of incompatibility

ajhunyady commented 4 months ago

@avikam the debug log is attached. debug.log

avikam commented 4 months ago

indeed it's the same issue. your log shows:

2024-05-07T13:55:39.967622Z  WARN install_only:launch_sc{host_name="127.0.0.1:9003" port=9003 pb=Indicatiff(ProgressBar)}:try_connect_to_sc{config=FluvioConfig { endpoint: "127.0.0.1:9003", use_spu_local_address: false, tls: Disabled, metadata: {}, client_id: None } platform_version=Version { major: 0, minor: 11, patch: 6 } pb=Indicatiff(ProgressBar)}: fluvio_cluster::start::common: Current Version 0.11.7 is not same as expected: 0.11.6

I would agree with @sehz the solution show be showing a warning to the user, and proceeding with the start/resume process

ajhunyady commented 4 months ago

I actually tried both variants.

% fluvio cluster resume
 βœ… Local Fluvio is not installed
πŸŽ‰ All checks passed!
βœ… Local Cluster initialized
πŸ‘€ Profile set
πŸ–₯️  Trying to connect to SC: 127.0.0.1:9003 19 seconds elapsed /                                                                                            
^C

% fluvio cluster start
πŸ“ Running pre-flight checks
    βœ… Local Fluvio is not installed
    ❌ Check Clean Fluvio Local Installation failed Local Fluvio cluster wasn't deleted. Use 'resume' to resume created cluster or 'delete' before starting
    πŸ’” Some pre-flight check failed!
Preflight check failed

I was forced to delete to recover.

ajhunyady commented 4 months ago

Ok, I think I understand what's going on. We are currently updating smartmodules, so I'm constantly upgrading & downgrading to test. Since the smart engine changed, I must restart the cluster every time.

Now I realize that we may actually need a fluvio cluster upgrade command that should do whatever is needed to ensure the new version is compatible & permitted. In this case, nothing metadata-related changed, so the upgrade should ensure that the "previously saved cluster" accepts the new version. In other cases, a proper upgrade may be needed, but the upgrade translation should be handled by the developer who changed the metadata.

The current fluvio cluster upgrade only handles K8.

@sehz @avikam
Did I get this right?

sehz commented 4 months ago

Yup. restart should check to ensure nothing changed. if does, should ask user to perform upgrade command

avikam commented 4 months ago

Iv'e been thinking to address it in two PRs:

  1. stopgap: add a pre-check to the resume command to verify the versions match. This will remediate the bad user experience of looping forever.
  2. refactor upgrade, so that in the local installation type, instead of shut-down and start (which is currently broken, because we must use "resume"), we will update the platform-version only and resume the cluster.

an alternative for (2) is to remove the entire local-config file before starting the cluster. However, we'd lose configuration like number of SPUs, TLS policy, logs, etc. @ajhunyady , @digikata does that sound reasonable?

sehz commented 4 months ago

Idea of having local-config is sound (although we have something similar in K8 config, so they should be consistent or converge).

First should have anyway as required pre-flight check.

Then figure out process of upgrade as this needs to be common across multiple cluster types

digikata commented 4 months ago
  1. definitely sounds good as a stopgap.

  2. sounds like good mechanics for an upgrade of the local cluster, and we can also figure out if there is a commonality between a local start upgrade and k8.

avikam commented 4 months ago

here is 1 before bats: https://github.com/infinyon/fluvio/pull/3999

seems like there is an alignment so I'll keep pursuing that

digikata commented 1 month ago

fixed w/ PR #4138