Closed ajhunyady closed 1 month ago
@avikam can you please take a look at this one?
@ajhunyady can you reproduce it with RUST_LOG=debug? specifically, I'd like to know if it happened because of a version discrepancy.
For example, if you had fluvio running, and then restarted your machine and cargo built master before resuming, you might see this:
2024-05-07T21:00:15.505156Z WARN ... fluvio_cluster::start::common: Current Version 0.11.8-dev is not same as expected: 0.11.7-dev
edit - more context: this may happen because the platform version is loaded from the local-config file, as it's a value that is deduced from the CLI. The start/resume commands verify that the version is as expected. When it's not, the SPU verification fails despite a successful connection, and loops infinitely. I think it's worth addressing it, that is, allow you to be platform version agnostic in the local case (and have a proper warning, instead of an infinite loop) - but I do want to understand first if that is the thing that happened to you
restart logic should check version number and warn user of incompatibility
indeed it's the same issue. your log shows:
2024-05-07T13:55:39.967622Z WARN install_only:launch_sc{host_name="127.0.0.1:9003" port=9003 pb=Indicatiff(ProgressBar)}:try_connect_to_sc{config=FluvioConfig { endpoint: "127.0.0.1:9003", use_spu_local_address: false, tls: Disabled, metadata: {}, client_id: None } platform_version=Version { major: 0, minor: 11, patch: 6 } pb=Indicatiff(ProgressBar)}: fluvio_cluster::start::common: Current Version 0.11.7 is not same as expected: 0.11.6
I would agree with @sehz the solution show be showing a warning to the user, and proceeding with the start/resume process
I actually tried both variants.
% fluvio cluster resume
β
Local Fluvio is not installed
π All checks passed!
β
Local Cluster initialized
π€ Profile set
π₯οΈ Trying to connect to SC: 127.0.0.1:9003 19 seconds elapsed /
^C
% fluvio cluster start
π Running pre-flight checks
β
Local Fluvio is not installed
β Check Clean Fluvio Local Installation failed Local Fluvio cluster wasn't deleted. Use 'resume' to resume created cluster or 'delete' before starting
π Some pre-flight check failed!
Preflight check failed
I was forced to delete to recover.
Ok, I think I understand what's going on. We are currently updating smartmodules, so I'm constantly upgrading & downgrading to test. Since the smart engine changed, I must restart the cluster every time.
Now I realize that we may actually need a fluvio cluster upgrade
command that should do whatever is needed to ensure the new version is compatible & permitted. In this case, nothing metadata-related changed, so the upgrade should ensure that the "previously saved cluster" accepts the new version. In other cases, a proper upgrade may be needed, but the upgrade translation should be handled by the developer who changed the metadata.
The current fluvio cluster upgrade
only handles K8.
@sehz @avikam
Did I get this right?
Yup. restart
should check to ensure nothing changed. if does, should ask user to perform upgrade
command
Iv'e been thinking to address it in two PRs:
an alternative for (2) is to remove the entire local-config file before starting the cluster. However, we'd lose configuration like number of SPUs, TLS policy, logs, etc. @ajhunyady , @digikata does that sound reasonable?
Idea of having local-config
is sound (although we have something similar in K8 config, so they should be consistent or converge).
First should have anyway as required pre-flight check.
Then figure out process of upgrade as this needs to be common across multiple cluster types
definitely sounds good as a stopgap.
sounds like good mechanics for an upgrade of the local cluster, and we can also figure out if there is a commonality between a local start upgrade and k8.
here is 1 before bats
:
https://github.com/infinyon/fluvio/pull/3999
seems like there is an alignment so I'll keep pursuing that
fixed w/ PR #4138
I shutdown the machine without closing down the cluster. After restart, I just
fluvio cluster resume
as instructed by the message influvio cluster start
, but it is stuck in a loop:To remediate this, I needed to delete the cluster and create a new one. I also got an error during delete:
Environment: Using Mac with Intel CPU.
Note
: The message flashes continuously while the spinner updates.