stackabletech / hdfs-operator

Apache Hadoop HDFS operator for Stackable

Support [rolling] upgrade of HDFS #362

Closed sbernauer closed 2 months ago

sbernauer commented 1 year ago

As of 23.4, when you upgrade your HDFS, e.g. 3.2.2 -> 3.3.4, you run into the error

2023-06-15 06:21:23,060 ERROR namenode.NameNode (NameNode.java:main(1839)) - Failed to start namenode.                                                                                 
java.io.IOException:                                                                                                                                                                   
File system image contains an old layout version -65.                                                                                                                                  
An upgrade to version -66 is required.                                                                                                                                                 
Please restart NameNode with the "-rollingUpgrade started" option if a rolling upgrade is already started; or restart NameNode with the "-upgrade" option to start a new upgrade.

Ideally we should start a rolling upgrade of all components. Currently you simply cannot upgrade your HDFS without hacking around the operator (e.g. there are no cliOverrides to add -upgrade or similar).

Edit: at least the upgrade 3.3.4 -> 3.3.6 worked.

nightkr commented 3 months ago

So, looking at https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html and testing locally, it looks like the sequence is roughly:

  1. Run hdfs dfsadmin -rollingUpgrade prepare; this uses Hadoop RPC and can be executed from anywhere
  2. Poll for this to complete. The upstream docs suggest hdfs dfsadmin -rollingUpgrade query, but that is not suitable for machine consumption (unparseable output, constant status code). The status can be queried over JMX and REST instead, e.g. curl $NAMENODE_HTTP_URL/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo | jq '.beans[0].RollingUpgradeStatus.createdRollbackImages' (see the polling sketch after this list), but we could also (maybe) use the internal Java API
  3. Upgrade JournalNodes, wait for upgrade to complete
  4. Upgrade NameNodes with -rollingUpgrade started
  5. Upgrade DataNodes
  6. Run hdfs dfsadmin -rollingUpgrade finalize; this also uses Hadoop RPC and can be executed from anywhere
  7. Restart NameNodes without the -rollingUpgrade flag
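
To make step 2 concrete, a minimal polling sketch against the JMX endpoint above could look roughly like this (not the operator's actual code; it assumes the reqwest and serde_json crates and a placeholder NameNode address):

```rust
// Minimal polling sketch, not the operator's actual code. Assumes the reqwest
// crate (with the "blocking" and "json" features) and serde_json.
use std::{thread, time::Duration};

/// Returns true once the NameNode reports that the rollback fsimage has been
/// created, i.e. `-rollingUpgrade prepare` has finished.
fn rollback_image_created(namenode_http_url: &str) -> Result<bool, Box<dyn std::error::Error>> {
    let url = format!("{namenode_http_url}/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo");
    let body: serde_json::Value = reqwest::blocking::get(url)?.json()?;
    // RollingUpgradeStatus is only present while a rolling upgrade is in progress,
    // so default to false if the field is missing.
    Ok(body["beans"][0]["RollingUpgradeStatus"]["createdRollbackImages"]
        .as_bool()
        .unwrap_or(false))
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical address; substitute the real NameNode HTTP endpoint.
    let namenode = "http://simple-hdfs-namenode-default-0:9870";
    while !rollback_image_created(namenode)? {
        thread::sleep(Duration::from_secs(10));
    }
    println!("Rollback image created, safe to start upgrading JournalNodes");
    Ok(())
}
```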

We need to detect whether to enter "upgrade mode". We could do that by storing a .status.deployedProductVersion in the CRD, which we set if unset (for new deployments) or once step 6 is complete. We are then in upgrade mode if .status.deployedProductVersion != .spec.image.productVersion.
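
As a rough illustration (the field and type names below are made up for the sketch, not the actual CRD structs), the check would boil down to something like:

```rust
// Rough illustration only; field and type names are made up, not the actual CRD structs.
#[derive(Default)]
struct HdfsClusterStatus {
    /// The product version that was last fully deployed. Set on first deployment,
    /// and updated again once `-rollingUpgrade finalize` (step 6) has run.
    deployed_product_version: Option<String>,
}

struct HdfsCluster {
    /// Mirrors .spec.image.productVersion
    spec_product_version: String,
    status: HdfsClusterStatus,
}

impl HdfsCluster {
    fn is_upgrading(&self) -> bool {
        match &self.status.deployed_product_version {
            // Unset status means a fresh deployment; the operator would adopt the spec version.
            None => false,
            Some(deployed) => deployed != &self.spec_product_version,
        }
    }
}

fn main() {
    let cluster = HdfsCluster {
        spec_product_version: "3.3.6".to_string(),
        status: HdfsClusterStatus {
            deployed_product_version: Some("3.3.4".to_string()),
        },
    };
    assert!(cluster.is_upgrading());
}
```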

Steps 3-5 could be done by adding a check to the end of the STS apply: if in_rolling_upgrade && !ready { exit_reconcile() }, where ready = .metadata.generation == .status.observedGeneration && .status.availableReplicas == .status.updatedReplicas == .spec.replicas.
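
A sketch of that readiness check against the k8s-openapi StatefulSet types (illustrative, not an actual operator helper) could look like:

```rust
// Illustrative readiness check, assuming the k8s-openapi crate; not the operator's
// actual helper.
use k8s_openapi::api::apps::v1::StatefulSet;

/// The StatefulSet counts as rolled out once the controller has observed the
/// latest generation and all replicas are both updated and available.
fn sts_rolled_out(sts: &StatefulSet) -> bool {
    let Some(status) = &sts.status else { return false };
    let spec_replicas = sts.spec.as_ref().and_then(|spec| spec.replicas).unwrap_or(1);
    sts.metadata.generation == status.observed_generation
        && status.available_replicas.unwrap_or(0) == spec_replicas
        && status.updated_replicas.unwrap_or(0) == spec_replicas
}
```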

Step 7 would be simple enough; it happens by leaving "upgrade mode".

Steps 1/2/6 are the big question marks. We could run them from the operator container, exec into an existing namenode Pod, or spawn a dedicated Job. Generally, running from the operator seems like a poor idea, both because we'd need to bundle an HDFS client + JVM and because the operators don't have Kerberos identities (we still need to look into how JMX is affected by this, too). Running as a Job means that we don't rely on picking a single "admin namenode", but it creates another asynchronous lifecycle for us to manage.
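
For reference, a dedicated Job for step 1 could be as small as something like the following sketch built with k8s-openapi types; the Job name, image and command here are assumptions, and the HDFS configuration / Kerberos credentials the container would need are left out:

```rust
// Illustrative only: roughly what a one-shot Job for step 1 could look like, built
// with k8s-openapi types. The Job name, image, and command are assumptions, and the
// HDFS config / Kerberos credential mounts the container would also need are omitted.
// serde_json is only used for the demo print at the bottom.
use k8s_openapi::api::batch::v1::{Job, JobSpec};
use k8s_openapi::api::core::v1::{Container, PodSpec, PodTemplateSpec};
use k8s_openapi::apimachinery::pkg::apis::meta::v1::ObjectMeta;

fn rolling_upgrade_prepare_job() -> Job {
    Job {
        metadata: ObjectMeta {
            name: Some("hdfs-rolling-upgrade-prepare".to_string()),
            ..Default::default()
        },
        spec: Some(JobSpec {
            template: PodTemplateSpec {
                spec: Some(PodSpec {
                    restart_policy: Some("Never".to_string()),
                    containers: vec![Container {
                        name: "dfsadmin".to_string(),
                        // Hypothetical image reference.
                        image: Some("hadoop:3.3.6".to_string()),
                        command: Some(vec![
                            "hdfs".to_string(),
                            "dfsadmin".to_string(),
                            "-rollingUpgrade".to_string(),
                            "prepare".to_string(),
                        ]),
                        ..Default::default()
                    }],
                    ..Default::default()
                }),
                ..Default::default()
            },
            ..Default::default()
        }),
        ..Default::default()
    }
}

fn main() {
    // In the operator this would be applied to the cluster and then watched for
    // completion; here we just serialize it to confirm the shape.
    println!("{}", serde_json::to_string_pretty(&rolling_upgrade_prepare_job()).unwrap());
}
```

The asynchronous part is then watching the Job for completion (and cleaning it up) before moving on to step 2, which is exactly the extra lifecycle mentioned above.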

nightkr commented 3 months ago

Another, more MVP-like option would be to only add an override to do steps 3-5, leaving the dfsadmin steps (1/2/6) to be run manually.

NickLarsenNZ commented 3 months ago

I was hoping it was as simple as an init container, but it looks like there is some choreography involved (with the "wait for" steps).

I think there should be a "do it for me automatically" option, but if there is some clear risk to that, then it should be opt-in (e.g. demos can opt in, customers might be more cautious).

Can something in stackablectl here help with said choreography to make the manual steps less of a burden?

nightkr commented 3 months ago

I think there should be a "do it for me automatically" option, but if there is some clear risk to that, then it should be opt-in (e.g. demos can opt in, customers might be more cautious).

Ultimately, all database upgrades (which is what this is) are risky. I agree that it might make sense to have a safeguard, but we should probably think about that as a platform-wide decision then.

Can something in stackablectl here help with said choreography to make the manual steps less of a burden?

I don't think it'd make much sense. Steps 3-5 come down to updating the StatefulSets in order, which is managed entirely by the operator. Steps 1/2/6 wouldn't be easier for stackablectl to do than for the operator.

Stackablectl also generally isn't really responsible for modifying stacklets at the moment, and I'd be sad to see that change.

NickLarsenNZ commented 3 months ago

Ah yeah, that makes it more clear.

Ultimately, all database upgrades (which is what this is) are risky.

Sure, but operational tasks can be codified (assuming there are checks at each step to prove it is safe to proceed with the next) and IMO this is what Operators are for. The problem could probably be modeled sufficiently with a Finite State Machine.

Maybe it is a tall order to codify operations like this, but this should be the ultimate platform-wide goal.

nightkr commented 3 months ago

Sure, but operational tasks can be codified (assuming there are checks at each step to prove it is safe to proceed with the next) and IMO this is what Operators are for. The problem could probably be modeled sufficiently with a Finite State Machine.

I mean, yeah. I agree that I'd like to have as much as possible managed by the operator. I'm just not sure HDFS is special enough to warrant its own rules for when upgrades should be allowed.

lfrancke commented 2 months ago

Do we have documentation for this? If so, please link it here; if not, why not?

And can you please include a snippet that we can use for the release notes for this?

nightkr commented 2 months ago

Do we have documentation for this? If so, please link it here; if not, why not?

The docs are at https://docs.stackable.tech/home/nightly/hdfs/usage-guide/upgrading

And can you please include a snippet that we can use for the release notes for this?

I suppose, "- The Stackable Operator for HDFS now supports upgrading existing HDFS installations" or something like that.

lfrancke commented 2 months ago

Is the functionality specific to 3.3 -> 3.4? And we should include a sentence about this requiring manual work.

nightkr commented 3 weeks ago

Is the functionality specific to 3.3 -> 3.4?

No, the mechanism is generic. One caveat is that it currently takes the pessimistic approach of applying it to any upgrade, so 3.3.4 -> 3.3.6 would also trigger it.

And we should include a sentence about this requiring manual work.

Yeah that's a good point. Hm.

nightkr commented 3 weeks ago
  • The Stackable Operator for HDFS now supports upgrading existing HDFS installations. This process requires some manual intervention, however.