scylladb / scylla-machine-image

Apache License 2.0

AWS: add snapshot_group:all to builder config #467

Closed PhipBlitch-Arcadia closed 11 months ago

PhipBlitch-Arcadia commented 11 months ago

I’m having an issue copying the official Scylla AMIs in AWS. Attempting to copy fails with the error "You do not have permission to access the storage of this ami".

I found similar issues on the old scylla-ami repo [1], [2], as well as this description of why it happens.

The snapshot_groups parameter for the amazon-ebs builder was added to solve this. It takes:

> A list of groups that have access to create volumes from the snapshot(s). By default no groups have permission to create volumes from the snapshot(s). all will make the snapshot publicly accessible.

I believe adding "snapshot_groups": ["all"] to the aws builder config in scylla.json would solve this issue. Could this be done?
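For illustration, the proposed change can be sketched as a small Python script that patches a Packer template in memory. The template below is a hypothetical, heavily trimmed stand-in for scylla.json (the builder names and the `ami_groups` entry are assumptions, not the real file contents), and the helper name `add_public_snapshot_group` is made up for this sketch; only the `snapshot_groups` key and the `amazon-ebs` builder type come from the discussion above.

```python
import json


def add_public_snapshot_group(template: dict) -> dict:
    """Return a copy of a Packer template in which every amazon-ebs
    builder lists "all" under snapshot_groups."""
    # Deep copy via a JSON round trip so the input is left untouched.
    patched = json.loads(json.dumps(template))
    for builder in patched.get("builders", []):
        if builder.get("type") == "amazon-ebs":
            groups = builder.setdefault("snapshot_groups", [])
            if "all" not in groups:
                groups.append("all")
    return patched


# Hypothetical stand-in for scylla.json; the real file has many more keys.
example_template = {
    "builders": [
        {"type": "amazon-ebs", "name": "aws", "ami_groups": ["all"]},
        {"type": "googlecompute", "name": "gce"},
    ]
}

patched_template = add_public_snapshot_group(example_template)
print(json.dumps(patched_template["builders"][0], indent=2))
```

In practice the same effect is a one-line addition to the builder block in the template; the script form only makes explicit which builders are affected and which are left alone.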

fruch commented 11 months ago

Out of curiosity: we supply the images in almost all regions, so why do you need a copy?

I.e. what is the use case ?

PhipBlitch-Arcadia commented 11 months ago

@fruch That's fair. Copying is the simplest use case, and one that others have asked about before. I suspect in my case copying is unnecessary and I actually only need the snapshot to be accessible. For brevity's sake I didn't get into my tale of woe, but since you asked:

Our clusters were built using CentOS AMIs. Rather than manage a heterogenous mixture of Ubuntu and CentOS, OR incur the considerable cost to fully migrate, we start instances with the last CentOS AMI (4.4.8) and update them prior to bootstrapping.

Use case 1: The 4.4.8 AMI has a deprecation date of Dec 22 2023. Being unable to make a private copy of it means all public CentOS-based Scylla AMIs will disappear after that date.

Running scylla-machine-image to build a private image is an option, but CentOS 7 End of Life is only 6 months after that. I need a way to get our clusters migrated to Ubuntu. The obvious method is to replace one node at a time, streaming its replacement. That approach has several downsides: it would be very slow, cause a protracted period of very high load, involve nonzero risk turning over every node while on-line, and the data transfer costs would be enormous.

Use case 2: AWS has a process to replace the root volume of an instance on-line while retaining data in instance volumes. Today it fails without explanation with the Scylla AMIs, but I believe this change would allow it to succeed. If replacing a CentOS root volume with an Ubuntu one were that easy, a node would only require restoring the scylla.yaml, recreating the volume mount, and ensuring other image setup steps are completed before returning to service.

yaronkaikov commented 11 months ago

@PhipBlitch-Arcadia Are you asking to change this for CentOS-based AMIs?

fruch commented 11 months ago

@PhipBlitch-Arcadia thanks for sharing the information, we do want to hear those tales.

> @fruch That's fair. Copying is the simplest use case, that others had asked about before. I suspect in my case copying is unnecessary and I actually only need the snapshot to be accessible. For brevity's sake I didn't get into my tale of woe, but since you asked:

> Our clusters were built using CentOS AMIs. Rather than manage a heterogenous mixture of Ubuntu and CentOS, OR incur the considerable cost to fully migrate, we start instances with the last CentOS AMI (4.4.8) and update them prior to bootstrapping.

> Use case 1: The 4.4.8 AMI has a deprecation date of Dec 22 2023. Being unable to make a private copy of it means all public CentOS-based Scylla AMIs will disappear after that date.

> Running scylla-machine-image to build a private image is an option, but CentOS 7 End of Life is only 6 months after that. I need a way to get our clusters migrated to Ubuntu. The obvious method is to replace one node at a time, streaming its replacement. That approach has several downsides: it would be very slow, cause a protracted period of very high load, involve nonzero risk turning over every node while on-line, and the data transfer costs would be enormous.

@mykaul FYI, maybe a use case we should be thinking about (and testing) before AMIs are deleted.

> Use case 2: AWS has a process to replace the root volume of an instance on-line while retaining data in instance volumes. Today it fails without explanation with the Scylla AMIs, but I believe this change would allow it to succeed. If replacing a CentOS root volume with an Ubuntu one were that easy, a node would only require restoring the scylla.yaml, recreating the volume mount, and ensuring other image setup steps are completed before returning to service.

An interesting case. I can tell you for sure that we are not testing this kind of procedure, so access to copy the volume is probably the first of many things you might have to deal with.

would love to hear the end of this story :)

fruch commented 11 months ago

> @PhipBlitch-Arcadia Are you asking to change this for Centos based AMIs ?

@yaronkaikov according to what @PhipBlitch-Arcadia shared, I believe the request is for all AMIs, and also backwards to all released AMIs (use case no. 1).

I think in general it is a fair thing to ask for, and it shouldn't be complex to add (changing all releases is a bit more painful), even though we would prefer people running and upgrading to newer versions. If there are things on Scylla's end that prevent those upgrades, we would also like to know.

yaronkaikov commented 11 months ago

> @PhipBlitch-Arcadia Are you asking to change this for Centos based AMIs ?

> @yaronkaikov according to what @PhipBlitch-Arcadia shared, I believe the request is for all AMIs, and also backwards to all released AMIs (use case no. 1).

> I think in general it is a fair thing to ask for, and it shouldn't be complex to add (changing all releases is a bit more painful), even though we would prefer people running and upgrading to newer versions. If there are things on Scylla's end that prevent those upgrades, we would also like to know.

@fruch AMI 4.4.8 as mentioned, is no longer available, so for sure, we can't do anything there. For future releases, I guess it's possible.

mykaul commented 11 months ago

@PhipBlitch-Arcadia - in Scylla cloud, we always use 'streaming to its replacement' method as you've specified.

PhipBlitch-Arcadia commented 11 months ago

> Are you asking to change this for Centos based AMIs ?

> @yaronkaikov according to what @PhipBlitch-Arcadia shared, I believe the request is for all AMIs, and also backwards to all released AMIs (use case no. 1).

> I think in general it is a fair thing to ask for, and it shouldn't be complex to add (changing all releases is a bit more painful), even though we would prefer people running and upgrading to newer versions. If there are things on Scylla's end that prevent those upgrades, we would also like to know.

@fruch Just having this change going forward in the future released 5.1 and 5.2 images would be a big help, but it would also definitely be helpful to have the retroactive change. If doing everything is too much, just extending the deprecation date or allowing copying on 4.4.8 would mean less pressure on us before December.

> @fruch AMI 4.4.8 as mentioned, is no longer available, so for sure, we can't do anything there. For future releases, I guess it's possible.

@yaronkaikov At least in us-east-1, 4.4.8 is still available as ami-0ec5eb1ff7b5d3ae5. It's the last I'm aware of that is CentOS-based.

PhipBlitch-Arcadia commented 11 months ago

> @PhipBlitch-Arcadia - in Scylla cloud, we always use 'streaming to its replacement' method as you've specified.

@mykaul That's what we do for single instance retirements. But all at once... 😰 How did Scylla cloud handle the transition to Ubuntu? Was there any kind of replacement effort or do both types coexist until they're replaced naturally?

fruch commented 11 months ago

> @PhipBlitch-Arcadia - in Scylla cloud, we always use 'streaming to its replacement' method as you've specified.

> @mykaul That's what we do for single instance retirements. But all at once... 😰 How did Scylla cloud handle the transition to Ubuntu? Was there any kind of replacement effort or do both types coexist until they're replaced naturally?

We regularly do upgrades on Scylla cloud. In most cases it's an in-place upgrade, using rpm/deb.

But adding a new node and decommissioning the older one is quite common as well, for lots of reasons. It's a bit slower to stream all the data around, and it can have a small effect on the cluster, but as long as the cluster isn't overloaded it should work fine.

Also, a mixed cluster with CentOS and Ubuntu should work in theory, even though we never test it.

mykaul commented 11 months ago

Considering that ~10% of AWS nodes fail yearly, we regularly replace nodes. And thus we stream.

I don't think we should support this use case.

PhipBlitch-Arcadia commented 11 months ago

@mykaul For a large fleet of Scylla clusters, there is a substantial difference between 10% in a year and 100% in a month.

I'm not asking for you to modify all your old AMIs or to support my specific use case. Those would be welcome, but my request was intended to be as small as possible: for future public AMIs to have their underlying snapshots also be public. I believe that could be accomplished by adding a directive to the amazon-ebs builder config.
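As a rough way to verify the requested behaviour after a build, the sketch below checks whether a snapshot grants create-volume permission to the `all` group. The pure function works on the response shape that boto3's `describe_snapshot_attribute` returns; the wrapper `check_snapshot` and its default region are illustrative assumptions, and it presumes the caller is allowed to describe that attribute (typically the snapshot owner).

```python
def snapshot_is_public(attribute_response: dict) -> bool:
    """True if the createVolumePermission attribute includes the 'all' group."""
    permissions = attribute_response.get("CreateVolumePermissions", [])
    return any(entry.get("Group") == "all" for entry in permissions)


def check_snapshot(snapshot_id: str, region: str = "us-east-1") -> bool:
    """Hypothetical wrapper: query EC2 for one snapshot's volume permissions."""
    import boto3  # imported lazily so the pure check above needs no AWS SDK

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_snapshot_attribute(
        SnapshotId=snapshot_id,
        Attribute="createVolumePermission",
    )
    return snapshot_is_public(response)


# Hand-written sample responses (not real AWS output) to exercise the check.
public_sample = {"CreateVolumePermissions": [{"Group": "all"}]}
private_sample = {"CreateVolumePermissions": []}
print(snapshot_is_public(public_sample), snapshot_is_public(private_sample))
```

A snapshot built with "snapshot_groups": ["all"] would be expected to report the `all` group here, while one built with the default settings would report an empty permission list.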