Migrate containers to ARM

gaurijo commented 11 months ago

Are the docker containers currently running on x86_64?

I'm happy to get started on this issue - please let me know if there is additional context I should have first

adolski commented 11 months ago

They are--there are three main steps to this:

1. Update the Book Tracker's task definitions to specify ARM. Task definitions are an ECS concept that specifies the resources and other settings a task uses when it runs. We use Terraform to manage our AWS resources as code, and we keep our scripts in separate repositories:
   - https://github.com/UIUCLibrary/aws-book-tracker-demo-service
   - https://github.com/UIUCLibrary/aws-book-tracker-prod-service
2. The rails-container-scripts submodule used to build only x86 images until I changed it to dual-build x86 and ARM images. We would want to change it again to build only ARM images, which I think would just require reverting those changes. This step is optional, but if we aren't using x86 images anymore then we shouldn't build them, because doing so makes the build take longer.
3. Because Metaslurp also uses rails-container-scripts, we would need to change its task definitions as well and rebuild/redeploy it.

This issue would be a good way to learn more about how ECS and Terraform work. It's small, but deep.

gaurijo commented 11 months ago

Thanks. I found this resource for working with ARM workloads in AWS, and they lay out several ways to configure ARM CPU architecture for ECS task definitions (including using aws cli).

So I'd essentially want to include something like:

{
    "runtimePlatform": {
        "operatingSystemFamily": "LINUX",
        "cpuArchitecture": "ARM64"
    },
...
}

But instead of using aws cli or a different interface to configure the task definitions, I'd want to change/add a resource via Terraform script(s) in the repos you linked. Do I have that right?

adolski commented 11 months ago

That is correct. I believe the section of the terraform script that needs to be changed is the aws_ecs_task_definition in main.tf: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/ecs_task_definition#cpu_architecture
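
Once a change like this has been applied, one way to confirm that the registered task definition actually carries the new architecture is to query it with the AWS CLI. This is only a sketch, not part of the project's scripts: the family name book-tracker-demo is a hypothetical placeholder, and registered_arch and is_arm64 are helper names made up for illustration.

```shell
# Hypothetical task definition family name; substitute the real one.
FAMILY="book-tracker-demo"

# Print the CPU architecture recorded on the latest ACTIVE revision.
# In text output mode the AWS CLI prints "None" when the field is absent.
registered_arch() {
    aws ecs describe-task-definition \
        --task-definition "$1" \
        --query 'taskDefinition.runtimePlatform.cpuArchitecture' \
        --output text
}

# Succeed only when the reported value is exactly ARM64.
is_arm64() {
    [ "$1" = "ARM64" ]
}
```

A possible use would be `is_arm64 "$(registered_arch "$FAMILY")" && echo "task definition targets ARM"`, run after logging in to AWS.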

gaurijo commented 11 months ago

I cloned both repos and installed Terraform via Homebrew on my machine. Do I need to run any commands before I update any code? e.g.:

$ terraform init

Right now if I run that I get the following error:

Error: validating provider credentials: retrieving caller identity from STS: operation error STS: GetCallerIdentity, https response error StatusCode: 403, RequestID: 6dff0062-21a9-407b-8ccf-6bdaa6fe4b46, api error ExpiredToken: The security token included in the request is expired

adolski commented 11 months ago

Are you aws loginned?

I shared a Box folder containing secrets.tfvars files you'll need.

I haven't used Terraform myself in a long time, but terraform init sounds right. After that, terraform plan will show you what it's going to do, and terraform apply will do it.

If terraform plan shows a lot of changes, that probably means that the scripts are out of sync with the resources in AWS. But hopefully that isn't the case.
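
The init / plan sequence above could be sketched roughly as follows. This assumes secrets.tfvars sits in the working directory and is passed with -var-file; the function names plan_status and run_plan are invented for illustration. The -detailed-exitcode flag makes terraform plan exit with 0 when nothing would change and 2 when changes are pending, which gives a quick signal for whether the scripts are out of sync with AWS.

```shell
# Translate the exit status of `terraform plan -detailed-exitcode`:
# 0 = no changes, 2 = changes pending, anything else = error.
plan_status() {
    case "$1" in
        0) echo "in sync" ;;
        2) echo "changes pending" ;;
        *) echo "error" ;;
    esac
}

# Initialize the working directory, then preview changes without applying them.
# Assumption: secrets.tfvars is in the current directory.
run_plan() {
    terraform init || return 1
    terraform plan -var-file=secrets.tfvars -detailed-exitcode
    plan_status "$?"
}
```

If run_plan reports "changes pending" before any code has been edited, that suggests drift; otherwise terraform apply -var-file=secrets.tfvars would be the next step after making the change.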

gaurijo commented 11 months ago

Edited to Add:

I was able to get the terraform init command to work by running rm -f .terraform.lock.hcl and then running terraform init again.

After making sure I was logged in with aws and had included the secrets.tfvars file, I ran terraform init and got:

Initializing the backend...

Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.

Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file
- Reusing previous version of hashicorp/template from the dependency lock file
- Installing hashicorp/aws v4.52.0...
- Installed hashicorp/aws v4.52.0 (signed by HashiCorp)

Error: Incompatible provider version

Provider registry.terraform.io/hashicorp/template v2.2.0 does not have a package available for your current platform, darwin_arm64.

Provider releases are separate from Terraform CLI releases, so not all providers are available for all platforms. Other versions of
this provider may have different platforms supported.

I looked into this issue and found a solution that worked for others who received the same error, and ran the following commands:

arch -arm64 brew install kreuzwerker/taps/m1-terraform-provider-helper
m1-terraform-provider-helper activate
m1-terraform-provider-helper install hashicorp/template -v v2.2.0

which then gave me:

Successfully installed hashicorp/template v2.2.0

When I run terraform init again, I get:

Error: Failed to install provider

Error while installing hashicorp/template v2.2.0: the local package for registry.terraform.io/hashicorp/template 2.2.0 doesn't match any of the checksums previously recorded in the dependency lock file (this might be because the available checksums are for packages targeting different platforms)

So then I ran this command:

terraform providers lock -platform=linux_amd64 -platform=darwin_amd64

This gave me a successful message:

Fetching hashicorp/template 2.2.0 for linux_amd64...

Retrieved hashicorp/template 2.2.0 for linux_amd64 (signed by HashiCorp)

Fetching hashicorp/aws 5.21.0 for linux_amd64...

Retrieved hashicorp/aws 5.21.0 for linux_amd64 (signed by HashiCorp)

Fetching hashicorp/aws 5.21.0 for darwin_amd64...

Retrieved hashicorp/aws 5.21.0 for darwin_amd64 (signed by HashiCorp)

Fetching hashicorp/template 2.2.0 for darwin_amd64...

Retrieved hashicorp/template 2.2.0 for darwin_amd64 (signed by HashiCorp)

Obtained hashicorp/template checksums for linux_amd64; This was a new provider and the checksums for this platform are now tracked in the lock file

Obtained hashicorp/template checksums for darwin_amd64; This was a new provider and the checksums for this platform are now tracked in the lock file

Obtained hashicorp/aws checksums for linux_amd64; This was a new provider and the checksums for this platform are now tracked in the lock file

Obtained hashicorp/aws checksums for darwin_amd64; This was a new provider and the checksums for this platform are now tracked in the lock file

Success! Terraform has updated the lock file.

Review the changes in .terraform.lock.hcl and then commit to your
version control system to retain the new checksums.

~~But after committing and running terraform init again, I get the same error as before (Failed to install provider)~~

~~I'll keep digging on how to resolve this, but I welcome any suggestions!~~

gaurijo commented 11 months ago

Two updates:

1. I sent up a PR for updating the aws demo service using Terraform. If everything looks fine to you, I'll do the same thing for the prod service.
2. I pushed up a new commit to the master branch of the rails-container-scripts submodule that removes 'linux/amd64' from the buildx instruction. If this is not what you had in mind, I'll revert it and make any necessary changes.
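
For reference, the difference between a dual build and an ARM-only build comes down to the value passed to buildx's --platform flag. The sketch below is not the actual rails-container-scripts code; platforms_for and build_image are hypothetical helpers, and the image name passed in is whatever repository:tag the build would normally push.

```shell
# Return the --platform value for a build mode.
# "arm" builds only ARM images; "dual" builds x86 and ARM together.
platforms_for() {
    case "$1" in
        arm)  echo "linux/arm64" ;;
        dual) echo "linux/amd64,linux/arm64" ;;
        *)    echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Build and push an image for the chosen platforms.
# Arguments: image reference (hypothetical), then build mode.
build_image() {
    image="$1"
    mode="$2"
    platforms=$(platforms_for "$mode") || return 1
    docker buildx build --platform "$platforms" -t "$image" --push .
}
```

Dropping 'linux/amd64' corresponds to the "arm" mode here; reverting to the dual build is a one-word change.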

adolski commented 11 months ago

Looking good so far! Have you tried a build & deploy of an ARM Book Tracker image yet?

Let's get the demo environment fully migrated before moving onto production.

gaurijo commented 11 months ago

Not yet! I'll give it a go and get back to you.

gaurijo commented 11 months ago

Demo service keeps failing and starting/stopping.

I first ran the redeploy.sh demo script and saw the following errors in the aws logs:

I thought maybe I needed to build first and then deploy, so I ran docker-build.sh demo followed by ecs-deploy-webapp.sh demo. Everything looked fine on my machine (no errors with building image/pushing to aws). But the deploy failed again and the logs showed:

adolski commented 11 months ago

Okay, I think I know what's wrong. The Book Tracker's task definition actually defines two containers:

1. The Book Tracker container
2. A container running Apache+mod_shibboleth, which acts as a reverse proxy in front of the Book Tracker and provides the Shibboleth SP

The Book Tracker container is probably fine, but the other one is still x86, thus the "exec format error."

Unfortunately the architecture part of the task definition applies to all of its containers and can't be applied to just one.
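
Because one architecture setting covers every container in the task, it can help to check each image before deploying. The following is a rough sketch under a few assumptions: the images have already been pulled locally, and the helper names ecs_to_docker_arch and check_images are invented for this example. ECS spells the architectures X86_64 and ARM64, while Docker reports amd64 and arm64, so the first helper translates between the two.

```shell
# Map an ECS cpuArchitecture value to the name Docker uses for the same CPU.
ecs_to_docker_arch() {
    case "$1" in
        X86_64) echo "amd64" ;;
        ARM64)  echo "arm64" ;;
        *)      echo "unknown architecture: $1" >&2; return 1 ;;
    esac
}

# Report whether locally pulled images match the task's architecture.
# Arguments: ECS architecture, then one or more image references.
check_images() {
    want=$(ecs_to_docker_arch "$1") || return 1
    shift
    status=0
    for image in "$@"; do
        have=$(docker image inspect --format '{{.Architecture}}' "$image") || return 1
        if [ "$have" != "$want" ]; then
            echo "MISMATCH: $image is $have, task expects $want"
            status=1
        fi
    done
    return "$status"
}
```

Calling check_images ARM64 with both image references would flag any image that is still amd64, which is the situation that produces an "exec format error" at container start.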

So I guess we have a few options:

1. Stick with x86 for now
2. Replace omniauth-shibboleth in the Book Tracker with omniauth-saml (will require coordinating with iTrust)
   - After that's done, the Apache container can be removed and the ARM migration can proceed.
3. Rebuild the Apache image for ARM (will require coordinating with Library IT and may break other projects that are relying on it)

I think (2) is the best long-term option and I'm sure that Library IT (who wrote the Apache image builder tool) would appreciate it.

But whether or not we attempt (2) right now, we need to do (1) and revert the changes made thus far.

gaurijo commented 11 months ago

Ahh, okay. I was wondering why I was seeing different error messages than previously in the logs for the shib-frontend container. This makes more sense now.

Confirming I've reverted the changes, so rails-container-scripts/docker-build.sh now includes x86_64 again, and the Terraform script in aws-book-tracker-demo-service no longer specifies arm64 in the main.tf file (I also re-ran the terraform apply command).

Demo service is back up and running again:

adolski commented 11 months ago

Great! I will hopefully be able to look into the omniauth stuff soon. I've created #30 to track it.

adolski commented 11 months ago

@gaurijo Neither of the Book Trackers is using that x86 container anymore, so you should be able to proceed now.

(Make sure to pull the latest terraform code in aws-book-tracker-demo-service and aws-book-tracker-prod-service)

gaurijo commented 11 months ago

I also pulled down the latest code from the develop branch. When I run the tests I'm seeing some errors I hadn't seen before. Are these expected, or did something go wrong on my end?

adolski commented 11 months ago

I haven't seen those errors before. They would be stemming from this change: https://github.com/medusa-project/book-tracker/issues/32#issuecomment-1779900140

I think I found a bug in TempStore.client_options(). Try pulling the latest code and try again.

gaurijo commented 11 months ago

I pulled down the latest code but I'm still getting the same errors.

On the other hand, I implemented the changes to the aws-book-tracker-demo-service script, updated the rails-container-scripts docker buildx command, and the redeploy of demo was successful.

adolski commented 11 months ago

Excellent. You can probably move onto production now.

I don't know what that error is. Can you do either of these:

% bin/rails console
Loading development environment (Rails 7.1.1)
irb(main):001> TempStore.instance.bucket_exists?
=> true

% bin/rails console -e test
Loading test environment (Rails 7.1.1)
irb(main):001> TempStore.instance.bucket_exists?
=> true

Is minio running and is there anything interesting in the log? Are your config/credentials/development.yml and test.yml files correct, particularly the storage section?

gaurijo commented 11 months ago

I'm able to get a true output with both rails console environments. My config/credentials all seem correct as well.

When I try accessing minio however, I get blocked with the following error:

gaurijo commented 11 months ago

I'll keep digging into why these tests are failing on my end, but I'm going to close this issue since everything is migrated now.
