nrdg / cloudknot

A python library to run your existing code on AWS Batch
https://nrdg.github.io/cloudknot/

GitHub Actions - much simpler version #213

Closed arokem closed 4 years ago

arokem commented 4 years ago

This is an attempt to disentangle the issues we see in #208 by changing only the test workflow config to include AWS credentials. Once we have this pinned down, we can try to add the other things that #208 adds.

arokem commented 4 years ago

Nope, no joy. Checking this locally, only docker_reqs_ref_data gets installed into the cloudknot/data folder when running `python -m pip install .[dev]`.

arokem commented 4 years ago

And it looks like an editable install with `python -m pip install -e .[dev]` does include the right data, but then fails later with another error.
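For reference, the usual way to get data files into a regular (non-editable) install is to declare them as package data in the build config. A rough setuptools sketch along these lines (not necessarily exactly what our setup script needs):

```python
# setup.py -- minimal sketch only; the actual cloudknot setup script may differ.
from setuptools import setup, find_packages

setup(
    name="cloudknot",
    packages=find_packages(),
    # Without package_data / include_package_data, `pip install .` builds a
    # wheel that silently drops data files, while `pip install -e .` appears
    # to work because it runs straight from the source tree.
    include_package_data=True,
    package_data={"cloudknot": ["data/*", "data/*/*"]},
)
```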

arokem commented 4 years ago

Looks like the package data issue is resolved! 🎉

I think the remaining errors have to do with AWS permissions. I need to look at the logs to confirm.

richford commented 4 years ago

The cloudknot-ci role on AWS has the ec2:CreateDefaultVpc permission (see line 84 here). Is this the role assumed by the GitHub Action?
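For context, that permission can live in an inline policy along these lines (a sketch only; the role and policy names here are placeholders, not necessarily what cloudknot-ci actually uses):

```python
import json
import boto3

iam = boto3.client("iam")

# Illustrative inline policy granting the permission mentioned above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "ec2:CreateDefaultVpc", "Resource": "*"}
    ],
}

iam.put_role_policy(
    RoleName="cloudknot-ci",                    # placeholder role name
    PolicyName="allow-create-default-vpc",      # placeholder policy name
    PolicyDocument=json.dumps(policy),
)
```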

arokem commented 4 years ago

I added that role earlier today and force-pushed. I think that failure is now resolved. Two failures still remain; I will look into them later today.

arokem commented 4 years ago

Looks like only one failure remains. It's vexing because there is no trace of this failure in the AWS console (as far as I can tell).

richford commented 4 years ago

Okay, I think I understand. This is a bug in the tests. In test_pars_with_default_vpc, we supply role names that already exist in other stacks from other tests. By the time we get to test_pars_with_default_vpc, those other tests have presumably completed and passed, initiating stack deletion on exit from the test. But stack deletion takes a while to complete, and when test_pars_with_default_vpc starts, the roles haven't been deleted yet. So we're getting an error that essentially says "hey, you can't create a role with that name because it already exists in another stack."

Fixing this might be as simple as changing the service role names. I'll suggest an edit to this PR to that effect. If it fixes everything, then YAY! If not, I think we should open another issue to fix the test and merge this PR.
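If renaming the roles doesn't do it, another option would be to block until CloudFormation finishes tearing the previous stack down before the next test creates its resources. A rough sketch (the stack name here is made up):

```python
import boto3

cfn = boto3.client("cloudformation")


def wait_for_stack_deletion(stack_name):
    """Block until CloudFormation reports the stack as fully deleted."""
    waiter = cfn.get_waiter("stack_delete_complete")
    # Polls DescribeStacks until the stack is gone (raises on failure/timeout).
    waiter.wait(StackName=stack_name)


# e.g. in a test fixture's teardown, after requesting deletion:
# cfn.delete_stack(StackName="cloudknot-test-pars")   # hypothetical stack name
# wait_for_stack_deletion("cloudknot-test-pars")
```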

richford commented 4 years ago

Aha, okay we fixed one error and now have another. This time, we've exceeded the max number of VPCs per region. I remember that we had the same problem on the eScience account.

See https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html

I recommend that we merge this PR and open a new issue for the VPC limit, linking as well to #22, which would avoid testing on AWS entirely.
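For reference, a quick way to check how close a region is to the limit is to count the existing VPCs. A rough boto3 sketch (the region name is just an example):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

# Count all VPCs in the region, default and non-default alike.
vpcs = ec2.describe_vpcs()["Vpcs"]
print(f"{len(vpcs)} VPCs in this region")
for vpc in vpcs:
    print(vpc["VpcId"], "default" if vpc.get("IsDefault") else "non-default")
```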

arokem commented 4 years ago

OK. Let's do it!

arokem commented 4 years ago

We can ask to increase our VPC quota. Might help.

arokem commented 4 years ago

We might be running into issues when resources with the same name get created in quick succession by tests. I am seeing this in CloudFormation:

[Screenshot of the CloudFormation console, 2020-07-03]

Is that -1 necessary? Can we do -{random_uuid} instead?
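Something like this is what I have in mind (just a sketch; the name prefix is illustrative):

```python
import uuid


def unique_name(prefix):
    """Append a short random suffix so back-to-back test runs never collide."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


# e.g. unique_name("cloudknot-test-role") -> "cloudknot-test-role-3f9a1c2e"
```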

arokem commented 4 years ago

For the time being, I've asked to increase our quota to 25 VPCs per region.
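For the record, the same request can also be made through the Service Quotas API. A rough sketch (the quota code for "VPCs per Region" should be double-checked in the console or via list_service_quotas before using):

```python
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")  # example region

# Request an increase of the "VPCs per Region" quota to 25.
# NOTE: the quota code below is a best guess; verify it before relying on it.
response = quotas.request_service_quota_increase(
    ServiceCode="vpc",
    QuotaCode="L-F678F1CE",  # "VPCs per Region" (verify)
    DesiredValue=25.0,
)
print(response)
```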