openedx-unsupported / edx-analytics-configuration

GNU Affero General Public License v3.0

Randomize VPC subnets in the hopes of AWS using more than one #80

Closed bmedx closed 5 years ago

bmedx commented 5 years ago

@pwnage101 We have no visibility into how AWS is selecting the subnet, but it seems to select only one (or at least one extremely frequently). Randomizing the subnets seems to be the only thing we can do to influence that process, and it was easy enough to try.
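For illustration, here is a minimal Python sketch of the idea. It is not the actual change in this PR: it just shuffles a list of candidate subnet IDs before they would be handed to whatever provisions the cluster. The function name `randomized_subnets` and the subnet IDs are hypothetical.

```python
import random


def randomized_subnets(subnet_ids, rng=None):
    """Return a shuffled copy of subnet_ids, leaving the input list untouched."""
    rng = rng or random.Random()
    shuffled = list(subnet_ids)  # copy so the caller's list keeps its order
    rng.shuffle(shuffled)        # in-place shuffle of the copy
    return shuffled


# Hypothetical subnet IDs, for illustration only.
candidates = ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"]
print(randomized_subnets(candidates))
```

Passing a seeded `random.Random` makes the order reproducible for testing. Whether the order of the list affects which subnet EMR picks is exactly the unknown described above.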

brianhw commented 5 years ago

According to the AWS docs: "Choose one subnet (Availability Zone) or a range. Amazon EMR provisions capacity in the Availability Zone that is the best fit." I'm interpreting this to mean that AWS will choose the subnet that minimizes cost. While it's no longer straightforward to look at spot prices for EMR, one can look at EC2 spot prices at https://console.aws.amazon.com/ec2sp/v1/spot/home?region=us-east-1#. There, for example, it shows that most AZs are around the same cost, except for us-east-1b, which is maybe 40% higher.

So I have to say that I'm disappointed that EMR's subnet choice doesn't take into account the IP-address capacity of the subnet. In the old days, we manually tweaked which job was running on which subnet, to try to stay within the 256-instance-per-subnet limit. With a shift to larger instances, we found we needed fewer instances overall, and haven't run into a subnet limitation in a long while.
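As a rough sketch of the capacity arithmetic: the 256 figure matches the address count of a /24 subnet, so the example below assumes /24 subnets (the thread does not state the actual CIDR blocks, and `10.0.0.0/24` is a placeholder). AWS reserves five addresses in every subnet, so the usable count is slightly lower. The helper name `usable_addresses` is hypothetical.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr):
    """Return how many addresses in the subnet remain after AWS reservations."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


# Placeholder CIDR block, assumed for illustration only.
print(usable_addresses("10.0.0.0/24"))  # 256 - 5 = 251
```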

That said, it surprises me that we're hitting it now. Are there a lot of new jobs that have come online recently? And do they use a lot of instances?

bmedx commented 5 years ago

> That said, it surprises me that we're hitting it now. Are there a lot of new jobs that have come online recently? And do they use a lot of instances?

I'm not sure how I would find out when jobs came online, but we have been adding Snowflake jobs, Affiliate Window, etc. Given that we're running into cost alarms on the EMR reporter job, I don't think it's too big of a leap to conclude that we're running more clusters and/or running them longer. I'll merge it and monitor to see what happens.