villasv / aws-airflow-stack

Turbine: the bare metals that gets you Airflow
https://victor.villas/aws-airflow-stack/
MIT License

Error - Connection Refused on clean install via CloudFormation. #142

Closed (cgddrd closed this 4 years ago)

cgddrd commented 4 years ago

Every time I run the CF stack, the web server seems really ropey. Sometimes I can connect to the Web UI, but other times I simply get a 'Connection refused' error. I'm certain it's not my VPC/subnet/SG configuration (and besides, it's using the base config provided by the stack anyway).

This is really frustrating as I'd love to use this stack. Any help greatly appreciated. Thanks.

villasv commented 4 years ago

By “sometimes”, do you mean different attempts against the same stack (sometimes the webserver responds) or different deployments (sometimes a stack has a completely unresponsive webserver)?

I’ve never experienced this, but the fact that it’s intermittent indicates that it might be a resource load problem. Maybe you could try choosing a bigger instance type for the webserver, like a t3.medium?

amizzo87 commented 4 years ago

I'm also experiencing this issue. Clean (successful) build from CloudFormation, but I can't access the web server. After SSH'ing into the machine, it seems the airflow service is crashing, but I can't find where the log files are to debug why.

villasv commented 4 years ago

Maybe related to this: https://github.com/villasv/aws-airflow-stack/issues/149

I've indeed verified that the race condition does occur "sometimes" (once in a few new deployments), and when it does, the services won't start on the scheduler or the webserver.

If you're interested in debugging the initial deployment process done by cloudformation, you can peek at the log files in /var/log/cfn*
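As a rough sketch of that log check (not something taken from the stack itself): the helper below lists the `cfn*` files in a log directory and prints the lines that mention errors or failures. The function name `scan_cfn_logs` and the `error|fail` search pattern are assumptions made for illustration; the directory defaults to /var/log as mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: list the CloudFormation bootstrap logs and print
# the lines that look like errors. The search pattern is only a starting guess.
scan_cfn_logs() {
    log_dir="${1:-/var/log}"

    # Show which cfn* log files exist on this instance.
    ls -l "$log_dir"/cfn* 2>/dev/null

    # Print matching lines, prefixed with file name (-H) and line number (-n).
    grep -inHE 'error|fail' "$log_dir"/cfn* 2>/dev/null
}

# Example call on the instance (may need sudo, depending on file permissions):
# scan_cfn_logs /var/log
```

If nothing matches, reading the files in full is the safer next step, since a failure may be logged without either keyword.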

cgddrd commented 4 years ago

Hi @villasv - I've been doing a lot of work recently with your turbine stack - which I must say is really excellent!

I believe I've found the cause of the symptoms you describe in #149, and will soon be raising a PR with suggested fixes. There are two key factors I think are at play here:

  1. The launch config for the three tiers (webserver, scheduler and workers) should use cfn-signal so that CFN knows when the User Data script has finished running (AWS recommends this practice).

  2. In my own testing (which there's been a lot of recently!), I found that cfn-init would sometimes randomly fail with the message "Unknown error retrieving SharedCloudInitMetadata.", causing Airflow to not be correctly installed. I never got to the bottom of why this happened (even after looking through all the logs: cfn-init.log and user-data.log). Weirdly, though, explicitly including --configsets default in the call to cfn-init appears to have stopped it: after at least 15 stack re-creations I've not been able to replicate the issue, whereas it used to happen every couple of attempts.
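The two points above could be sketched in a User Data script roughly as follows. This is a minimal illustration, not the actual change made to the stack: the function name `bootstrap_instance` and all argument values are hypothetical placeholders, and a real template would substitute the stack name, resource IDs and region itself. It assumes the resource being signalled (for example the Auto Scaling group) has a CreationPolicy, since cfn-signal only has an effect when CloudFormation is waiting for a signal.

```shell
#!/bin/sh
# Hypothetical bootstrap step for User Data.
# Arguments (all placeholders): stack name, resource holding the cfn-init
# metadata, resource that waits for the signal, and AWS region.
bootstrap_instance() {
    stack="$1"
    init_resource="$2"
    signal_resource="$3"
    region="$4"

    # Run cfn-init with the configset named explicitly, as suggested above.
    status=0
    cfn-init -v --stack "$stack" --resource "$init_resource" \
        --region "$region" --configsets default || status=$?

    # Report the exit code so CloudFormation knows whether bootstrap succeeded.
    cfn-signal -e "$status" --stack "$stack" --resource "$signal_resource" \
        --region "$region"

    return "$status"
}

# Example call at the end of User Data (values are placeholders):
# bootstrap_instance "my-stack" "LaunchConfiguration" "AutoScalingGroup" "us-east-1"
```

Passing the cfn-init exit code to cfn-signal means a failed install marks the resource as failed during stack creation, rather than leaving an instance whose services never start.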

I think this issue is related: the Airflow webserver service has not been correctly installed/configured, so the service fails to start, which leads to the 'Connection refused' error.

I'm going to close this issue as I don't believe it is a separate problem; it is instead linked to what I've just been discussing (which I'll raise as its own PR).

Thanks.

cgddrd commented 4 years ago

P.S. I've also added support for CloudWatch log monitoring of key installation log files (cfn-init.log, user-data.log), somewhat related to #124, which I'll raise in a second PR.