nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Locked out at random. Never to recover. #357

Closed dorienh closed 1 year ago

dorienh commented 1 year ago

This has happened to me multiple times: I work with a cluster and everything goes well for many hours. Then I run ~/hadoop/sbin/stop-all.sh and type 'exit'. The next day, flintrock login times out, and plain ssh times out as well. I have to throw away the cluster and reinstall.

Is there any reason this could happen? I have already rebooted the instances in the EC2 dashboard. Is something locking out my SSH access?

I keep my AWS credentials in ~/.aws/credentials up to date (I am using an AWS Academy account).

nchammas commented 1 year ago

Are you seeing in the AWS Console that the instances (especially the master instance) are still online when you try to SSH back in?

If SSH is timing out, then it seems like either the instance is down, or the security group rules changed somehow, or the SSH keys were changed somehow. There could be other possibilities of course, but of those three I feel like the first is most likely so that's where I would start.
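To narrow down which of these cases applies, it can help to separate a timeout from a refused connection. A timeout usually means packets are being dropped somewhere (security group, network ACL, firewall, or an unreachable host), while a refusal means the host answered but nothing is listening on the port. The sketch below uses only the Python standard library; the function name probe_tcp is made up for illustration and is not part of Flintrock.

```python
import socket


def probe_tcp(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Classify one TCP connection attempt.

    Returns "open", "refused", "timeout", or "error: <details>".
    Hypothetical helper for illustration; not a Flintrock API.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all: packets are most likely dropped on the way.
        return "timeout"
    except ConnectionRefusedError:
        # The host replied with a reset: reachable, but no listener on the port.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"
```

For example, probe_tcp("54.205.234.98") returning "timeout" from several different networks would point away from the local machine and toward the rules in front of the instance.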

dorienh commented 1 year ago

Yes, I can see in the AWS console that the instances are still running.

I just rebooted my Mac and somehow I could log in again. Not sure whether SSH was blocking me? I hope it doesn't happen again.

nchammas commented 1 year ago

Perhaps it's something weird happening with your network? Anyway, feel free to reopen this issue if you have more resolution on where the problem might be coming from.

dorienh commented 1 year ago

Actually, it came back again the next day, and I can't fix it now. I am at university on Eduroam and also tried a mobile hotspot. Same thing.

Even if I log in with plain ssh, it times out. My instances are running and pass 2/2 status checks.

When I run flintrock launch cluster8, it now gives me:

opt/homebrew/lib/python3.10/site-packages/paramiko/transport.py:219: CryptographyDeprecationWarning: Blowfish has been deprecated
  "class": algorithms.Blowfish,
2023-03-31 09:28:20,913 - flintrock.ec2       - INFO  - Launching 4 instances...
2023-03-31 09:28:37,173 - flintrock.ec2       - DEBUG - 4 instances not in state 'running': 'i-09436562527109c97', 'i-09688b6b7e02ff6c0', 'i-0386eb57946f277df', ...
2023-03-31 09:28:42,171 - flintrock.ec2       - DEBUG - 4 instances not in state 'running': 'i-09436562527109c97', 'i-0386eb57946f277df', 'i-073c7c86259375c10', ...
2023-03-31 09:28:46,037 - flintrock.ec2       - DEBUG - 4 instances not in state 'running': 'i-09436562527109c97', 'i-0386eb57946f277df', 'i-073c7c86259375c10', ...

This repeats endlessly.
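To look up the stuck instances in the console, it helps to have their IDs in one list. The following is a minimal sketch that pulls instance IDs out of DEBUG lines like the ones above. The function name pending_instance_ids is hypothetical, and because the log truncates each list with "...", only the IDs that are actually printed can be recovered.

```python
import re

# EC2 instance IDs are "i-" followed by 8 or 17 hexadecimal characters.
INSTANCE_ID = re.compile(r"\bi-[0-9a-f]{8,17}\b")


def pending_instance_ids(log_text: str) -> list[str]:
    """Collect the instance IDs reported as not yet running.

    IDs are returned once each, in order of first appearance.
    Hypothetical helper for illustration; not part of Flintrock.
    """
    seen: dict[str, None] = {}
    for line in log_text.splitlines():
        # Only the polling lines are of interest here.
        if "not in state 'running'" not in line:
            continue
        for instance_id in INSTANCE_ID.findall(line):
            seen.setdefault(instance_id, None)
    return list(seen)
```

The resulting IDs can be pasted into the EC2 console search box to see each instance's state and status checks.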

nchammas commented 1 year ago

What do you see in the console for these instances that don't seem to launch normally?

dorienh commented 1 year ago

Allow me a few days to see if it reappears. I ended up doing a full reset of the AWS environment (I was using a Learner Lab). Is there anything in particular I should be sure to check in the console?

dorienh commented 1 year ago

It just happened again.

flintrock login cluster
/opt/homebrew/lib/python3.10/site-packages/paramiko/transport.py:219: CryptographyDeprecationWarning: Blowfish has been deprecated
  "class": algorithms.Blowfish,
ssh: connect to host 54.205.234.98 port 22: Operation timed out

My ~/.aws/credentials are up to date.

Looking at the dashboard, I see nothing special. See screenshot here.

This is a fresh cluster. I only logged in once after launch, set up pydoop/yarn, and then exited. A few days later I tried to log in, and it keeps timing out.

Anything else I can check or reports I can generate?

I also tried the AWSSupport-TroubleshootSSH automation in Systems Manager, but the process got stuck and some steps failed:

Step ID | Step # | Step name | Action | Status | Start time | End time
-- | -- | -- | -- | -- | -- | --
cda7fc3a-cee3-4f29-9c4c-b04734b91d65 | 1 | assertInstanceIsManagedInstance | aws:assertAwsResourceProperty | Failed | Wed, 05 Apr 2023 01:38:00 GMT | Wed, 05 Apr 2023 01:38:00 GMT
3cb6b036-5b86-40ef-a54c-07dade48c315 | 2 | assertAllowOffline | aws:assertAwsResourceProperty | Success | Wed, 05 Apr 2023 01:38:01 GMT | Wed, 05 Apr 2023 01:38:01 GMT
f28f0344-e556-46bd-8634-00f96827e9c6 | 3 | assertActionIsFixAll | aws:assertAwsResourceProperty | Success | Wed, 05 Apr 2023 01:38:01 GMT | Wed, 05 Apr 2023 01:38:02 GMT
0dcc8230-715c-4514-84e3-7d81042ec3bf | 4 | assertSubnetId | aws:assertAwsResourceProperty | Success | Wed, 05 Apr 2023 01:38:02 GMT | Wed, 05 Apr 2023 01:38:02 GMT
b88be25f-d38f-446f-af2d-b6c308912c6d | 5 | describeSourceInstance | aws:executeAwsApi | Success | Wed, 05 Apr 2023 01:38:03 GMT | Wed, 05 Apr 2023 01:38:03 GMT
74f0b54d-c5ee-4ac0-9858-c82efd345bdc | 6 | troubleshootSSHOfflineWithSubnetId | aws:executeAutomation | Failed | Wed, 05 Apr 2023 01:38:03 GMT | Wed, 05 Apr 2023 01:38:12 GMT
f631870d-2d3b-40e7-b2df-0d7d4dc7419a | 7 | installEC2Rescue | aws:runCommand | Pending | - | -
28666d80-ecaa-437f-a668-f2cfece634d2 | 8 | troubleshootSSH | aws:runCommand | Pending | - | -
1822d358-a7e4-4208-a99f-ee1a6b109fbc | 9 | troubleshootSSHOffline | aws:executeAutomation | Pending | - | -

nchammas commented 1 year ago

The lockout after some days smells like a firewall or network issue that's independent of Flintrock. But I'm not sure how to debug this because it could be so many different things. Do you have an AWS admin who can help you investigate?
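One firewall-level cause that would fit this pattern is a security group rule that only admits the IP address the client had when the cluster was launched, so a later login from a different address is silently dropped. That is an assumption and is not confirmed anywhere in this thread. The sketch below checks it offline: it takes a list shaped like the IpPermissions field that boto3's describe_security_groups returns and reports whether a given client address may reach port 22. The function name ssh_allowed is hypothetical, and the check only looks at IPv4 CIDR ranges (it ignores Ipv6Ranges, prefix lists, and group-to-group rules).

```python
import ipaddress


def ssh_allowed(ip_permissions: list[dict], client_ip: str, port: int = 22) -> bool:
    """Return True if any ingress rule admits client_ip on the given TCP port.

    ip_permissions is assumed to follow the shape of the "IpPermissions"
    list in a boto3 describe_security_groups response.
    Hypothetical helper for illustration; only IPv4 "IpRanges" are checked.
    """
    client = ipaddress.ip_address(client_ip)
    for perm in ip_permissions:
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            port_matches = True
        elif protocol == "tcp":
            port_matches = perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)
        else:
            continue
        if not port_matches:
            continue
        for ip_range in perm.get("IpRanges", []):
            network = ipaddress.ip_network(ip_range["CidrIp"], strict=False)
            if client in network:
                return True
    return False
```

If this returns False for the address the laptop currently has (for example on Eduroam or a mobile hotspot) but True for the address used at launch, the security group rules are the place to look.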