Closed by dorienh 1 year ago
Are you seeing in the AWS Console that the instances (especially the master instance) are still online when you try to SSH back in?
If SSH is timing out, then it seems like either the instance is down, or the security group rules changed somehow, or the SSH keys were changed somehow. There could be other possibilities of course, but of those three I feel like the first is most likely so that's where I would start.
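One way to narrow down those three possibilities is to look at how the TCP connection to port 22 fails. Below is a minimal sketch, assuming only Python's standard library; `probe_tcp` is a hypothetical helper name, not part of Flintrock. A timeout usually means packets are being silently dropped (security group, firewall, routing, or the instance being down), a refusal means the host answered but nothing is listening on the port, and a successful connect points toward keys or authentication rather than the network.

```python
import socket
import sys


def probe_tcp(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        # create_connection performs the full TCP handshake or raises.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied with a reset: reachable, but no listener on this port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No reply at all within the timeout: packets are likely being dropped.
        return "timeout"
    except OSError as exc:
        # Anything else (no route to host, DNS failure, ...).
        return f"error: {exc}"


if __name__ == "__main__":
    # Usage: python probe.py <host> [port]
    target = sys.argv[1]
    target_port = int(sys.argv[2]) if len(sys.argv) > 2 else 22
    print(probe_tcp(target, target_port))
```

Running it against the master's public IP from the affected machine, and again from a different network, would show whether the failure depends on where the client is connecting from.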
Yes I can see in the aws console that the instances are still running.
I just rebooted my Mac and somehow I could log in again. Not sure if SSH was blocking me or something. Hope it doesn't happen again.
Perhaps it's something weird happening with your network? Anyway, feel free to reopen this issue if you have more resolution on where the problem might be coming from.
Actually, it came back again the next day, and this time I can't fix it. I am at university on Eduroam and also tried a mobile hotspot. Same thing.
Even logging in with plain ssh times out. My instances are running and pass 2/2 status checks.
When I run flintrock launch cluster8, it now gives me:
opt/homebrew/lib/python3.10/site-packages/paramiko/transport.py:219: CryptographyDeprecationWarning: Blowfish has been deprecated
"class": algorithms.Blowfish,
2023-03-31 09:28:20,913 - flintrock.ec2 - INFO - Launching 4 instances...
2023-03-31 09:28:37,173 - flintrock.ec2 - DEBUG - 4 instances not in state 'running': 'i-09436562527109c97', 'i-09688b6b7e02ff6c0', 'i-0386eb57946f277df', ...
2023-03-31 09:28:42,171 - flintrock.ec2 - DEBUG - 4 instances not in state 'running': 'i-09436562527109c97', 'i-0386eb57946f277df', 'i-073c7c86259375c10', ...
2023-03-31 09:28:46,037 - flintrock.ec2 - DEBUG - 4 instances not in state 'running': 'i-09436562527109c97', 'i-0386eb57946f277df', 'i-073c7c86259375c10', ...
endlessly...
What do you see in the console for these instances that don't seem to launch normally?
Allow me a few days to see if it re-appears. I ended up doing a full reset of the aws environment (was using a learners lab). Any command in particular I should be sure to check in the console?
It just happened again.
flintrock login cluster
/opt/homebrew/lib/python3.10/site-packages/paramiko/transport.py:219: CryptographyDeprecationWarning: Blowfish has been deprecated
"class": algorithms.Blowfish,
ssh: connect to host 54.205.234.98 port 22: Operation timed out
My ~/.aws/credentials are up to date.
Looking at the dashboard I see nothing special. See screenshot here.
This is a fresh cluster. I only logged in once after launch, set up pydoop/yarn, then exited. A few days later I tried to log in and it keeps timing out.
Anything else I can check or reports I can generate?
I also tried the AWSSupport-TroubleshootSSH automation in Systems Manager, but that process got stuck and some steps failed:
Execution ID | Step # | Step name | Action | Status | Start time | End time
-- | -- | -- | -- | -- | -- | --
cda7fc3a-cee3-4f29-9c4c-b04734b91d65 | 1 | assertInstanceIsManagedInstance | aws:assertAwsResourceProperty | Failed | Wed, 05 Apr 2023 01:38:00 GMT | Wed, 05 Apr 2023 01:38:00 GMT
3cb6b036-5b86-40ef-a54c-07dade48c315 | 2 | assertAllowOffline | aws:assertAwsResourceProperty | Success | Wed, 05 Apr 2023 01:38:01 GMT | Wed, 05 Apr 2023 01:38:01 GMT
f28f0344-e556-46bd-8634-00f96827e9c6 | 3 | assertActionIsFixAll | aws:assertAwsResourceProperty | Success | Wed, 05 Apr 2023 01:38:01 GMT | Wed, 05 Apr 2023 01:38:02 GMT
0dcc8230-715c-4514-84e3-7d81042ec3bf | 4 | assertSubnetId | aws:assertAwsResourceProperty | Success | Wed, 05 Apr 2023 01:38:02 GMT | Wed, 05 Apr 2023 01:38:02 GMT
b88be25f-d38f-446f-af2d-b6c308912c6d | 5 | describeSourceInstance | aws:executeAwsApi | Success | Wed, 05 Apr 2023 01:38:03 GMT | Wed, 05 Apr 2023 01:38:03 GMT
74f0b54d-c5ee-4ac0-9858-c82efd345bdc | 6 | troubleshootSSHOfflineWithSubnetId | aws:executeAutomation | Failed | Wed, 05 Apr 2023 01:38:03 GMT | Wed, 05 Apr 2023 01:38:12 GMT
f631870d-2d3b-40e7-b2df-0d7d4dc7419a | 7 | installEC2Rescue | aws:runCommand | Pending | - | -
28666d80-ecaa-437f-a668-f2cfece634d2 | 8 | troubleshootSSH | aws:runCommand | Pending | - | -
1822d358-a7e4-4208-a99f-ee1a6b109fbc | 9 | troubleshootSSHOffline | aws:executeAutomation | Pending | - | -
The lockout after some days smells like a firewall or network issue that's independent of Flintrock. But I'm not sure how to debug this because it could be so many different things. Do you have an AWS admin who can help you investigate?
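One concrete firewall-level thing to rule out is whether the cluster's security group still admits SSH from the address the client is connecting from now. If the inbound rule only lists specific addresses, a client whose public IP has changed (for example after switching between Eduroam and a hotspot) would see exactly this kind of timeout. The sketch below is only an illustration: `ssh_allowed` is a hypothetical helper, the rule dictionaries are assumed to follow the `IpPermissions` shape that boto3's `describe_security_groups` returns, and the addresses are documentation placeholders. It checks IPv4 CIDR ranges only and ignores rules that reference other security groups or prefix lists.

```python
import ipaddress


def ssh_allowed(ip_permissions, client_ip, port=22):
    """Return True if any inbound rule admits client_ip on the given TCP port."""
    addr = ipaddress.ip_address(client_ip)
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        # "-1" means all protocols and all ports; otherwise we need a TCP rule.
        if proto not in ("tcp", "-1"):
            continue
        if proto == "tcp":
            from_port = perm.get("FromPort", 0)
            to_port = perm.get("ToPort", 65535)
            if not (from_port <= port <= to_port):
                continue
        for ip_range in perm.get("IpRanges", []):
            network = ipaddress.ip_network(ip_range["CidrIp"], strict=False)
            if addr in network:
                return True
    return False


# Placeholder rule: SSH is open only to a single address.
example_rules = [
    {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.7/32"}],
    }
]

print(ssh_allowed(example_rules, "203.0.113.7"))    # the address in the rule
print(ssh_allowed(example_rules, "198.51.100.20"))  # a different client address
```

Comparing the inbound rules shown in the EC2 console against the client's current public IP gives the same answer by hand.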
It has happened to me multiple times that I am working with a cluster and all goes well for many hours. Then I run ~/hadoop/sbin/stop-all.sh and type 'exit'. The next day, flintrock login times out, and plain ssh times out too. I have to throw away the cluster and reinstall.
Is there any reason this could happen? I have already rebooted the instances in the EC2 dashboard. Is it locking out my ssh?
I do update my AWS credentials in ~/.aws/credentials (I'm using an AWS Academy account).