nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

i4i instance type cluster fails to restart #376

Open a-cesari opened 2 months ago

a-cesari commented 2 months ago

Hi, I'm having issues when stopping and restarting the cluster. Stopping works fine (i.e. flintrock stop my-cluster). However, when I try to start it again (flintrock start my-cluster), the instances fail 1 of the 2 sanity checks, they cannot be reached even with console SSH login, and the cluster won't start. I'm guessing it is something related to the ephemeral storage because (as you can see from the system log below) the instance goes into a "recovery mode" due to errors about an ext4 filesystem not being found:

Mounting /media/ephemeral0...

[    4.953751] EXT4-fs (nvme1n1): VFS: Can't find ext4 filesystem

Do you have any idea what is going on? Thanks for your kind help. Andrea

Here is a more complete log file. My Flintrock config follows it.

        Starting Apply Kernel Variables...

[  OK  ] Started Apply Kernel Variables.

[  OK  ] Created slice system-ec2net\x2difup.slice.

         Starting Relabel kernel modules early in the boot, if needed...

[  OK  ] Started Relabel kernel modules early in the boot, if needed.

[  OK  ] Found device Elastic Network Adapter (ENA).

[  OK  ] Started Monitoring of LVM2 mirrors,...ng dmeventd or progress polling.

[  OK  ] Reached target Local File Systems (Pre).

         Mounting /media/ephemeral0...

[    4.953751] EXT4-fs (nvme1n1): VFS: Can't find ext4 filesystem
[FAILED] Failed to mount /media/ephemeral0.

See 'systemctl status media-ephemeral0.mount' for details.

[DEPEND] Dependency failed for Local File Systems.

[DEPEND] Dependency failed for Migrate local... structure to the new structure.

[DEPEND] Dependency failed for Relabel all filesystems, if necessary.

[DEPEND] Dependency failed for Mark the need to relabel after reboot.

         Starting Preprocess NFS configuration...

[  OK  ] Reached target Timers.

[  OK  ] Reached target Network (Pre).

[  OK  ] Reached target Cloud-init target.

[  OK  ] Reached target Network.

         Starting Initial cloud-init job (metadata service crawler)...

[  OK  ] Reached target Login Prompts.

[  OK  ] Reached target Paths.

[  OK  ] Reached target Sockets.

         Starting Create Volatile Files and Directories...

         Starting Tell Plymouth To Write Out Runtime Data...

[  OK  ] Started Emergency Shell.

[  OK  ] Reached target Emergency Mode.

[  OK  ] Started Preprocess NFS configuration.

[  OK  ] Started Create Volatile Files and Directories.

         Starting RPC bind service...

         Mounting RPC Pipe File System...

         Starting Security Auditing Service...

[    5.025955] RPC: Registered named UNIX socket transport module.
[    5.025956] RPC: Registered udp transport module.
[    5.025957] RPC: Registered tcp transport module.
[    5.025957] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  OK  ] Started RPC bind service.

[  OK  ] Mounted RPC Pipe File System.

[  OK  ] Started Security Auditing Service.

         Starting Update UTMP about System Boot/Shutdown...

[  OK  ] Reached target rpc_pipefs.target.

[  OK  ] Reached target NFS client services.

[  OK  ] Reached target Remote File Systems (Pre).

[  OK  ] Reached target Remote File Systems.

[  OK  ] Started Update UTMP about System Boot/Shutdown.

         Starting Update UTMP about System Runlevel Changes...

[  OK  ] Started Update UTMP about System Runlevel Changes.

[  OK  ] Started Tell Plymouth To Write Out Runtime Data.

[  OK  ] Started udev Wait for Complete Device Initialization.

         Starting Activation of DM RAID sets...

[    5.305390] device-mapper: uevent: version 1.0.3
[    5.309580] device-mapper: ioctl: 4.43.0-ioctl (2020-10-01) initialised: dm-devel@redhat.com
[  OK  ] Started Activation of DM RAID sets.

[  OK  ] Reached target Local Encrypted Volumes.

[    4.977500] cloud-init[2346]: Cloud-init v. 19.3-46.amzn2.0.1 running 'init' at Thu, 02 May 2024 18:43:55 +0000. Up 4.95 seconds.

[    4.993484] cloud-init[2346]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++

[    4.997062] cloud-init[2346]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+

[    4.997895] cloud-init[2346]: ci-info: | Device |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |

[    4.997985] cloud-init[2346]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+

[    4.999620] cloud-init[2346]: ci-info: |  eth0  | False |     .     |     .     |   .   | (masked by me) |

[    5.013097] cloud-init[2346]: ci-info: |   lo   |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |

[    5.016240] cloud-init[2346]: ci-info: |   lo   |  True |  ::1/128  |     .     |  host |         .         |

[    5.017904] cloud-init[2346]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+

[    5.018004] cloud-init[2346]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++

[    5.021742] cloud-init[2346]: ci-info: +-------+-------------+---------+-----------+-------+

[    5.021831] cloud-init[2346]: ci-info: | Route | Destination | Gateway | Interface | Flags |

[    5.023449] cloud-init[2346]: ci-info: +-------+-------------+---------+-----------+-------+

[    5.044822] cloud-init[2346]: ci-info: +-------+-------------+---------+-----------+-------+

[  OK  ] Started Initial cloud-init job (metadata service crawler).

[  OK  ] Reached target Cloud-config availability.

[  OK  ] Reached target Network is Online.

         Starting Notify NFS peers of a restart...

[  OK  ] Started Notify NFS peers of a restart.

Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or ^D to
try again to boot into default mode.

Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.

Press Enter to continue.

services:
  spark:
    version: 3.5.1
    download-source: "s3://xxxx/flintrock/spark/spark-{v}/"
    # executor-instances: 1
  hdfs:
    version: 3.3.6
    download-source: "s3://xxxx/flintrock/hadoop/hadoop-{v}/"
provider: ec2

providers:
  ec2:
    key-name: xxx
    identity-file: /home/xxx/spark/xxx.pem
    instance-type: i4i.xlarge
    #instance-type: m5d.large
    region: eu-central-1
    # availability-zone: <name>
    ami: ami-0a946522147cbcbcc  # Amazon Linux 2, us-east-1
    user: ec2-user
    # spot-price: <price>
    vpc-id: *masked*
    subnet-id: *masked*
    # placement-group: <name>
    security-groups:
     - sg_xxx
    #   - group-name2
    instance-profile-name: role_xx
    tags:
      - owner,spark_cluster
    #   - key2, value2  # leading/trailing spaces are trimmed
    #   - key3,  # value will be empty
    # min-root-ebs-size-gb: <size-gb>
    tenancy: default  # default | dedicated
    ebs-optimized: no  # yes | no
      #min-root-ebs-size-gb: 120
    instance-initiated-shutdown-behavior: terminate  # terminate | stop
    user-data: /home/ec2-user/spark/user-data.sh
    # authorize-access-from:
    #   - 10.0.0.42/32
    #   - sg-xyz4654564xyz

launch:
  num-slaves: 1
  install-hdfs: True
  install-spark: True
  # java-version: 8

debug: true

nchammas commented 1 month ago

What is ami-0a946522147cbcbcc? Is it one of the default Amazon Linux AMIs provided by Amazon? If not, could you try one of those, please?

a-cesari commented 1 month ago

Hi @nchammas , yes it's an official Amazon Linux 2 image

a-cesari commented 1 month ago

If you have a known working combination of instance type and AMI, I can try it to check whether the problem is related to the AMI or to the instance type.

nchammas commented 1 month ago

> Hi @nchammas , yes it's an official Amazon Linux 2 image

Can you show me where exactly you are seeing that? I am not able to find mention of this AMI in the official listing from Amazon.

I just tried to launch, stop, and then start a cluster using ami-0588935a949f9ff17 and it worked fine for me.

a-cesari commented 1 month ago

I can only use AMIs in eu-central-1, and I can't find the one you mention in that region. I have now tried this one (it was probably updated in the last few days), but I still get the same problem:

[image]

nchammas commented 1 month ago

I'm not sure where ami-0578f46b79ca9e3e7 is coming from, either. Please try an AMI returned by this list:

aws ec2 describe-images \
    --region eu-central-1 \
    --owners amazon \
    --filters \
        "Name=name,Values=amzn2-ami-hvm-*-gp2" \
        "Name=root-device-type,Values=ebs" \
        "Name=virtualization-type,Values=hvm" \
        "Name=architecture,Values=x86_64" \
    --query \
        'reverse(sort_by(Images, &CreationDate))[:100].{CreationDate:CreationDate,ImageId:ImageId,Name:Name,Description:Description}'

Please also try a different instance type, like m6i.large. Different instance types have different storage configurations. Flintrock is tested against a very small set of the possible storage configurations.
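
To compare storage configurations across instance types, the AWS CLI can report whether a type has local instance store volumes. The following is only a sketch: the function name `describe_instance_storage` is made up for illustration, and it assumes a reasonably recent AWS CLI with credentials configured for eu-central-1.

```shell
# Hypothetical helper (not part of Flintrock): show instance store details
# for the instance types passed as arguments.
describe_instance_storage() {
    aws ec2 describe-instance-types \
        --region eu-central-1 \
        --instance-types "$@" \
        --query 'InstanceTypes[].{Type:InstanceType,InstanceStore:InstanceStorageSupported,TotalGB:InstanceStorageInfo.TotalSizeInGB,Nvme:InstanceStorageInfo.NvmeSupport}' \
        --output table
}

# Usage:
#   describe_instance_storage i4i.xlarge m6i.large m5.large
```

Types that report no instance store are EBS-only, so they have no ephemeral volume to mount at boot.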

a-cesari commented 1 month ago

Hi, thanks for the suggestion. Indeed, the problem is tied to the instance type. The following combinations now work in my case, except for the last one:

| instance_type | ami | launch | destroy | restart (stop + start) |
| --- | --- | --- | --- | --- |
| m6i.large | ami-0121de3d416d6f6a2 | yes | yes | yes |
| m6i.large | ami-0578f46b79ca9e3e7 | yes | yes | yes |
| m5.large | ami-0578f46b79ca9e3e7 | yes | yes | yes |
| i4i.xlarge | ami-0578f46b79ca9e3e7 | yes | yes | NO |

It would be nice to understand what is different about the storage configuration of the i4i. However, it's not a big issue for me, since I can use other instance types. Thanks a lot for the support. Feel free to close the issue if you wish.

Andrea

nchammas commented 1 month ago

I will leave the issue open and re-title it to focus on this storage-related problem. Flintrock should handle it more gracefully, even if we don't support it.
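
One plausible explanation, not confirmed in this thread: i4i.xlarge has a local NVMe instance store volume, while m6i.large and m5.large are EBS-only, and data on EC2 instance store volumes does not survive a stop and start. That would match the boot log, where nvme1n1 no longer carries an ext4 filesystem when /media/ephemeral0 is mounted. As a rough sketch of a check that could run before mounting, assuming `blkid` is available on the node (the helper name `has_ext4` is hypothetical and not part of Flintrock):

```shell
# Hypothetical helper: succeed only if the given block device still
# carries an ext4 filesystem signature.
has_ext4() {
    local device="$1"
    local fs_type
    # blkid prints the filesystem type, or nothing when no signature is found.
    fs_type="$(blkid -o value -s TYPE "$device" 2>/dev/null || true)"
    [ "$fs_type" = "ext4" ]
}

# Usage on a node (device name taken from the boot log above):
#   if has_ext4 /dev/nvme1n1; then
#       echo "filesystem present"
#   else
#       echo "filesystem missing; the volume would need to be formatted again"
#   fi
```

A check like this only detects the condition; how Flintrock should react to it is a separate design decision.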