nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Fail to launch flintrock using Adoptium #363

Closed. ninazhou98 closed this issue 7 months ago

ninazhou98 commented 9 months ago

Error Message: failure: repodata/repomd.xml from Adoptium: [Errno 256] No more mirrors to try. https://packages.adoptium.net/artifactory/rpm/amazonlinux/2/x86_64/repodata/repomd.xml: [Errno 14] curl#60 - "SSL certificate problem: certificate has expired"
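For triage, the yum failure above can be split into the failing URL, the curl error code, and curl's detail text. The following is only a minimal sketch: `parse_yum_curl_failure` and `CurlFailure` are hypothetical names, not part of Flintrock or yum, and the regular expression assumes the message layout shown in this report.

```python
import re
from typing import NamedTuple, Optional


class CurlFailure(NamedTuple):
    url: str
    curl_code: int
    detail: str


# Assumed layout: <url>: [Errno N] curl#<code> - "<detail>"
_PATTERN = re.compile(
    r'(?P<url>https?://\S+): \[Errno \d+\] curl#(?P<code>\d+) - "(?P<detail>[^"]*)"'
)


def parse_yum_curl_failure(message: str) -> Optional[CurlFailure]:
    """Return the URL, curl code and detail from a yum error, or None if absent."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return CurlFailure(match.group("url"), int(match.group("code")), match.group("detail"))


# The error message exactly as reported in this issue.
message = (
    "failure: repodata/repomd.xml from Adoptium: [Errno 256] No more mirrors to try. "
    "https://packages.adoptium.net/artifactory/rpm/amazonlinux/2/x86_64/repodata/repomd.xml: "
    '[Errno 14] curl#60 - "SSL certificate problem: certificate has expired"'
)
failure = parse_yum_curl_failure(message)
print(failure)
```

curl error 60 indicates that the peer's certificate could not be verified against the client's trust store, which is why the discussion below focuses on the ca-certificates package on the instance.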

gdams commented 8 months ago

Hey 👋🏼 Adoptium maintainer here. We recently fronted our Artifactory server with a Fastly CDN, and this seems to have broken things for your users. I've got a few questions I'd like to ask:

  1. Is there a reason that flintrock uses Centos 8? It's been EOL for more than 2 years.
  2. Are you able to check if you're running the latest version of the ca-certificates package before attempting to install temurin? I suspect that the default set of certificates is pretty outdated now.
nchammas commented 8 months ago

> Is there a reason that flintrock uses Centos 8? It's been EOL for more than 2 years.

Flintrock defaults to Amazon Linux 2. Some users may opt to use their own AMIs with Flintrock, which they are free to do as long as the distribution is in the Fedora/yum family of distributions, including CentOS.

> Are you able to check if you're running the latest version of the ca-certificates package before attempting to install temurin? I suspect that the default set of certificates is pretty outdated now.

How would users do that? sudo yum update ca-certificates?

nchammas commented 8 months ago

For the record, I do not have any issues launching new clusters with ami-0cabc39acf991f4f1, which is one of the vanilla Amazon Linux 2 AMIs.

@ninazhou98 - Can you share the AMI you are using in your Flintrock config?

ninazhou98 commented 8 months ago

> For the record, I do not have any issues launching new clusters with ami-0cabc39acf991f4f1, which is one of the vanilla Amazon Linux 2 AMIs.
>
> @ninazhou98 - Can you share the AMI you are using in your Flintrock config?

Thank you for helping out!

nchammas commented 8 months ago
  1. (a) What region is that AMI in and (b) is it something you built or did you get it from somewhere?
  2. Do cluster launches work if you use the default Amazon Linux 2 AMI I referenced in my previous message? It's in us-east-1.
ninazhou98 commented 8 months ago
> 1. (a) What region is that AMI in and (b) is it something you built or did you get it from somewhere?
> 2. Do cluster launches work if you use the default Amazon Linux 2 AMI I referenced in my previous message? It's in us-east-1.
  1. (a) It's in us-east-1 (b) This AMI is a customized AWS image that contains all of the dependencies required.
  2. Will check after getting permission from security team. Thank you again for the support.
Xisikka commented 8 months ago

Hi,

@nchammas, thanks for the information. I'm using my private AMI and also hit the expired SSL certificate error during the Adoptium Java installation, but using the AMI you listed here works fine. Do you have any idea what in my AMI could be causing the difference?

Thanks,

gdams commented 8 months ago

@Xisikka I suspect the custom AMIs that you are running are using outdated ca-certificates bundles. It seems that running yum update ca-certificates is fixing the problem for most users.

nchammas commented 8 months ago
> 1. (a) It's in us-east-1 (b) This AMI is a customized AWS image that contains all of the dependencies required.

I think I am not able to test out your AMI due to permissions. The AMI I posted is just a generic Amazon Linux 2 AMI published directly by Amazon.

@ninazhou98 and @Xisikka - Please report back here on whether updating ca-certificates resolves the certificate issue for you or not.

Xisikka commented 8 months ago

@nchammas @gdams Unfortunately, I tried it and yum update ca-certificates doesn't work; I'm still getting the same type of errors. The weird thing I observed on my side is that I was able to launch a few clusters, but when the number of clusters increased it started throwing the SSL expired errors associated with Adoptium.

gdams commented 8 months ago

> @nchammas @gdams Unfortunately, I tried it and yum update ca-certificates doesn't work; I'm still getting the same type of errors. The weird thing I observed on my side is that I was able to launch a few clusters, but when the number of clusters increased it started throwing the SSL expired errors associated with Adoptium.

Are you sure that the yum update ca-certificates command is always running before the Temurin install?

Xisikka commented 8 months ago

@gdams

Hi @gdams @nchammas

Here is the function in the Flintrock source code that installs Temurin. I added sudo yum -y update ca-certificates above sudo yum install -q -y {jp}; I think this should be the right place to update ca-certificates? Please let me know if that looks correct to you. Thanks!

def ensure_java(client: paramiko.client.SSHClient, java_version: int):
    """
    Ensures that Java is available on the machine and that it has a
    version of at least java_version.

    The specified version of Java will be installed if it does not
    exist or the existing version has a major version lower than java_version.

    :param client:
    :param java_version:
        minimum version of Java required
    :return:
    """
    host = client.get_transport().getpeername()[0]
    installed_java_version = get_installed_java_version(client)

    if installed_java_version == java_version:
        logger.info("Java {j} is already installed, skipping Java install".format(j=installed_java_version))
        return

    if installed_java_version and installed_java_version > java_version:
        logger.warning("""
            Existing Java {j} installation is newer than the configured version {java_version}.
            Your applications will be executed with Java {j}.
            Please choose a different AMI if this does not work for you.
            """.format(j=installed_java_version, java_version=java_version))
        return

    if installed_java_version and installed_java_version < java_version:
        logger.info("""
                Existing Java {j} will be upgraded to Adoptium OpenJDK {java_version}
                """.format(j=installed_java_version, java_version=java_version))

    # We will install Adoptium OpenJDK because it gives us access to Java 8 through 15
    # Right now, Amazon Extras only provides Corretto Java 8, 11 and 15
    logger.info("[{h}] Installing Adoptium OpenJDK Java {j}...".format(h=host, j=java_version))

    install_adoptium_repo(client)  # adds the yum repo definition that points at the Adoptium package repository
    java_package = "temurin-{j}-jdk".format(j=java_version)
    ssh_check_output(
        client=client,
        command="""
            set -e

            # Install Java first to protect packages that depend on Java from being removed.

            sudo yum -y update ca-certificates  # <-- added this line

            sudo yum install -q -y {jp}

            # Remove any older versions of Java to force the default Java to the requested version.
            # We don't use /etc/alternatives because it does not seem to update links in /usr/lib/jvm correctly,
            # and we don't just rely on JAVA_HOME because some programs use java directly in the PATH.
            sudo yum remove -y java-1.6.0-openjdk java-1.7.0-openjdk

            sudo sh -c "echo export JAVA_HOME=/usr/lib/jvm/{jp} >> /etc/environment"
            source /etc/environment
        """.format(jp=java_package))
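
As a way to check the ordering gdams asked about, the command sequence can be built by a small function and inspected before it is sent over SSH. This is only a sketch of that idea: `build_java_install_script` is a hypothetical helper, not an existing Flintrock function, and it covers just the two yum steps from the snippet above.

```python
import textwrap


def build_java_install_script(java_package: str) -> str:
    """Return a shell script that refreshes CA certificates, then installs Java."""
    return textwrap.dedent("""\
        set -e
        # Refresh the trust store first. With `set -e`, a failure here aborts
        # the script before the Temurin install is attempted.
        sudo yum -y update ca-certificates
        sudo yum install -q -y {jp}
    """).format(jp=java_package)


script = build_java_install_script("temurin-11-jdk")
print(script)
```

Because both commands live in one script under set -e, the install cannot run unless the certificate update exited successfully. This does not show whether the update actually changes the trust store on a given AMI.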
nchammas commented 8 months ago

@Xisikka - That looks correct to me.

@Xisikka / @ninazhou98 - If you would still like to investigate why your own AMIs are not working, I will need access to launch my own clusters with those AMIs so I can investigate.

If using a plain Amazon Linux 2 AMI solves your problem (maybe you can recreate your custom AMIs from that starting point?), then please let me know so I can close this issue.

The way I find Amazon's own default AMIs is using this AWS CLI query:

aws ec2 describe-images \
    --owners amazon \
    --filters \
        "Name=name,Values=amzn2-ami-hvm-*-gp2" \
        "Name=root-device-type,Values=ebs" \
        "Name=virtualization-type,Values=hvm" \
        "Name=architecture,Values=x86_64" \
    --query \
        'reverse(sort_by(Images, &CreationDate))[:100].{CreationDate:CreationDate,ImageId:ImageId,Name:Name,Description:Description}'
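
For anyone scripting this instead of using the CLI, roughly the same query can be sketched in Python with boto3. The function names below are hypothetical, the default region is an assumption taken from this thread, and boto3 is imported lazily so the sorting helper can be tried without AWS credentials.

```python
from typing import Any, Dict, List

# Same filters as the AWS CLI query above.
AMAZON_LINUX_2_FILTERS = [
    {"Name": "name", "Values": ["amzn2-ami-hvm-*-gp2"]},
    {"Name": "root-device-type", "Values": ["ebs"]},
    {"Name": "virtualization-type", "Values": ["hvm"]},
    {"Name": "architecture", "Values": ["x86_64"]},
]

FIELDS = ("CreationDate", "ImageId", "Name", "Description")


def newest_images(images: List[Dict[str, Any]], limit: int = 100) -> List[Dict[str, Any]]:
    """Sort images newest first and keep only the fields the CLI query selects."""
    # CreationDate is an ISO 8601 string, so string order matches time order.
    ordered = sorted(images, key=lambda image: image["CreationDate"], reverse=True)
    return [{field: image.get(field) for field in FIELDS} for image in ordered[:limit]]


def list_amazon_linux_2_amis(region: str = "us-east-1", limit: int = 100) -> List[Dict[str, Any]]:
    """Query EC2 for Amazon-owned Amazon Linux 2 AMIs (requires boto3 and credentials)."""
    import boto3  # imported here so newest_images() works without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_images(Owners=["amazon"], Filters=AMAZON_LINUX_2_FILTERS)
    return newest_images(response["Images"], limit)
```

Calling list_amazon_linux_2_amis() returns up to 100 of the most recent matching images; the first entry is the newest one.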
nchammas commented 7 months ago

Happy to reopen this issue if there is more information about the cause of this problem, or if I can access the AMIs that were previously reported to be a problem.