nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

SPARK_PUBLIC_DNS is incorrectly set when launching into a private VPC #346

Open maxpoulain opened 2 years ago

maxpoulain commented 2 years ago

Hi,

We are having issues when using Flintrock, notably in the Spark UI. For more context on our use of Flintrock:

When we go to the Spark UI on port 8080, the page is displayed correctly but the links to the other pages are broken. Here you can find an extract from the HTML code of the page for a link to a worker.

<a href="http://<?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot;?>
<!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot;
         &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;>
<html xmlns=&quot;http://www.w3.org/1999/xhtml&quot; xml:lang=&quot;en&quot; lang=&quot;en&quot;>
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>:8081">
              worker-20211015082848-<MASKED_IP>-42451
            </a>

I get a similar error message when I launch spark-shell, for example:

Spark context Web UI available at http://<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>:4040

I have the impression that the error comes from a problem finding the IP address, or something related. Maybe we have made a mistake in our configuration, or maybe it's not related to Flintrock.

So, if you have any clues or elements to solve this problem, this would be a great help for us.

Thank you in advance for your help,

Maxime

nchammas commented 2 years ago

That looks pretty weird. So instead of the link pointing to an IP address or host name, it literally points to a block of HTML?

Can you share your Flintrock config? Do you see the same behavior if you launch a cluster on a public VPC?

maxpoulain commented 2 years ago

Yes, it's pretty weird, and yes, it seems to point to a block of HTML instead of an IP address or host name.

Here is our Flintrock config:

services:
  spark:
    version: 2.2.0
    download-source: s3://our-bucket/spark-related-packages/
  hdfs:
    version: 2.7.3
    download-source: s3://our-bucket/spark-related-packages/

provider: ec2

providers:
  ec2:
    key-name: key
    identity-file: key.pem
    instance-type: m5.2xlarge
    region: eu-west-1
    availability-zone: eu-west-1c
    ami: our-custom-ami # Based on Amazon Linux 2 AMI
    user: ec2-user
    spot-price: 0.4
    vpc-id: our-vpc-id
    subnet-id: our-subnet-id
    instance-profile-name: our-role
    tags:
        - TEAM,DATA
    min-root-ebs-size-gb: 120
    tenancy: default 
    ebs-optimized: no
    instance-initiated-shutdown-behavior: terminate
    authorize-access-from:
      - X.X.X.X/8
      - Y.Y.Y.Y/8

launch:
  num-slaves: 3
  install-hdfs: True
  install-spark: True
  java-version: 8

I just tried to launch a cluster on a public VPC and it works without any error! So it seems to be related to the private VPC.

nchammas commented 2 years ago

Is it just the UI that's broken? I would expect something to be wrong with the cluster too.

Can you post the full contents of the files under spark/conf on the cluster master (in the case where the UI is broken)?

maxpoulain commented 2 years ago

I think I just found the problem inside spark/conf/spark-env.sh! There is a curl call that sets SPARK_PUBLIC_DNS, but it returns:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>
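By default curl prints the response body and exits successfully even on an HTTP 404, so a command substitution `$(curl ...)` captures the error page as if it were a hostname. A minimal sketch of a sanity check on such a value follows; `is_plausible_hostname` is a hypothetical helper for illustration, not part of Flintrock or Spark.

```shell
#!/usr/bin/env bash

# Hypothetical check: succeed only if the value is non-empty and consists
# solely of letters, digits, dots, and hyphens (so HTML, spaces, and
# newlines are rejected).
is_plausible_hostname() {
    case "$1" in
        "" | *[!A-Za-z0-9.-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# First line of the 404 page shown above, used as a sample value.
value='<?xml version="1.0" encoding="iso-8859-1"?>'
if is_plausible_hostname "$value"; then
    echo "looks like a hostname"
else
    echo "not a hostname"
fi
```

This only checks the character set; it does not verify that the name resolves.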

Here is spark/conf/spark-env.sh file:

#!/usr/bin/env bash

export SPARK_LOCAL_DIRS="/media/root/spark"

# Standalone cluster options
export SPARK_EXECUTOR_INSTANCES="1"
export SPARK_EXECUTOR_CORES="$(($(nproc) / 1))"
export SPARK_WORKER_CORES="$(nproc)"

export SPARK_MASTER_HOST="<masked_master_hostname>"

# TODO: Make this dependent on HDFS install.
export HADOOP_CONF_DIR="$HOME/hadoop/conf"

# TODO: Make this non-EC2-specific.
# Bind Spark's web UIs to this machine's public EC2 hostname
export SPARK_PUBLIC_DNS="$(curl --silent http://169.254.169.254/latest/meta-data/public-hostname)"

# TODO: Set a high ulimit for large shuffles
# Need to find a way to do this, since "sudo ulimit..." doesn't fly.
# Probably need to edit some Linux config file.
# ulimit -n 1000000

# Should this be made part of a Python service somehow?
export PYSPARK_PYTHON="python3"

maxpoulain commented 2 years ago

It seems that http://169.254.169.254/latest/meta-data/public-hostname is not working, no? When I run curl http://169.254.169.254/latest/meta-data/ I get:

ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
events/
hostname
iam/
identity-credentials/
instance-action
instance-id
instance-life-cycle
instance-type
local-hostname
local-ipv4
mac
metrics/
network/
placement/
profile
public-keys/
reservation-id
security-groups

There is no public-hostname entry!
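The same check can be scripted. A minimal sketch, assuming a hypothetical helper named `has_metadata_key` that reads a metadata listing from stdin; the sample listing is abbreviated from the output above so the snippet runs without network access.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed if the listing on stdin contains the given
# key as a whole line (-x), matched literally (-F), without printing (-q).
has_metadata_key() {
    grep -qxF -- "$1"
}

# Abbreviated sample of the listing; on an instance this would come from:
#   curl --silent http://169.254.169.254/latest/meta-data/
listing="$(printf '%s\n' hostname local-hostname local-ipv4)"

if printf '%s\n' "$listing" | has_metadata_key public-hostname; then
    echo "public-hostname is available"
else
    echo "public-hostname is missing"
fi
```

Whole-line matching matters here: a plain substring search for `hostname` would also match `local-hostname`.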

nchammas commented 2 years ago

OK, it sounds like we need to understand how to set SPARK_PUBLIC_DNS when launching into a private VPC. Do things work if it's just left unset?

maxpoulain commented 2 years ago

I just tried to launch a new cluster into a private VPC after commenting out the SPARK_PUBLIC_DNS line, like this:

# export SPARK_PUBLIC_DNS="$(curl --silent http://169.254.169.254/latest/meta-data/public-hostname)"

And it seems to work perfectly! The previous error no longer appears.

nchammas commented 2 years ago

OK great. Maybe we don't need this config at all anymore, or maybe we only need it when launching into a public VPC.
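One way the second option could look in spark-env.sh is sketched below. This is an illustration only, not Flintrock's actual fix: the helper names `fetch_public_hostname` and `export_public_dns_if_present` are hypothetical, and the 2-second timeout is an arbitrary assumption. It relies on curl's `--fail` flag, which makes curl exit non-zero and print no body on HTTP errors such as the 404 seen in a private VPC.

```shell
#!/usr/bin/env bash

# Print the instance's public hostname, or nothing if it has none.
# --fail suppresses the error page body on HTTP errors such as 404;
# --max-time bounds the wait if the metadata service is unreachable.
fetch_public_hostname() {
    curl --silent --fail --max-time 2 \
        "http://169.254.169.254/latest/meta-data/public-hostname" || true
}

# Hypothetical helper: export SPARK_PUBLIC_DNS only for a non-empty value,
# leaving it unset otherwise.
export_public_dns_if_present() {
    if [ -n "$1" ]; then
        export SPARK_PUBLIC_DNS="$1"
    fi
}

export_public_dns_if_present "$(fetch_public_hostname)"
```

With this shape, instances without a public hostname end up in the same state as the commented-out line tested above, while instances in a public VPC keep the existing behavior.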