mrn-aglic / pyspark-playground

Spark History #5

Open marcelo-franceschini opened 2 weeks ago

marcelo-franceschini commented 2 weeks ago

Thank you for sharing this repository.

I've been experimenting with it and encountered an issue with the Spark History Server. I've tried adjusting some environment configurations for both the master and worker nodes, but I still can't get it to display the logs.

Here's my Docker Compose file:

services:
  spark-master:
    container_name: da-spark-master
    user: root
    build: .
    image: da-spark-image
    entrypoint: ['./entrypoint.sh', 'master']
    environment:
      - SPARK_EVENTLOG_ENABLED=true
      - SPARK_EVENTLOG_DIR=/opt/spark/spark-events
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080"]
      interval: 5s
      timeout: 3s
      retries: 3
    volumes:
      - ./book_data:/opt/spark/data
      - ./spark_apps:/opt/spark/apps
      - /mnt/c/Users/boss/Desktop/logs:/opt/spark/spark-events
      - /mnt/d/DADOS:/home/jovyan/work
    env_file:
      - .env.spark
    ports:
      - '9090:8080'
      - '7077:7077'

  spark-history-server:
    container_name: da-spark-history
    user: root
    image: da-spark-image
    entrypoint: ['./entrypoint.sh', 'history']
    environment:
      - SPARK_EVENTLOG_ENABLED=true
      - SPARK_EVENTLOG_DIR=/opt/spark/spark-events
    depends_on:
      - spark-master
    env_file:
      - .env.spark
    volumes:
      - /mnt/c/Users/boss/Desktop/logs:/opt/spark/spark-events
      - /mnt/d/DADOS:/home/jovyan/work
    ports:
      - '18080:18080'

  spark-worker:
    image: da-spark-image
    user: root
    entrypoint: ['./entrypoint.sh', 'worker']
    environment:
      - SPARK_EVENTLOG_ENABLED=true
      - SPARK_EVENTLOG_DIR=/opt/spark/spark-events
    depends_on:
      - spark-master
    env_file:
      - .env.spark
    volumes:
      - ./book_data:/opt/spark/data
      - ./spark_apps:/opt/spark/apps
      - /mnt/c/Users/boss/Desktop/logs:/opt/spark/spark-events
      - /mnt/d/DADOS:/home/jovyan/work

  jupyter:
    image: quay.io/jupyter/pyspark-notebook:latest
    user: root
    container_name: Jupyter_Notebook
    depends_on:
      - spark-master
    ports:
      - '8888:8888'
      - '4040:4040'
    volumes:
      - /mnt/d/DADOS:/home/jovyan/work
    environment:
      - JUPYTER_ENABLE_LAB=yes
      - PYSPARK_PYTHON=python3
      - SPARK_HOME=/usr/local/spark
      - PYSPARK_DRIVER_PYTHON=jupyter
      - PYSPARK_DRIVER_PYTHON_OPTS=notebook
      - USE_JUPYTER_LAB=yes
    restart: always
    command: start-notebook.sh --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.notebook_dir='/home/jovyan/work'

Thanks!

mrn-aglic commented 2 weeks ago

I’m not sure why the path in the volume starts with /mnt. Also, note that I haven’t used Windows in a couple of years :-)

marcelo-franceschini commented 2 weeks ago

/mnt/c/Users/boss/Desktop/logs corresponds to C:\Users\Boss\Desktop\logs on WSL2. I also tried running it on my Debian machine, but I couldn't get the Spark History Server to display the logs there either.

mrn-aglic commented 2 weeks ago

Have you tried exec-ing into the container to check whether the logs are written there? If they are, that would indicate a volume mapping issue.
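
For the check itself, something like the following could be run from a Python shell inside the container. It is only a minimal sketch: `list_event_logs` is a hypothetical helper, not part of this repository, and it assumes the image has Python available and that the event log directory is the `/opt/spark/spark-events` path from the compose file above.

```python
import os


def list_event_logs(log_dir):
    """Return (name, size_in_bytes) for each entry in log_dir, sorted by name.

    Hypothetical helper for a quick manual check; sub-directories are
    reported with size 0.
    """
    if not os.path.isdir(log_dir):
        raise FileNotFoundError(f"{log_dir} does not exist or is not a directory")
    entries = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        size = os.path.getsize(path) if os.path.isfile(path) else 0
        entries.append((name, size))
    return entries
```

Calling `list_event_logs("/opt/spark/spark-events")` in both the history server container and the container that runs your application should show the same entries if the volume is shared correctly. An empty list after a job has finished would suggest the application is not writing event logs to that directory at all.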