josephmachado / efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course
https://josephmachado.podia.com/efficient-data-processing-in-spark
216 stars, 51 forks

Spark stuck while using the built-in Java classes as a fallback #8

Closed: giuliosmall closed this issue 1 month ago

giuliosmall commented 1 month ago

@josephmachado

Description:

I am encountering an issue where my Spark job hangs indefinitely without performing any tasks. The job does not show any critical errors in the logs, but it seems to be stuck during execution. Additionally, I am unable to reach the Spark UI at http://localhost:4040/, which may be related to the problem.

Steps to Reproduce:

  1. Run the Spark job using the command: make cr
  2. The job begins execution, with dependencies being resolved successfully.
  3. The job does not proceed to perform any Spark tasks and appears to be stuck.
  4. Attempting to access the Spark UI at http://localhost:4040/ fails, indicating that the driver may not be running or is not accessible.

Observed Behavior:

• The job initializes and resolves dependencies but then hangs indefinitely without performing any Spark tasks.
• The Spark UI at http://localhost:4040/ is not accessible, possibly indicating an issue with the driver process.

Expected Behavior:

• The Spark job should proceed with execution after dependencies are resolved.
• The Spark UI should be accessible at http://localhost:4040/ or another appropriate port if 4040 is in use.
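To check whether the UI came up on a port other than 4040, a small probe can help. This is a minimal sketch, not part of the course repository: the function name find_open_ports and its defaults are hypothetical. It assumes Spark's usual behavior of trying 4040 first and then successive ports (up to spark.port.maxRetries, 16 by default), and that the container publishes those ports to the host.

```python
import socket


def find_open_ports(host="localhost", first_port=4040, attempts=16, timeout=0.5):
    """Return the ports in [first_port, first_port + attempts) that accept a TCP connection.

    Hypothetical helper: a successful connection only shows that something is
    listening on the port, not that the listener is the Spark UI.
    """
    open_ports = []
    for port in range(first_port, first_port + attempts):
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            continue
    return open_ports


if __name__ == "__main__":
    ports = find_open_ports()
    if ports:
        print("Listening ports in the Spark UI range:", ports)
    else:
        print("Nothing is listening on 4040-4055; the driver may not have started.")
```

An empty result while the job appears to be running suggests the driver never got far enough to start its UI, which matches the hang described above.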

Logs:

(make cr)
Enter pyspark relative path:data-processing-spark/4-data-processing/2-app-job-stage-task/spark_app_anatomy.py
:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
org.postgresql#postgresql added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-cc6f9c86-4d44-4f62-826f-49cd7a6fe587;1.0
    confs: [default]
    found io.delta#delta-core_2.12;2.3.0 in central
    found io.delta#delta-storage;2.3.0 in central
    found org.antlr#antlr4-runtime;4.8 in central
    found org.apache.hadoop#hadoop-aws;3.3.2 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.1026 in central
    found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
    found org.postgresql#postgresql;42.7.3 in central
    found org.checkerframework#checker-qual;3.42.0 in central
:: resolution report :: resolve 513ms :: artifacts dl 16ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.1026 from central in [default]
    io.delta#delta-core_2.12;2.3.0 from central in [default]
    io.delta#delta-storage;2.3.0 from central in [default]
    org.antlr#antlr4-runtime;4.8 from central in [default]
    org.apache.hadoop#hadoop-aws;3.3.2 from central in [default]
    org.checkerframework#checker-qual;3.42.0 from central in [default]
    org.postgresql#postgresql;42.7.3 from central in [default]
    org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   8   |   0   |   0   |   0   ||   8   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-cc6f9c86-4d44-4f62-826f-49cd7a6fe587
    confs: [default]
    0 artifacts copied, 8 already retrieved (0kB/9ms)
24/08/28 07:48:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

giuliosmall commented 1 month ago

After some searching, I managed to solve it.

If you are using a Mac with Apple silicon (M1 or later), replace the line FROM deltaio/delta-docker:latest in data-processing-spark/1-lab-setup/containers/spark/Dockerfile with FROM deltaio/delta-docker:latest_arm64, so that the container uses the ARM64 variant of the image.
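
To decide which of the two FROM lines applies on a given machine, the host architecture can be checked from Python. This is a rough sketch: the helper name delta_image_for is hypothetical, and the only image tags it knows are the two mentioned in this issue. Note that platform.machine() reports the architecture of the running Python interpreter, so a Python build running under Rosetta on Apple silicon may report x86_64.

```python
import platform


def delta_image_for(machine=None):
    """Return the base image tag from this issue that matches the CPU architecture.

    Hypothetical helper: `machine` defaults to platform.machine(), which is
    typically "arm64" on Apple silicon macOS and "aarch64" on ARM Linux.
    """
    machine = (machine or platform.machine()).lower()
    if machine in ("arm64", "aarch64"):
        return "deltaio/delta-docker:latest_arm64"
    return "deltaio/delta-docker:latest"


if __name__ == "__main__":
    # Print the line to put at the top of the Spark Dockerfile.
    print("FROM " + delta_image_for())
```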