kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Pyspark module not found: what am I doing wrong? #1614

Open benczekristof opened 1 year ago

benczekristof commented 1 year ago

Hi All, I would like to ask a question regarding the Python support of a SparkApplication: I am trying to run a SparkApplication where I mount the PySpark Python file from a ConfigMap into the driver and executor pods. The mount works, but when the spark-submit happens, the PySpark file fails with the following error message:

File "/opt/spark/examples/src/main/python/pyspark.py", line 4, in <module>
    from pyspark.sql import SparkSession
  File "/opt/spark/examples/src/main/python/..2022_09_23_07_11_05.3779784609/pyspark.py", line 4, in <module>
    from pyspark.sql import SparkSession

ModuleNotFoundError: No module named 'pyspark.sql'; 'pyspark' is not a package

I believe this shouldn't happen, because pyspark is part of the base image. Also, the Python example (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/examples/spark-py-pi.yaml) uses the same import (from pyspark.sql import SparkSession) and executes flawlessly. Could you please let me know what I am doing wrong? My SparkApplication manifest:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: pyspark-configmap
  namespace: spark
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "gcr.io/spark-operator/spark-py:v3.1.1"
  imagePullPolicy: Always
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pyspark.py
  sparkVersion: "3.1.1"
  volumes:
    - name: spark-file
      configMap:
        name: pyspark-read-stuff
        items:
          - key: pyspark.py
            path: pyspark.py
    - name: csv-file
      configMap:
        name: samplecsv
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: spark-apps
    volumeMounts:
      - name: spark-file
        mountPath: /opt/spark/examples/src/main/python
      - name: csv-file
        mountPath: /content
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: spark-apps
    volumeMounts:
      - name: spark-file
        mountPath: /opt/spark/examples/src/main/python/pyspark.py
      - name: csv-file
        mountPath: /content
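
One thing stands out in the manifest above: the driver mounts the ConfigMap at the directory /opt/spark/examples/src/main/python, while the executor mounts it at the file path /opt/spark/examples/src/main/python/pyspark.py. Either way, the script lands on the interpreter's path under the name pyspark.py, where it shadows the installed pyspark package. Below is a minimal sketch of a non-colliding layout; the file name read_csv.py and the mount path /opt/spark/work-dir are illustrative assumptions, not operator requirements.

# Hypothetical spec fragment: give the mounted file a name that does not
# collide with the pyspark package, and mount it outside the examples tree.
mainApplicationFile: local:///opt/spark/work-dir/read_csv.py
volumes:
  - name: spark-file
    configMap:
      name: pyspark-read-stuff
      items:
        - key: pyspark.py
          path: read_csv.py            # any name other than pyspark.py avoids the shadowing
driver:
  volumeMounts:
    - name: spark-file
      mountPath: /opt/spark/work-dir   # mount the directory, same path on driver and executor
executor:
  volumeMounts:
    - name: spark-file
      mountPath: /opt/spark/work-dir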

My pyspark.py file:

import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName('Read CSV File into DataFrame').getOrCreate()
    df = spark.read.csv('/content/sample.csv', sep=',', inferSchema=True, header=True)
    df.head()
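
The file name itself is the likely culprit: the script is called pyspark.py, and CPython puts the script's own directory at the front of sys.path, so from pyspark.sql import SparkSession resolves pyspark to this very file (a plain module, not a package) instead of the installed library. That matches the traceback above, where the file appears to import itself. A quick way to confirm, sketched here as a hypothetical snippet pasted at the top of the mounted script:

# Hypothetical diagnostic: report which file the name "pyspark" resolves to
# inside the driver pod, before the failing import runs.
import importlib.util
import sys

spec = importlib.util.find_spec("pyspark")
print("pyspark resolves to:", spec.origin)   # expect the mounted pyspark.py, not site-packages
print("first sys.path entry:", sys.path[0])  # the script's directory wins the module lookup

If the printed origin is the mounted file, renaming the ConfigMap item (as in the sketch after the manifest above) should make the import resolve to the real package again.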
github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.