AbdealiLoKo opened this issue 8 months ago
I am able to get it working locally, but following the YARN documentation I am not able to make it work on the cluster. I tried:
```
gcloud dataproc jobs submit pyspark "gs://hello_world.py" \
    --project wmt-bfdms-dvhorizprod \
    --cluster=ipi-cluster-prod \
    --region=us-east4 \
    --archives 'gs://env/environment.tar.gz#environment' \
    --properties="spark.submit.deployMode=cluster,\
spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python,\
spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python"
```
and I am getting `./environment/bin/python: not found`.
This is because of symlinks. In the archive you have a symlink to the local python executable, and on the Spark cluster it is probably located somewhere else, so the symlink is invalid. You can change it with `--python-prefix`, but in the end it produces a very strange path. I was not able to force it to point to the correct one.
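To see what is going on, something like this sketch can help. The `/usr` prefix below is only an assumption about where python lives on the cluster nodes; it is not from the docs:

```
# List the python entries in the archive; venv-pack keeps bin/python as a
# symlink, so the listing shows the absolute path it points to, e.g.:
#   lrwxrwxrwx ... environment/bin/python -> /usr/local/bin/python3.7
tar -tzvf environment.tar.gz | grep 'bin/python'

# Repack with an explicit prefix for the symlink rewrite. NOTE: /usr is an
# assumed prefix; it has to match the python install on the cluster nodes.
# -f overwrites the existing archive.
venv-pack -f -o environment.tar.gz --python-prefix /usr
```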
I see documentation about Spark on YARN.
Does this also work with Spark local mode? I sometimes use Spark local mode for small jobs, and I would rather keep my environments consistent between small and large jobs...
Some documentation would be useful - if I try copying the same approach from the YARN documentation, it does not seem to pick up the venv-pack environment.
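For reference, this is the kind of thing I would expect to work in local mode - a sketch only, with assumed paths, since depending on the Spark version `--archives` may only be handled by YARN:

```
# Local mode: --archives may not be unpacked for you outside YARN, so
# extract the environment manually (paths here are assumptions)
mkdir -p environment
tar -xzf environment.tar.gz -C environment

# Driver and executors run on this machine, so pointing PYSPARK_PYTHON at
# the unpacked interpreter is enough; no spark.yarn.appMasterEnv.* needed
export PYSPARK_PYTHON="$(pwd)/environment/bin/python"

spark-submit --master 'local[*]' hello_world.py
```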