Closed zw615 closed 1 year ago
By the way, I also tried specifying the Python interpreter at run time via PEX_PYTHON, but that did not work either.
Update: I tried to build my own .pex file for the Python environment using the following commands.
sudo apt update
sudo apt install -y python3.8-venv
python3 -m venv ~/venv/img2dataset_pyspark
. ~/venv/img2dataset_pyspark/bin/activate
pip3 install -U pip
pip3 install img2dataset==1.41.0 tensorflow==2.11 tensorflow_io==0.31.0 pex==2.1.131
pip3 install wandb==0.12.17 scipy==1.9.1 gcsfs==2022.11.0
pex pex==2.1.131 pyspark==3.2.0 img2dataset==1.41.0 gcsfs==2022.11.0 tensorflow==2.12.0 tensorflow_io==0.32.0 wandb==0.12.17 protobuf==3.20.3 scipy==1.9.1 -o img2dataset_pyspark.pex
I can verify that this new img2dataset_pyspark.pex does support output_format="tfrecord", but only with distributor=multiprocessing. Strangely, the program still fails when downloading in distributed mode with distributor=pyspark.
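A quick way to see which interpreter actually has the tfrecord dependencies is to ask it to locate the modules without importing them. The sketch below is only illustrative: the function name check_modules is made up for this example, and running it against a .pex file assumes the PEX was built without an entry point, so that it behaves like a Python interpreter and accepts -c.

```shell
#!/bin/sh
# check_modules is a hypothetical helper, not part of img2dataset or pex.
# Usage: check_modules <interpreter> <module> [<module> ...]
# Prints "<module>: ok" or "<module>: MISSING" for each top-level module.
check_modules() {
  py="$1"
  shift
  for mod in "$@"; do
    # find_spec returns None for a top-level module that cannot be found,
    # so nothing heavy (such as tensorflow) is actually imported here.
    if "$py" -c 'import importlib.util, sys; sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)' "$mod"; then
      echo "$mod: ok"
    else
      echo "$mod: MISSING"
    fi
  done
}
```

For example, `check_modules ./img2dataset_pyspark.pex tensorflow tensorflow_io` on the driver and on one worker node would show whether both see the same packages, which may help narrow down why distributor=pyspark behaves differently from distributor=multiprocessing.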
Hey, check the build-pex target in the Makefile. You can reuse that to create a new PEX.
I didn't include tfrecord by default because tensorflow is heavy.
Thanks! I added tfrecord dependencies based on the build-pex in the Makefile and succeeded!
Hi, thanks for the amazing work!
I have been trying to download LAION-5B following the Distributed img2dataset tutorial, but failed. The reason, as far as I can tell, is that the pex environment available at https://github.com/rom1504/img2dataset/releases/latest/download/img2dataset.pex does not contain necessary packages such as tensorflow or tensorflow_io, and I have been trying to use output_format="tfrecord". Simply switching to output_format="webdataset" leads to successful distributed downloading with a significant speedup.
I wonder if there is a way to rebuild img2dataset.pex with tfrecord support? Or can we simply skip the pex approach, manually install the same Python environment on all master/worker nodes, and call that Python env? All of my master/worker nodes run Ubuntu 20.04 LTS.
Thanks a lot!
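For the second option (installing the same environment on every node by hand), a dry-run sketch like the one below could be a starting point. It only prints the ssh commands instead of running them. The function name print_install_commands and the host names are placeholders, and the venv path and version pins are simply the ones used elsewhere in this thread, not a tested combination. Spark would also need to be pointed at that interpreter on each node, for example through the PYSPARK_PYTHON environment variable.

```shell
#!/bin/sh
# Assumptions: venv path and pins are copied from this thread; adjust as needed.
# The tilde is kept literal on purpose so that it expands on the remote node.
VENV_DIR='~/venv/img2dataset_pyspark'
PACKAGES='pyspark==3.2.0 img2dataset==1.41.0 gcsfs==2022.11.0 tensorflow==2.12.0 tensorflow_io==0.32.0 wandb==0.12.17 protobuf==3.20.3 scipy==1.9.1'

# print_install_commands is a hypothetical helper: it prints (does not run)
# one ssh command per host that creates the venv and installs the pins.
print_install_commands() {
  for host in "$@"; do
    printf "ssh %s 'python3 -m venv %s && %s/bin/pip install %s'\n" \
      "$host" "$VENV_DIR" "$VENV_DIR" "$PACKAGES"
  done
}
```

Calling `print_install_commands master worker-1 worker-2` prints three commands that can be reviewed before being run; each node would need the python3.8-venv package installed first, as in the commands above.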