Official .pex File Does not Support output_format="tfrecord"

zw615 commented 1 year ago

Hi, thanks for the amazing work!

I have been trying to download LAION-5B following the Distributed img2dataset tutorial, but it failed. The reason, I discovered, is that the pex environment available at https://github.com/rom1504/img2dataset/releases/latest/download/img2dataset.pex does not contain necessary packages such as tensorflow or tensorflow_io, and I have been trying to use output_format="tfrecord". Simply switching to output_format="webdataset" leads to successful distributed downloading with a significant speedup.
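One way to confirm this kind of problem is to ask the interpreter itself whether the two packages can be found. The following is only a minimal sketch using the standard library; the helper name `missing_modules` and the list `TFRECORD_MODULES` are made up for illustration and are not part of img2dataset or pex. It assumes the script is run with the same Python environment (or .pex) that the workers use.

```python
import importlib.util

# Import names the post above reports as missing from the official .pex.
TFRECORD_MODULES = ["tensorflow", "tensorflow_io"]


def missing_modules(names):
    """Return the top-level module names the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(TFRECORD_MODULES)
    if missing:
        print("tfrecord output would likely fail; missing:", ", ".join(missing))
    else:
        print("tensorflow and tensorflow_io are importable")
```

The check only tells you whether the modules are visible to that interpreter; it does not verify that they import cleanly or that their versions are compatible.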

I wonder if there is a way to rebuild an img2dataset.pex with tfrecord support? Or can we simply skip the pex approach, manually install the same Python environment on all master/worker nodes, and call that Python environment instead? All of my master/worker nodes run Ubuntu 20.04 LTS.

Thanks a lot!

zw615 commented 1 year ago

By the way, I have tried specifying the Python interpreter at run time using PEX_PYTHON, but without any luck.

zw615 commented 1 year ago

Update: I tried to build my own .pex file for the Python environment using the following commands.

sudo apt update
sudo apt install -y python3.8-venv
python3 -m venv ~/venv/img2dataset_pyspark
. ~/venv/img2dataset_pyspark/bin/activate
pip3 install -U pip
pip3 install img2dataset==1.41.0 tensorflow==2.11 tensorflow_io==0.31.0 pex==2.1.131
pip3 install wandb==0.12.17 scipy==1.9.1 gcsfs==2022.11.0
pex pex==2.1.131 pyspark==3.2.0 img2dataset==1.41.0 gcsfs==2022.11.0 tensorflow==2.12.0 tensorflow_io==0.32.0 wandb==0.12.17 protobuf==3.20.3 scipy==1.9.1 -o img2dataset_pyspark.pex

I can verify that this new img2dataset_pyspark.pex does support output_format="tfrecord", but only with distributor=multiprocessing. Strangely, distributed downloading with distributor=pyspark still fails.
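As a side note, the commands above pin different tensorflow and tensorflow_io versions in the venv install and in the pex build. Whether that has anything to do with the pyspark failure is not established in this thread, but a small sketch like the one below can surface such differences. The helper names `parse_pins` and `pin_conflicts` are hypothetical, and the parsing only handles simple `name==version` tokens.

```python
def parse_pins(command):
    """Collect name==version tokens from a pip/pex command line into a dict."""
    pins = {}
    for token in command.split():
        if "==" in token:
            name, version = token.split("==", 1)
            # Treat tensorflow_io and tensorflow-io as the same project name.
            pins[name.lower().replace("_", "-")] = version
    return pins


def pin_conflicts(first, second):
    """Return {name: (version_in_first, version_in_second)} for differing pins."""
    a, b = parse_pins(first), parse_pins(second)
    return {n: (a[n], b[n]) for n in sorted(a.keys() & b.keys()) if a[n] != b[n]}


# The two command lines are copied from the commands in this comment.
venv_cmd = (
    "pip3 install img2dataset==1.41.0 tensorflow==2.11 "
    "tensorflow_io==0.31.0 pex==2.1.131"
)
pex_cmd = (
    "pex pex==2.1.131 pyspark==3.2.0 img2dataset==1.41.0 gcsfs==2022.11.0 "
    "tensorflow==2.12.0 tensorflow_io==0.32.0 wandb==0.12.17 "
    "protobuf==3.20.3 scipy==1.9.1 -o img2dataset_pyspark.pex"
)

if __name__ == "__main__":
    for name, (venv_version, pex_version) in pin_conflicts(venv_cmd, pex_cmd).items():
        print(f"{name}: venv pins {venv_version}, pex pins {pex_version}")
```

Since the pex command lists its own requirements, the versions that end up inside the .pex are the ones on the `pex` line, not the ones installed into the venv.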

rom1504 commented 1 year ago

Hey, check the build-pex target in the Makefile. You can reuse that to create a new PEX.

I didn't include tfrecord by default because tensorflow is heavy.

zw615 commented 1 year ago

Thanks! I added tfrecord dependencies based on the build-pex in the Makefile and succeeded!

rom1504 / img2dataset

Official .pex File Does not Support output_format="tfrecord" #286