rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License

Official .pex File Does not Support output_format="tfrecord" #286

Closed zw615 closed 1 year ago

zw615 commented 1 year ago

Hi, thanks for the amazing work!

I have been trying to download LAION-5B following the Distributed img2dataset tutorial, but it failed. The reason, I discovered, is that the pex environment available at https://github.com/rom1504/img2dataset/releases/latest/download/img2dataset.pex does not contain necessary packages such as tensorflow or tensorflow_io, and I have been trying to use output_format="tfrecord". Simply switching to output_format="webdataset" leads to successful distributed downloading with a significant speedup.

I wonder if there is a way to rebuild an img2dataset.pex with tfrecord support? Or can we simply skip the pex approach, manually install the same Python environment on all master/worker nodes, and call that Python environment instead? All of my master/worker nodes run Ubuntu 20.04 LTS.
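A quick way to confirm this kind of diagnosis is to ask the interpreter in use which modules it can find. The following is a minimal sketch and not part of img2dataset: the helper name `missing_modules` is hypothetical, and the two module names are simply the ones mentioned above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Modules named in this issue as required for output_format="tfrecord".
    missing = missing_modules(["tensorflow", "tensorflow_io"])
    if missing:
        print("tfrecord output unavailable, missing:", ", ".join(missing))
    else:
        print("tensorflow and tensorflow_io are importable")
```

The result only describes the interpreter that runs the script, so it is meaningful only when run in the same environment the download job uses (for a pex file, that means running it through the pex itself, for example with pex's PEX_INTERPRETER runtime variable).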

Thanks a lot!

zw615 commented 1 year ago

By the way, I have tried specifying the Python interpreter at run time using PEX_PYTHON, but without any luck.

zw615 commented 1 year ago

Update: I tried to make my own .pex file for the Python environment using the following commands.

sudo apt update
sudo apt install -y python3.8-venv
python3 -m venv ~/venv/img2dataset_pyspark
. ~/venv/img2dataset_pyspark/bin/activate
pip3 install -U pip
pip3 install img2dataset==1.41.0 tensorflow==2.11 tensorflow_io==0.31.0 pex==2.1.131
pip3 install wandb==0.12.17 scipy==1.9.1 gcsfs==2022.11.0
pex pex==2.1.131 pyspark==3.2.0 img2dataset==1.41.0 gcsfs==2022.11.0 tensorflow==2.12.0 tensorflow_io==0.32.0 wandb==0.12.17 protobuf==3.20.3 scipy==1.9.1 -o img2dataset_pyspark.pex

I can verify that this new img2dataset_pyspark.pex does support output_format="tfrecord", but only with distributor=multiprocessing. Strangely, distributed downloading with distributor=pyspark still fails.
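Note that the commands above pin tensorflow and tensorflow_io differently for the virtualenv and for the pex file. Whether that has anything to do with the pyspark failure is not established in this thread. The sketch below only makes such differences visible. The helper names `parse_pins` and `pin_mismatches` are hypothetical, and the requirement strings are copied from the commands above.

```python
def parse_pins(requirements):
    """Map package name -> pinned version for 'name==version' tokens."""
    pins = {}
    for token in requirements.split():
        name, sep, version = token.partition("==")
        if sep:
            # Treat '_' and '-' as equivalent in package names.
            pins[name.lower().replace("_", "-")] = version
    return pins


def pin_mismatches(a, b):
    """Return {name: (version_a, version_b)} for packages pinned differently."""
    pa, pb = parse_pins(a), parse_pins(b)
    return {n: (pa[n], pb[n]) for n in sorted(pa.keys() & pb.keys()) if pa[n] != pb[n]}


# Pins passed to pip3 install (virtualenv) in the commands above.
VENV = ("img2dataset==1.41.0 tensorflow==2.11 tensorflow_io==0.31.0 pex==2.1.131 "
        "wandb==0.12.17 scipy==1.9.1 gcsfs==2022.11.0")
# Pins passed to the pex command in the commands above.
PEX = ("pex==2.1.131 pyspark==3.2.0 img2dataset==1.41.0 gcsfs==2022.11.0 "
       "tensorflow==2.12.0 tensorflow_io==0.32.0 wandb==0.12.17 "
       "protobuf==3.20.3 scipy==1.9.1")

if __name__ == "__main__":
    for name, (venv_v, pex_v) in pin_mismatches(VENV, PEX).items():
        print(f"{name}: venv {venv_v} vs pex {pex_v}")
```

The comparison is purely textual: "2.11" and "2.11.0" would be reported as different, which is acceptable for a quick consistency check but is not real version resolution.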

rom1504 commented 1 year ago

Hey, check the build-pex target in the Makefile. You can reuse that to create a new PEX.

I didn't include tfrecord support by default because tensorflow is heavy.


zw615 commented 1 year ago

Thanks! I added tfrecord dependencies based on the build-pex in the Makefile and succeeded!