ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Train] Cannot use fullsync datapipe with Torch Trainer, which prevents multi epoch training when using torch data #30324

Closed vedantroy closed 1 year ago

vedantroy commented 2 years ago

What happened + What you expected to happen

Copied from https://discuss.ray.io/t/get-distributed-process-group-timeout-when-using-torch-trainer-fullsynciterdatapipe/8075, since I'm not sure whether this is the better place to submit bug reports.

This line (data/prefetch.py at 4ea88d1fb4d279def9213a23b054b4e7d46d5b3d · pytorch/data · GitHub) times out when using the TorchTrainer. As a result, the training script never runs, because it gets stuck initializing the data loader.

This happens when using a DataLoader built from torchdata datapipes, with the fullsync datapipe at the end. This is a significant problem because it prevents multi-epoch training when using torchdata, which is PyTorch's new data-loading mechanism (fullsync is necessary for multi-epoch training if each datapipe can have a different length).
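To illustrate why uneven lengths matter, here is a minimal single-process sketch of the idea behind fullsync, as I understand it: every rank ends the epoch as soon as the shortest shard is exhausted, so all ranks take the same number of steps. This is plain Python and not torchdata's implementation; `shard` and `synced_epoch` are hypothetical helper names made up for this example, and the real datapipe has to agree on "has next item" across processes rather than inside one loop.

```python
from typing import Iterator, List, Sequence


def shard(items: Sequence[int], rank: int, world_size: int) -> List[int]:
    """Give each rank every world_size-th item, starting at its own rank."""
    return list(items[rank::world_size])


def synced_epoch(shards: List[List[int]]) -> List[List[int]]:
    """Simulate one epoch in which all ranks stop once any rank runs dry.

    Hypothetical stand-in for the cross-process agreement step: here the
    "ranks" are just lists iterated in lockstep inside one process.
    """
    iterators: List[Iterator[int]] = [iter(s) for s in shards]
    consumed: List[List[int]] = [[] for _ in shards]
    while True:
        step = []
        for it in iterators:
            try:
                step.append(next(it))
            except StopIteration:
                # One rank is exhausted, so every rank ends the epoch here.
                return consumed
        for rank, item in enumerate(step):
            consumed[rank].append(item)


# 7 items over 2 ranks: rank 0 gets 4 items, rank 1 gets 3.
shards = [shard(range(7), rank, world_size=2) for rank in range(2)]
per_rank = synced_epoch(shards)
```

Without this kind of agreement, the rank holding the extra item would take one more step than the other rank, and any collective operation in that step would wait forever.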

Versions / Dependencies

OS:

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:    22.04
Codename:   jammy

Deps:

Package                                Version                  Editable project location
-------------------------------------- ------------------------ -------------------------
adal                                   1.2.7
aiofiles                               22.1.0
aiohttp                                3.8.3
aiohttp-cors                           0.7.0
aiorwlock                              1.3.0
aiosignal                              1.2.0
anyio                                  3.6.2
applicationinsights                    0.11.10
argcomplete                            1.12.3
asttokens                              2.1.0
async-timeout                          4.0.2
attrs                                  22.1.0
av                                     9.2.0
azure-cli-core                         2.40.0
azure-cli-telemetry                    1.0.8
azure-common                           1.1.28
azure-core                             1.26.0
azure-identity                         1.10.0
azure-mgmt-compute                     23.1.0
azure-mgmt-core                        1.3.2
azure-mgmt-network                     19.0.0
azure-mgmt-resource                    20.0.0
backcall                               0.2.0
backoff                                1.10.0
bcrypt                                 4.0.1
black                                  22.10.0
blessed                                1.19.1
boto3                                  1.26.8
boto3-stubs                            1.25.5
botocore                               1.29.8
botocore-stubs                         1.28.5
brotlipy                               0.7.0
cachetools                             5.2.0
certifi                                2022.9.24
cffi                                   1.15.1
charset-normalizer                     2.0.4
click                                  8.0.4
cloudpickle                            2.2.0
colorful                               0.5.4
commonmark                             0.9.1
conda                                  22.9.0
conda-package-handling                 1.9.0
contourpy                              1.0.5
cryptography                           38.0.1
cursor                                 1.3.5
cycler                                 0.11.0
Cython                                 0.29.26
debugpy                                1.5.1
decorator                              5.1.1
dill                                   0.3.6
distlib                                0.3.6
distributed-ml                         0.0.0                    /app
dm-tree                                0.1.7
docker-pycreds                         0.4.0
docutils                               0.19
einops                                 0.6.0
entrypoints                            0.4
executing                              1.2.0
fastapi                                0.85.1
filelock                               3.8.0
flash-attn                             0.1
flatbuffers                            22.9.24
fonttools                              4.37.4
frozenlist                             1.3.1
fsspec                                 2022.10.0
gitdb                                  4.0.9
GitPython                              3.1.29
google-api-core                        2.10.2
google-api-python-client               1.7.8
google-auth                            2.13.0
google-auth-httplib2                   0.1.0
google-oauth                           1.0.1
googleapis-common-protos               1.56.4
gpustat                                1.0.0
grpcio                                 1.50.0
gym                                    0.23.1
gym-notices                            0.0.8
h11                                    0.14.0
halo                                   0.0.29
httplib2                               0.20.4
humanfriendly                          10.0
humanize                               4.4.0
idna                                   3.4
imageio                                2.22.2
importlib-metadata                     5.0.0
importlib-resources                    5.10.0
ipykernel                              6.15.2
ipython                                8.6.0
iso8601                                1.1.0
isodate                                0.6.1
jedi                                   0.18.1
jmespath                               0.10.0
jsonschema                             4.16.0
jupyter_client                         7.3.5
jupyter_core                           4.11.2
kiwisolver                             1.4.4
knack                                  0.10.0
kopf                                   1.35.6
kubernetes                             24.2.0
libcst                                 0.4.9
log-symbols                            0.0.14
logfmt                                 0.4
lz4                                    4.0.2
matplotlib                             3.6.1
matplotlib-inline                      0.1.6
moreorless                             0.4.0
mpmath                                 1.2.1
msal                                   1.18.0b1
msal-extensions                        1.0.0
msgpack                                1.0.4
msrest                                 0.7.1
msrestazure                            0.6.4
multidict                              6.0.2
mypy-boto3-cloudformation              1.25.4
mypy-boto3-dynamodb                    1.25.0
mypy-boto3-ec2                         1.25.5
mypy-boto3-lambda                      1.25.0
mypy-boto3-rds                         1.25.1
mypy-boto3-s3                          1.25.0
mypy-boto3-sqs                         1.25.0
mypy-extensions                        0.4.3
nest-asyncio                           1.5.6
networkx                               2.8.7
numpy                                  1.23.4
nvidia-ml-py                           11.495.46
oauthlib                               3.2.2
opencensus                             0.11.0
opencensus-context                     0.1.3
opentelemetry-api                      1.1.0
opentelemetry-exporter-otlp            1.1.0
opentelemetry-exporter-otlp-proto-grpc 1.1.0
opentelemetry-proto                    1.1.0
opentelemetry-sdk                      1.1.0
opentelemetry-semantic-conventions     0.20b0
packaging                              21.3
pandas                                 1.5.1
paramiko                               2.11.0
parquet-tools                          0.2.11
parso                                  0.8.3
pathspec                               0.10.2
pathtools                              0.1.2
pexpect                                4.8.0
pickleshare                            0.7.5
pillow                                 9.0.0
Pillow-SIMD                            9.0.0.post1
pip                                    22.2.2
pkginfo                                1.8.3
pkgutil_resolve_name                   1.3.10
platformdirs                           2.5.2
portalocker                            2.6.0
prometheus-client                      0.13.1
promise                                2.3
prompt-toolkit                         3.0.32
protobuf                               3.19.6
psutil                                 5.9.3
psycopg2-binary                        2.9.5
ptyprocess                             0.7.0
pure-eval                              0.2.2
py-spy                                 0.3.14
pyarrow                                6.0.1
pyasn1                                 0.4.8
pyasn1-modules                         0.2.8
pycosat                                0.6.4
pycparser                              2.21
pydantic                               1.10.2
pydash                                 5.1.1
Pygments                               2.13.0
PyJWT                                  2.6.0
PyNaCl                                 1.5.0
pyOpenSSL                              22.0.0
pyparsing                              3.0.9
pyrsistent                             0.18.1
PySocks                                1.7.1
python-dateutil                        2.8.2
python-dotenv                          0.21.0
python-json-logger                     2.0.4
pytz                                   2022.5
PyWavelets                             1.4.1
PyYAML                                 6.0
pyzmq                                  23.2.0
ray                                    3.0.0.dev0
redis                                  3.5.3
requests                               2.28.1
requests-oauthlib                      1.3.1
rich                                   12.6.0
rsa                                    4.9
ruamel-yaml-conda                      0.15.100
s3transfer                             0.6.0
scikit-image                           0.19.3
scipy                                  1.9.3
sentry-sdk                             1.10.1
setproctitle                           1.3.2
setuptools                             65.5.0
shortuuid                              1.0.11
six                                    1.16.0
smart-open                             6.2.0
smmap                                  5.0.0
sniffio                                1.3.0
spinners                               0.0.24
stack-data                             0.6.1
starlette                              0.20.4
stdlibs                                2022.10.9
structlog                              22.1.0
sympy                                  1.11.1
tabulate                               0.8.10
tensorboardX                           2.5.1
termcolor                              2.1.0
thrift                                 0.13.0
tifffile                               2022.10.10
tokenize-rt                            5.0.0
toml                                   0.10.2
tomli                                  2.0.1
toolz                                  0.12.0
torch                                  1.14.0.dev20221027+cu116
torchdata                              0.6.0.dev20221027
torchsnapshot-nightly                  2022.10.29
torchvision                            0.15.0a0+edb3a80
tornado                                6.2
tqdm                                   4.64.1
trailrunner                            1.2.1
traitlets                              5.5.0
typer                                  0.6.1
types-awscrt                           0.15.3
types-s3transfer                       0.6.0.post4
typing_extensions                      4.4.0
typing-inspect                         0.8.0
uritemplate                            3.0.1
urllib3                                1.26.12
usort                                  1.0.5
uvicorn                                0.19.0
virtualenv                             20.16.5
wandb                                  0.13.4
wcwidth                                0.2.5
websocket-client                       1.4.1
wheel                                  0.37.1
yarl                                   1.8.1
zipp                                   3.9.0

Docker image: rayproject/ray:6f5f1e-py38-cu116.

Reproduction script

import ray
import ray.train.torch as ray_torch
import torchdata.datapipes.iter as pipes
from ray.air import ScalingConfig
from torch.utils.data import DataLoader

ray.init()

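def loader():
    # Build a small datapipe: 2000 integers grouped into batches of 5.
    pipe = pipes.IterableWrapper(list(range(2000)))
    pipe = pipe.batch(5)
    # fullsync is the last stage of the pipe; this is the stage that times out.
    pipe = pipe.fullsync()
    # batch_size=None because the datapipe already yields batches.
    return DataLoader(pipe, batch_size=None, num_workers=5)

def train_loop():
    dl = loader()
    print("TRYING TO CREATE SAMPLE")
    # Fetching the first batch never returns.
    x = next(iter(dl))
    print("THIS LINE NEVER GETS PRINTED")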

trainer = ray_torch.TorchTrainer(
    train_loop_per_worker=train_loop,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
trainer.fit()

Issue Severity

High: It blocks me from completing my task.

bveeramani commented 1 year ago

Closing because the underlying issue is with Torch DataPipe: https://github.com/pytorch/data/issues/868.