mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

File exists: '/000000_epoch_shape' when using the ddp strategy from pytorch lightning #767

Open · elbamos opened 2 months ago

elbamos commented 2 months ago

Environment

To reproduce

Steps to reproduce the behavior:

from streaming import StreamingDataset, StreamingDataLoader
from streaming.base.util import clean_stale_shared_memory

def get_dataloader_with_mosaic(path, batch_size, shuffle=False):
  # Utility function to clean up stale shared memory during distributed training
  clean_stale_shared_memory()

  # Creating the `StreamingDataset` object and the `StreamingDataLoader` object.
  dataset = StreamingDataset(local=path, shuffle=shuffle, batch_size=batch_size)
  return StreamingDataLoader(dataset, batch_size=batch_size, num_workers=31, drop_last=True, persistent_workers=True), dataset

eval_dataloader, eval_dataset = get_dataloader_with_mosaic(f"{data_storage_location}/mds_{experiment_name}_val", batch_size=256, shuffle=False)
train_dataloader, train_dataset = get_dataloader_with_mosaic(f"{data_storage_location}/mds_{experiment_name}_train", batch_size=32, shuffle=True)

trainer = pl.Trainer(
    accelerator='gpu', 
    devices=4, 
    strategy='ddp_notebook',
    max_epochs=10, 
    num_sanity_val_steps=0
)

trainer.fit(pretrainer, train_dataloader, val_dataloaders=eval_dataloader)

Expected behavior

I'd expect training to begin.

Additional context

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/databricks/python/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 173, in _wrapping_function
    results = function(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 212, in advance
    batch, _, __ = next(data_fetcher)
                   ^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/pytorch_lightning/loops/fetchers.py", line 133, in __next__
    batch = super().__next__()
            ^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
            ^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/pytorch_lightning/utilities/combined_loader.py", line 78, in __next__
    out[i] = next(self.iterators[i])
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/streaming/base/dataloader.py", line 58, in __iter__
    for batch in super().__iter__():
  File "/databricks/python/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/databricks/python/lib/python3.11/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
FileExistsError: Caught FileExistsError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/databricks/python/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/streaming/base/dataset.py", line 1501, in __iter__
    sample_ids = self._get_work(epoch, sample_in_epoch)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/streaming/base/dataset.py", line 1038, in _get_work
    shape_shm, data_shm = self._share_work(epoch_sample_ids)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/streaming/base/dataset.py", line 953, in _share_work
    shape_shm = SharedMemory(name=name, create=True, size=size, auto_cleanup=False)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc84ac28-3e23-4bab-908e-384148539e68/lib/python3.11/site-packages/streaming/base/shared/memory.py", line 41, in __init__
    shm = BuiltinSharedMemory(name, create, size)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/shared_memory.py", line 104, in __init__
    self._fd = _posixshmem.shm_open(
               ^^^^^^^^^^^^^^^^^^^^^
FileExistsError: [Errno 17] File exists: '/000000_epoch_shape'
XiaohanZhangCMU commented 2 months ago

Hello @elbamos, are you able to loop through the dataloader by itself (meaning a pure for loop, no trainer involved)? If so, does this shared memory problem show up consistently, and does it show up with other trainers/launchers? Having that information would help us isolate the issue further. Thanks!

elbamos commented 2 months ago

Thanks, @XiaohanZhangCMU. I'm actually able to train fine as long as I'm training on one GPU. The problem arises when I try to train on multiple GPUs using the ddp_notebook strategy, which launches additional processes by forking. What I suspect is going on is that PyTorch / PyTorch Lightning is not setting the environment variables that MosaicML Streaming expects in the forked processes.

XiaohanZhangCMU commented 2 months ago

@elbamos yes, I agree, that's a reasonable hypothesis. Can you compare the env vars on your platform with the ones that streaming expects (listed here)?

elbamos commented 2 months ago

I'm not sure how to do that, because the env vars are only set inside the call to .fit().

At the beginning of the call to fit, lightning outputs:

LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

which makes me think it may be setting LOCAL_RANK instead of RANK. I'm going to try to walk through the lightning code to confirm that. Unless you have advice about how to verify the env vars during the call to fit()?
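
One option might be a small callback that just prints the relevant variables from inside each process once the strategy has launched its workers; something like the following (the variable names below are the distributed-training ones I believe streaming reads, so adjust the list to match the linked docs):

import os
from pytorch_lightning import Callback

class EnvVarLogger(Callback):
    def setup(self, trainer, pl_module, stage):
        # Print the rank-related env vars as seen by this (possibly forked) process.
        names = ['RANK', 'WORLD_SIZE', 'LOCAL_RANK', 'LOCAL_WORLD_SIZE',
                 'NODE_RANK', 'MASTER_ADDR', 'MASTER_PORT']
        print({name: os.environ.get(name) for name in names})

Attaching that to the Trainer's callbacks and calling fit() should show whether RANK is actually missing in the spawned processes.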

elbamos commented 2 months ago

Yes, they're setting LOCAL_RANK and NODE_RANK but not RANK. https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html#distributed-data-parallel. Is there any way to make this compatible from the mosaicml side, or is this going to require a change by the lightning folks?

elbamos commented 2 months ago

@XiaohanZhangCMU just tagging you to make sure you saw the messages above... Thank you in advance for your help with this.

XiaohanZhangCMU commented 2 months ago

I've never used Lightning before; I'm asking a few folks on the team who may have done that and can share their experience.

On the other hand, if you cannot change anything on the Lightning end, maybe try monkeypatching this file to derive the missing env vars from Lightning. For example:

def get_rank() -> int:
    """Returns the rank of the current process, which is on ``[0; WORLD_SIZE - 1]``.

    Returns:
        int: The rank.
    """
    # return int(os.environ.get('RANK', 0))
    # Derive the global rank from the variables Lightning does set.
    # GPUS_PER_NODE is a placeholder for your per-node device count.
    return int(os.environ.get('NODE_RANK', 0)) * GPUS_PER_NODE + int(os.environ.get('LOCAL_RANK', 0))
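
Applying the patch might look something like this (assuming the file above is streaming/base/distributed.py; double-check the module path against your installed version, and note that modules which already did `from ... import get_rank` before the patch won't see it):

import os
import streaming.base.distributed as streaming_dist

GPUS_PER_NODE = 4  # placeholder: set to the number of GPUs per node

def patched_get_rank() -> int:
    # Reconstruct the global rank from the env vars Lightning does set.
    return int(os.environ.get('NODE_RANK', 0)) * GPUS_PER_NODE + int(os.environ.get('LOCAL_RANK', 0))

streaming_dist.get_rank = patched_get_rank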
XiaohanZhangCMU commented 2 months ago

Yes, they're setting LOCAL_RANK and NODE_RANK but not RANK. https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html#distributed-data-parallel. Is there any way to make this compatible from the mosaicml side, or is this going to require a change by the lightning folks?

Yeah, that explains the "file exists" error. Streaming relies on the rank to detect workers, nodes, etc.

elbamos commented 2 months ago

Actually - I think I solved this. The StreamingDataset needs to be initialized in the forked process rather than in the master process and pickled. Then it runs properly. Sorry for the misdirection.

XiaohanZhangCMU commented 2 months ago

@elbamos Great. Before closing the issue, can you elaborate a bit more on the root cause and the resolution you arrived at? I'm sure it will be valuable learning for other users as well. Thank you!

elbamos commented 2 months ago

The root cause of the issue is that pytorch lightning doesn't properly set the RANK environment variable in processes launched in ddp_notebook mode.

I have a partial solution with two parts:

  1. Instead of instantiating the StreamingDataset in the master process and serializing it to the subprocesses, create a PyTorch Lightning DataModule that instantiates the StreamingDataset in its setup method (a simplified sketch follows the callback code below).
  2. Add a callback to set the appropriate environment variables:
import os
from pytorch_lightning import Callback  # or lightning.pytorch.Callback, depending on your install

class EarlyEnvironmentSetter(Callback):
    def __init__(self):
        super().__init__()
        self.rank_set = False

    def setup(self, trainer, pl_module, stage):
        if not self.rank_set:
            world_size = trainer.num_devices
            local_rank = trainer.strategy.local_rank

            os.environ['WORLD_SIZE'] = str(world_size)
            os.environ['LOCAL_WORLD_SIZE'] = str(world_size)
            os.environ['LOCAL_RANK'] = str(local_rank)
            # Single-node assumption: the global rank equals the local rank.
            os.environ['RANK'] = str(local_rank)

            self.rank_set = True
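
For reference, a stripped-down, illustrative version of the DataModule from step 1 looks roughly like this (the class name, paths, and loader arguments are placeholders):

import pytorch_lightning as pl
from streaming import StreamingDataset, StreamingDataLoader

class MosaicDataModule(pl.LightningDataModule):
    def __init__(self, train_path, val_path, train_batch_size, val_batch_size):
        super().__init__()
        self.train_path = train_path
        self.val_path = val_path
        self.train_batch_size = train_batch_size
        self.val_batch_size = val_batch_size

    def setup(self, stage=None):
        # Runs in each spawned process, so the StreamingDataset is never pickled.
        self.train_dataset = StreamingDataset(local=self.train_path, shuffle=True,
                                              batch_size=self.train_batch_size)
        self.val_dataset = StreamingDataset(local=self.val_path, shuffle=False,
                                            batch_size=self.val_batch_size)

    def train_dataloader(self):
        return StreamingDataLoader(self.train_dataset, batch_size=self.train_batch_size,
                                   drop_last=True)

    def val_dataloader(self):
        return StreamingDataLoader(self.val_dataset, batch_size=self.val_batch_size)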

While this runs on hardware with 4 GPUs, performance is seriously degraded: I get 3-4 it/s on one GPU but 0.8 it/s on 4 GPUs. It isn't clear to me whether this is caused by a misconfiguration of mosaic streaming, or whether it's to be expected from the ddp_notebook strategy.

On 8 GPUs, however, the call to instantiate the StreamingDataset fails with this error:

FileExistsError: [Errno 17] File exists: '/000012_locals'

where the number preceding "locals" changes each run.

The stack trace is:

  File "/root/.ipykernel/3774/command-3760228790545520-2502499197", line 23, in setup
    self.train_dataset = StreamingDataset(local=f"{data_storage_location}/mds_{experiment_name}_train", shuffle=True, batch_size=self.train_batch_size)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-636537cb-4a5f-463f-be27-bc3277f07b7a/lib/python3.11/site-packages/streaming/base/dataset.py", line 529, in __init__
    self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-636537cb-4a5f-463f-be27-bc3277f07b7a/lib/python3.11/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
    shm = SharedMemory(name, True, len(data))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-636537cb-4a5f-463f-be27-bc3277f07b7a/lib/python3.11/site-packages/streaming/base/shared/memory.py", line 41, in __init__
    shm = BuiltinSharedMemory(name, create, size)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/shared_memory.py", line 104, in __init__
    self._fd = _posixshmem.shm_open(
               ^^^^^^^^^^^^^^^^^^^^^

For those reasons, I'm leaving this open, and tagging @XiaohanZhangCMU one more time to see if he has any advice?

elbamos commented 2 months ago

One amendment:

Adding

            os.environ['MASTER_ADDR'] = '127.0.0.1'
            os.environ['MASTER_PORT'] = '12355'

to the callback enabled it to launch on 8 GPUs, but performance fell to 0.26 it/s.

XiaohanZhangCMU commented 2 months ago

@elbamos Sorry, not many of us have hands-on experience with Lightning, so there aren't many insights we can offer here. (Have you considered switching to Composer?)

Streaming uses SharedMemory and resource_tracker to orchestrate processes and manipulate shared arrays/scalars etc. I am not very sure whether "create a pytorch lightning DataModule that instantiates the StreamingDataset" would comply with that design, which may be the main source of performance degradation.

elbamos commented 2 months ago

I am considering switching to Composer; I'm not sure whether I can run Composer on multiple GPUs from a notebook, though.

Using the DataModule means that the call to create the StreamingDataset() happens multiple times, once in each spawned process.

XiaohanZhangCMU commented 2 months ago

Using the DataModule means that the call to create the StreamingDataset() happens multiple times, once in each spawned process.

That interferes with streaming's initialization.

If you are running in a notebook, have you tried TorchDistributor + Lightning? E.g.:

def main_training_loop(log_path, num_gpus, num_nodes):
  import torch
  from lightning import Trainer
  torch.set_float32_matmul_precision(precision="medium")

  # device_stats = DeviceStatsMonitor()
  trainer = Trainer(accelerator="gpu",
                    devices=num_gpus,
                    num_nodes=num_nodes, 
                    strategy="ddp_notebook") 
  trainer.fit(model=..., datamodule=...)

from pyspark.ml.torch.distributor import TorchDistributor 
NUM_PROCESSES = 2 # 2 gpus
output = TorchDistributor(num_processes=NUM_PROCESSES, local_mode=True, use_gpu=True)\
  .run(main_training_loop)
elbamos commented 2 months ago

Thank you for the torch distributor suggestion. That looks like a potentially promising approach. I was able to get it running with some work. But -

If I create the StreamingDataset directly inside main_training_loop, I get an NCCL error. If I use a DataModule and create the StreamingDataset from the setup function, training does begin, but performance drops to 0.02 it/s (on 8 GPUs).

jbohnslav commented 2 months ago

Hi, I use lightning with Mosaic Streaming. The trick is to launch your training script with torchrun. Then everything more or less works.

XiaohanZhangCMU commented 2 months ago

@elbamos Can you try torchrun as @jbohnslav suggested? Let us know if it works.

elbamos commented 2 months ago

I've been trying that this morning, thank you to both of you.

Executing torchrun directly in the Databricks notebook environment doesn't work, because it doesn't see manually installed Python packages. However, calling the pyspark torch distributor with the path to a file instead of a function, according to the documentation, calls torchrun under the hood, so I've been trying that. The code executes, but I'm still seeing the performance drop, to 0.02 it/s on 4 GPUs. (It does go up to 0.36 it/s if I set the number of workers to 0. It isn't clear to me from the streaming documentation whether the number of workers should be 0 or the number of available cores / num_gpus, so I've tried it both ways. Interestingly, the validation speed is 11 it/s with one worker per core, and 0.04 it/s with the number of workers set to 0.)

@jbohnslav can you share any more details of your configuration? Are you building the StreamingDataset inside a DataModule? Using local_mode? Have arguments to torchrun? Using lightning CLI?

jbohnslav commented 2 months ago

I think you're seeing two separate issues: if you can't get streaming dataset to work at all with pytorch lightning, then torchrun is our solution. If you're having throughput issues, configuring the Streaming Dataset for optimal performance is a pretty complex endeavor with lots of things to try.

Executing torchrun directly in the Databricks notebook environment doesn't work, because it doesn't see manually installed Python packages.

I can't help with a databricks notebook environment. If you can't call torchrun at a command line, you can just import it like so: from torch.distributed.run import main as torchrun.
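
Invoking it programmatically then takes the same arguments you'd pass on the command line; something like this (the flags and script path are illustrative, adjust them for your setup):

from torch.distributed.run import main as torchrun

# Equivalent to: torchrun --standalone --nproc_per_node 4 train.py
torchrun([
    '--standalone',
    '--nproc_per_node', '4',
    'train.py',  # placeholder: your training script
])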

Are you building the StreamingDataset inside a DataModule? Using local_mode? Have arguments to torchrun? Using lightning CLI?

We are building the dataset in a DataModule. I'm not sure what local_mode is. We have arguments to torchrun depending on the number of GPUs, nodes, etc. We're using the c10d backend. We're launching from our own python script, not torchrun at the command line or the lightning CLI.

AugustDev commented 1 month ago

Also getting something similar

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 59, in <module>
[rank1]:     main()
[rank1]:   File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 55, in main
[rank1]:     run_experiment(config)
[rank1]:   File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 24, in run_experiment
[rank1]:     trainer.fit(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank1]:     call._call_and_handle_interrupt(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]:     return function(*args, **kwargs)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank1]:     self._run(model, ckpt_path=ckpt_path)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 943, in _run
[rank1]:     call._call_setup_hook(self)  # allow user to set up LightningModule in accelerator environment
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 102, in _call_setup_hook
[rank1]:     _call_lightning_datamodule_hook(trainer, "setup", stage=fn)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 189, in _call_lightning_datamodule_hook
[rank1]:     return fn(*args, **kwargs)
[rank1]:   File "/home/august/cfdx/ai/dna_fm/datasources/lightning_data_module.py", line 71, in setup
[rank1]:     datasets = DatasetFactory.create_dataset(self.data_args, self.model_args, self.tokenizer)
[rank1]:   File "/home/august/cfdx/ai/dna_fm/datasources/dataset_factory.py", line 289, in create_dataset
[rank1]:     return get_mosaic_dataset(
[rank1]:   File "/home/august/cfdx/ai/dna_fm/datasources/dataset_factory.py", line 201, in get_mosaic_dataset
[rank1]:     result[set_name] = MosaicDatasetWithProcessing(
[rank1]:   File "/home/august/cfdx/ai/dna_fm/datasources/dataset_factory.py", line 122, in __init__
[rank1]:     super().__init__(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/dataset.py", line 529, in __init__
[rank1]:     self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
[rank1]:     shm = SharedMemory(name, True, len(data))
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/memory.py", line 41, in __init__
[rank1]:     shm = BuiltinSharedMemory(name, create, size)
[rank1]:   File "/opt/conda/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
[rank1]:     self._fd = _posixshmem.shm_open(
[rank1]: FileExistsError: [Errno 17] File exists: '/000006_locals'
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 10 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000003_shard_states': [Errno 2] No such file or directory: '/000003_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_shard_access_times': [Errno 2] No such file or directory: '/000006_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000003_cache_usage': [Errno 2] No such file or directory: '/000003_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000003_next_epoch': [Errno 2] No such file or directory: '/000003_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000003_barrier': [Errno 2] No such file or directory: '/000003_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_next_epoch': [Errno 2] No such file or directory: '/000006_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_cache_usage': [Errno 2] No such file or directory: '/000006_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_barrier': [Errno 2] No such file or directory: '/000006_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_shard_states': [Errno 2] No such file or directory: '/000006_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000003_shard_access_times': [Errno 2] No such file or directory: '/000003_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
[rank: 1] Child process with PID 11951 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 13 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000001_shard_states': [Errno 2] No such file or directory: '/000001_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000005_shard_states': [Errno 2] No such file or directory: '/000005_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000005_barrier': [Errno 2] No such file or directory: '/000005_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_locals': [Errno 2] No such file or directory: '/000006_locals'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000001_barrier': [Errno 2] No such file or directory: '/000001_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000001_next_epoch': [Errno 2] No such file or directory: '/000001_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000005_cache_usage': [Errno 2] No such file or directory: '/000005_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000001_shard_access_times': [Errno 2] No such file or directory: '/000001_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000005_next_epoch': [Errno 2] No such file or directory: '/000005_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000005_shard_access_times': [Errno 2] No such file or directory: '/000005_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000001_cache_usage': [Errno 2] No such file or directory: '/000001_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 13 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000004_barrier': [Errno 2] No such file or directory: '/000004_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000004_shard_access_times': [Errno 2] No such file or directory: '/000004_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000005_locals': [Errno 2] No such file or directory: '/000005_locals'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000004_next_epoch': [Errno 2] No such file or directory: '/000004_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000007_locals': [Errno 2] No such file or directory: '/000007_locals'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000007_shard_states': [Errno 2] No such file or directory: '/000007_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000007_next_epoch': [Errno 2] No such file or directory: '/000007_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000007_shard_access_times': [Errno 2] No such file or directory: '/000007_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000007_cache_usage': [Errno 2] No such file or directory: '/000007_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000004_cache_usage': [Errno 2] No such file or directory: '/000004_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000004_shard_states': [Errno 2] No such file or directory: '/000004_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000007_barrier': [Errno 2] No such file or directory: '/000007_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 12 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000004_locals': [Errno 2] No such file or directory: '/000004_locals'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000000_cache_usage': [Errno 2] No such file or directory: '/000000_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000004_barrier': [Errno 2] No such file or directory: '/000004_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000004_cache_usage': [Errno 2] No such file or directory: '/000004_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000004_shard_access_times': [Errno 2] No such file or directory: '/000004_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000000_shard_states': [Errno 2] No such file or directory: '/000000_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000000_shard_access_times': [Errno 2] No such file or directory: '/000000_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000004_next_epoch': [Errno 2] No such file or directory: '/000004_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000000_barrier': [Errno 2] No such file or directory: '/000000_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000000_next_epoch': [Errno 2] No such file or directory: '/000000_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000004_shard_states': [Errno 2] No such file or directory: '/000004_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 12 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_next_epoch': [Errno 2] No such file or directory: '/000006_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000003_next_epoch': [Errno 2] No such file or directory: '/000003_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000003_shard_states': [Errno 2] No such file or directory: '/000003_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000003_shard_access_times': [Errno 2] No such file or directory: '/000003_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 13 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_shard_states': [Errno 2] No such file or directory: '/000006_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000003_cache_usage': [Errno 2] No such file or directory: '/000003_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000003_barrier': [Errno 2] No such file or directory: '/000003_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_shard_access_times': [Errno 2] No such file or directory: '/000006_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000005_cache_usage': [Errno 2] No such file or directory: '/000005_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000003_locals': [Errno 2] No such file or directory: '/000003_locals'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_locals': [Errno 2] No such file or directory: '/000006_locals'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000001_cache_usage': [Errno 2] No such file or directory: '/000001_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_cache_usage': [Errno 2] No such file or directory: '/000006_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_barrier': [Errno 2] No such file or directory: '/000006_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000001_shard_states': [Errno 2] No such file or directory: '/000001_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000005_locals': [Errno 2] No such file or directory: '/000005_locals'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000001_next_epoch': [Errno 2] No such file or directory: '/000001_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000005_shard_states': [Errno 2] No such file or directory: '/000005_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000005_barrier': [Errno 2] No such file or directory: '/000005_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000005_shard_access_times': [Errno 2] No such file or directory: '/000005_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000005_next_epoch': [Errno 2] No such file or directory: '/000005_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000001_locals': [Errno 2] No such file or directory: '/000001_locals'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000001_barrier': [Errno 2] No such file or directory: '/000001_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000001_shard_access_times': [Errno 2] No such file or directory: '/000001_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000002_next_epoch': [Errno 2] No such file or directory: '/000002_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000002_barrier': [Errno 2] No such file or directory: '/000002_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000002_shard_states': [Errno 2] No such file or directory: '/000002_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000002_cache_usage': [Errno 2] No such file or directory: '/000002_cache_usage'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000002_shard_access_times': [Errno 2] No such file or directory: '/000002_shard_access_times'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000001_locals': [Errno 2] No such file or directory: '/000001_locals'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000002_locals': [Errno 2] No such file or directory: '/000002_locals'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000004_locals': [Errno 2] No such file or directory: '/000004_locals'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
snarayan21 commented 1 month ago

@elbamos As mentioned, torchrun or torch distributor work with StreamingDataset, in addition to Composer. From a Databricks notebook, torch distributor should make launching your job easy.

@jbohnslav Regarding:

If you're having throughput issues, configuring the Streaming Dataset for optimal performance is a pretty complex endeavor with lots of things to try.

We've built the Streaming simulator for exactly this issue -- if you're seeing dataloader bottlenecks or want to optimize dataloading performance, we highly recommend using it.

snarayan21 commented 1 month ago

@AugustDev You filed #781, correct? @XiaohanZhangCMU's recommendations there make sense to me -- you can see the currently running processes with top and kill them. Then clear your stale shared memory and rerun training.
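
For clearing the stale shared memory, the same helper used in the reproduction script at the top of this issue can be called before relaunching:

from streaming.base.util import clean_stale_shared_memory

# Removes leftover streaming shared-memory objects from a previous crashed run.
clean_stale_shared_memory()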