elbamos opened this issue 3 months ago
Hello @elbamos, are you able to loop through the dataloader by itself (meaning a pure for loop, no trainer involved)? If so, does this shared memory problem show up consistently? And does it show up with other trainers/launchers? Having that information would help us isolate the issue further. Thanks!
Thanks, @XiaohanZhangCMU. I'm actually able to train fine as long as I'm training on one GPU. The problem arises when I try to train on multiple GPUs using the ddp_notebook strategy, which launches additional processes by forking. What I suspect is going on is that PyTorch / PyTorch Lightning is not setting the environment variables that mosaicml is expecting in the forked processes.
@elbamos yes, I agree, that's a reasonable hypothesis. Can you compare the env vars on your platform with the ones that streaming expects (listed here)?
I'm not sure how to do that, because the env vars are only set inside the call to .fit().
At the beginning of the call to fit(), Lightning outputs:
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
which makes me think it may be setting LOCAL_RANK instead of RANK. I'm going to try to walk through the Lightning code to confirm that. Unless you have advice about how to verify the env vars during the call to fit()?
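One way I can think of to check is a small callback that prints the relevant variables from inside each spawned process; a rough sketch (untested, and the variable names are just the ones discussed in this thread):

import os
from lightning.pytorch.callbacks import Callback

class EnvVarLogger(Callback):
    """Print the distributed-training env vars from inside each spawned process."""

    def setup(self, trainer, pl_module, stage):
        keys = ['RANK', 'LOCAL_RANK', 'NODE_RANK', 'WORLD_SIZE',
                'LOCAL_WORLD_SIZE', 'MASTER_ADDR', 'MASTER_PORT']
        print(f'[pid {os.getpid()}]', {k: os.environ.get(k, '<unset>') for k in keys})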
Yes, they're setting LOCAL_RANK and NODE_RANK but not RANK. https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html#distributed-data-parallel. Is there any way to make this compatible from the mosaicml side, or is this going to require a change by the lightning folks?
@XiaohanZhangCMU just tagging you to make sure you saw the messages above... Thank you in advance for your help with this.
I've never used Lightning before; I'm asking a few folks on the team who may have and can share their experience.
On the other hand, if you can't change anything on the Lightning end, maybe try monkeypatching this file to derive the missing env vars from the ones Lightning does set? For example:
GPUS_PER_NODE = 4  # number of GPUs per node on your setup

def get_rank() -> int:
    """Returns the rank of the current process, which is on ``[0; WORLD_SIZE - 1]``.

    Returns:
        int: The rank.
    """
    # return int(os.environ.get('RANK', 0))
    return int(os.environ.get('NODE_RANK', 0)) * GPUS_PER_NODE + int(os.environ.get('LOCAL_RANK', 0))
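If you do go the monkeypatching route, applying it might look roughly like this. Note that streaming.base.distributed as the home of get_rank is an assumption here; confirm the actual module path in your installed version first:

import os
import streaming.base.distributed as sdist  # assumed module path; verify in your install

GPUS_PER_NODE = 4  # adjust to your hardware

def lightning_get_rank() -> int:
    # Derive the global rank from the env vars Lightning does set.
    return int(os.environ.get('NODE_RANK', 0)) * GPUS_PER_NODE + int(os.environ.get('LOCAL_RANK', 0))

sdist.get_rank = lightning_get_rank  # patch before constructing any StreamingDataset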
Yes, they're setting LOCAL_RANK and NODE_RANK but not RANK. https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html#distributed-data-parallel. Is there any way to make this compatible from the mosaicml side, or is this going to require a change by the lightning folks?
Yeah, that explains the file-exists error. Streaming relies on rank to detect workers, nodes, etc.
Actually, I think I solved this. The StreamingDataset needs to be initialized in the forked process rather than in the master process and pickled. Then it runs properly. Sorry for the misdirection.
@elbamos Great. Before closing the issue, can you elaborate a bit more on what the root cause was and the resolution you arrived at? I'm sure it's valuable learning for other users as well. Thank you!
The root cause of the issue is that PyTorch Lightning doesn't properly set the RANK environment variable in processes launched in ddp_notebook mode.
I have a partial solution with two parts:
1. Instead of creating the StreamingDataset in the master process and serializing it to the subprocesses, create a PyTorch Lightning DataModule that instantiates the StreamingDataset in its setup method.
2. Add a callback that sets the environment variables Streaming expects:

class EarlyEnvironmentSetter(Callback):
    def __init__(self):
        super().__init__()
        self.rank_set = False

    def setup(self, trainer, pl_module, stage):
        if not self.rank_set:
            world_size = trainer.num_devices
            local_rank = trainer.strategy.local_rank

            os.environ['WORLD_SIZE'] = str(world_size)
            os.environ['LOCAL_WORLD_SIZE'] = str(world_size)
            os.environ['LOCAL_RANK'] = str(local_rank)
            os.environ['RANK'] = str(local_rank)

            self.rank_set = True
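The callback then just gets registered with the Trainer; a minimal sketch of how I'm wiring it up (model and datamodule are placeholders for my own LightningModule and DataModule):

from lightning import Trainer

trainer = Trainer(
    accelerator='gpu',
    devices=4,
    strategy='ddp_notebook',
    callbacks=[EarlyEnvironmentSetter()],
)
trainer.fit(model=model, datamodule=datamodule)  # placeholders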
While this runs on hardware with 4 GPUs, performance is seriously degraded: I get 3-4 it/s on one GPU, but only 0.8 it/s on 4 GPUs. It isn't clear to me whether this is caused by a misconfiguration of mosaic streaming, or whether it's to be expected from the ddp_notebook strategy.
On 8 GPUs, however, the call to instantiate the StreamingDataset fails with this error:
FileExistsError: [Errno 17] File exists: '/000012_locals'
where the number preceding "locals" changes each run.
The stack trace is:
File "/root/.ipykernel/3774/command-3760228790545520-2502499197", line 23, in setup
self.train_dataset = StreamingDataset(local=f"{data_storage_location}/mds_{experiment_name}_train", shuffle=True, batch_size=self.train_batch_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-636537cb-4a5f-463f-be27-bc3277f07b7a/lib/python3.11/site-packages/streaming/base/dataset.py", line 529, in __init__
self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-636537cb-4a5f-463f-be27-bc3277f07b7a/lib/python3.11/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
shm = SharedMemory(name, True, len(data))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-636537cb-4a5f-463f-be27-bc3277f07b7a/lib/python3.11/site-packages/streaming/base/shared/memory.py", line 41, in __init__
shm = BuiltinSharedMemory(name, create, size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/shared_memory.py", line 104, in __init__
self._fd = _posixshmem.shm_open(
^^^^^^^^^^^^^^^^^^^^^
For those reasons, I'm leaving this open and tagging @XiaohanZhangCMU one more time to see if he has any advice.
One amendment: adding
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '12355'
to the callback enabled it to launch on 8 GPUs, but performance fell to 0.26 it/s.
@elbamos Sorry, not many of us have hands-on experience with Lightning, so there isn't much insight we can offer here. (Have you considered switching to Composer?)
Streaming uses SharedMemory and resource_tracker to orchestrate processes and manipulate shared arrays/scalars, etc. I'm not sure whether "create a pytorch lightning DataModule that instantiates the StreamingDataset" complies with that design, which may be the main source of the performance degradation.
I am considering switching to composer; I'm not sure if I can run composer on multiple gpus from a notebook though?
Using the DataModule means that the call to create the StreamingDataset() happens multiple times, once in each spawned process.
Using the DataModule means that the call to create the StreamingDataset() happens multiple times, once in each spawned process.
That messes with streaming's initialization.
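For concreteness, the pattern being described is roughly the following (the class name, path, and sizes are illustrative):

import lightning as L
from torch.utils.data import DataLoader
from streaming import StreamingDataset

class MosaicDataModule(L.LightningDataModule):
    def __init__(self, local_dir: str, batch_size: int):
        super().__init__()
        self.local_dir = local_dir
        self.batch_size = batch_size

    def setup(self, stage=None):
        # Runs once in every spawned process, so each rank constructs its own StreamingDataset.
        self.train_dataset = StreamingDataset(
            local=self.local_dir, shuffle=True, batch_size=self.batch_size)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size, num_workers=8)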
If you are running in a notebook, have you tried TorchDistributor + Lightning? E.g.,

def main_training_loop(log_path, num_gpus, num_nodes):
    import torch
    from lightning import Trainer

    torch.set_float32_matmul_precision(precision="medium")
    # device_stats = DeviceStatsMonitor()
    trainer = Trainer(accelerator="gpu",
                      devices=num_gpus,
                      num_nodes=num_nodes,
                      strategy="ddp_notebook")
    trainer.fit(model=..., datamodule=...)

from pyspark.ml.torch.distributor import TorchDistributor

NUM_PROCESSES = 2  # 2 gpus
log_path = ...  # wherever you want training logs written
output = TorchDistributor(num_processes=NUM_PROCESSES, local_mode=True, use_gpu=True) \
    .run(main_training_loop, log_path, NUM_PROCESSES, 1)
Thank you for the torch distributor suggestion. That looks like a potentially promising approach. I was able to get it running with some work. But:
If I create the StreamingDataset directly inside main_training_loop, I get an NCCL error. If I use a DataModule and create the StreamingDataset from the setup function, training does begin, but performance drops to 0.02 it/s (on 8 GPUs).
Hi, I use Lightning with Mosaic Streaming. The trick is to launch your training script with torchrun. Then everything more or less works.
@elbamos Can you try torchrun as @jbohnslav suggested? Let us know if it works.
I've been trying that this morning; thank you to both of you.
Executing torchrun directly in the Databricks notebook environment doesn't work, because it doesn't see manually installed Python packages. According to the documentation, however, calling the pyspark torch distributor with the path to a file instead of a function calls torchrun under the hood, so I've been trying that. The code executes, but I'm still seeing the performance drop, to 0.02 it/s on 4 GPUs. (It does go up to 0.36 it/s if I set the number of workers to 0. It isn't clear to me from the streaming documentation whether the number of workers should be 0 or the number of available cores / num_gpus, so I've tried it both ways. Interestingly, the validation speed is 11 it/s with one worker per core, and 0.04 it/s with the number of workers set to 0.)
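For reference, the dataloader setup I've been experimenting with is a plain torch DataLoader over the StreamingDataset, with batch_size passed to both and num_workers as the knob in question (the path and sizes below are placeholders):

from torch.utils.data import DataLoader
from streaming import StreamingDataset

dataset = StreamingDataset(local="/tmp/mds_train", shuffle=True, batch_size=32)  # placeholder path
loader = DataLoader(
    dataset,
    batch_size=32,           # keep in sync with the batch_size given to StreamingDataset
    num_workers=8,           # the setting in question: 0 vs. cores // num_gpus
    persistent_workers=True,
    pin_memory=True,
)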
@jbohnslav can you share any more details of your configuration? Are you building the StreamingDataset inside a DataModule? Using local_mode? Do you pass arguments to torchrun? Are you using the Lightning CLI?
I think you're seeing two separate issues: if you can't get the StreamingDataset to work at all with PyTorch Lightning, then torchrun is our solution. If you're having throughput issues, configuring the StreamingDataset for optimal performance is a pretty complex endeavor with lots of things to try.
Executing torchrun directly in the Databricks notebook environment doesn't work, because it doesn't see manually installed Python packages.
I can't help with a Databricks notebook environment. If you can't call torchrun at a command line, you can just import it like so: from torch.distributed.run import main as torchrun.
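In other words, something along these lines (the script name and flags are placeholders):

from torch.distributed.run import main as torchrun

# Roughly equivalent to running: torchrun --standalone --nproc_per_node=4 train.py
torchrun([
    "--standalone",
    "--nproc_per_node=4",
    "train.py",
])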
Are you building the StreamingDataset inside a DataModule? Using local_mode? Have arguments to torchrun? Using lightning CLI?
We are building the dataset in a DataModule. I'm not sure what local_mode is. We have arguments to torchrun depending on the number of GPUs, nodes, etc. We're using the c10d backend. We're launching from our own Python script, not torchrun at the command line or the Lightning CLI.
I'm also getting something similar:
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 59, in <module>
[rank1]: main()
[rank1]: File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 55, in main
[rank1]: run_experiment(config)
[rank1]: File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 24, in run_experiment
[rank1]: trainer.fit(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 943, in _run
[rank1]: call._call_setup_hook(self) # allow user to set up LightningModule in accelerator environment
[rank1]: File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 102, in _call_setup_hook
[rank1]: _call_lightning_datamodule_hook(trainer, "setup", stage=fn)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 189, in _call_lightning_datamodule_hook
[rank1]: return fn(*args, **kwargs)
[rank1]: File "/home/august/cfdx/ai/dna_fm/datasources/lightning_data_module.py", line 71, in setup
[rank1]: datasets = DatasetFactory.create_dataset(self.data_args, self.model_args, self.tokenizer)
[rank1]: File "/home/august/cfdx/ai/dna_fm/datasources/dataset_factory.py", line 289, in create_dataset
[rank1]: return get_mosaic_dataset(
[rank1]: File "/home/august/cfdx/ai/dna_fm/datasources/dataset_factory.py", line 201, in get_mosaic_dataset
[rank1]: result[set_name] = MosaicDatasetWithProcessing(
[rank1]: File "/home/august/cfdx/ai/dna_fm/datasources/dataset_factory.py", line 122, in __init__
[rank1]: super().__init__(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/streaming/base/dataset.py", line 529, in __init__
[rank1]: self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
[rank1]: File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
[rank1]: shm = SharedMemory(name, True, len(data))
[rank1]: File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/memory.py", line 41, in __init__
[rank1]: shm = BuiltinSharedMemory(name, create, size)
[rank1]: File "/opt/conda/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
[rank1]: self._fd = _posixshmem.shm_open(
[rank1]: FileExistsError: [Errno 17] File exists: '/000006_locals'
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 10 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000003_shard_states': [Errno 2] No such file or directory: '/000003_shard_states'
warnings.warn('resource_tracker: %r: %s' % (name, e))
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:237: UserWarning: resource_tracker: '/000006_shard_access_times': [Errno 2] No such file or directory: '/000006_shard_access_times'
warnings.warn('resource_tracker: %r: %s' % (name, e))
[... additional, near-identical resource_tracker warnings for the other /00000N_* shared memory objects trimmed ...]
[rank: 1] Child process with PID 11951 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
[... further resource_tracker leak warnings from the remaining ranks trimmed ...]
@elbamos As mentioned, torchrun or torch distributor work with StreamingDataset, in addition to Composer. From a Databricks notebook, torch distributor should make launching your job easy.
@jbohnslav Regarding:
If you're having throughput issues, configuring the Streaming Dataset for optimal performance is a pretty complex endeavor with lots of things to try.
We've built the Streaming simulator for exactly this issue -- if you're seeing dataloader bottlenecks or want to optimize dataloading performance, we highly recommend using it.
@AugustDev You filed #781, correct? @XiaohanZhangCMU's recommendations there make sense to me -- you can see the currently running processes with top and kill them. Then clear your stale shared memory and rerun training.
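For the cleanup step, recent versions of streaming ship a helper for exactly this; if your installed version has it, something like the following works (run it only when no other Streaming job is active on the machine):

from streaming.base.util import clean_stale_shared_memory

# Removes leftover /dev/shm entries (e.g. 000006_locals) from a previous crashed run.
clean_stale_shared_memory()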
Expected behavior
I'd expect training to begin.