mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

'File exists: "/00000_locals"' when integrated with deepspeed training scripts #717

Open Clement25 opened 3 weeks ago

Clement25 commented 3 weeks ago

Environment

To reproduce

Steps to reproduce the behavior:

  1. pip install deepspeed
  2. deepspeed train.py ... (training arguments are omitted)

Expected behavior

[2024-07-08 15:29:47]   File "/mnt/data/weihan/projects/cepe/data.py", line 226, in load_streams
[2024-07-08 15:29:47]     self.encoder_decoder_dataset = StreamingDataset(streams=streams, epoch_size=self.epoch_size, allow_unsafe_types=True)
[2024-07-08 15:29:47]   File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/dataset.py", line 513, in __init__
[2024-07-08 15:29:47]     self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
[2024-07-08 15:29:47]   File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
[2024-07-08 15:29:47]     shm = SharedMemory(name, True, len(data))
[2024-07-08 15:29:47]   File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/shared/memory.py", line 41, in __init__
[2024-07-08 15:29:47]     shm = BuiltinSharedMemory(name, create, size)
[2024-07-08 15:29:47]   File "/opt/conda/envs/cepe/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
[2024-07-08 15:29:47]     self._fd = _posixshmem.shm_open(
[2024-07-08 15:29:47] FileExistsError: [Errno 17] File exists: '/000000_locals'

Additional context

snarayan21 commented 2 weeks ago

Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?
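Roughly like this, once at the very start of the training script, before any StreamingDataset is constructed (the surrounding placement is just a sketch, not your exact code):

  from streaming.base.util import clean_stale_shared_memory

  # Call once, before any StreamingDataset is created on this node.
  # It removes leftover shared-memory blocks (such as /000000_locals)
  # left behind by a previous or crashed run.
  clean_stale_shared_memory()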

sukritipaul5 commented 2 weeks ago

Hey @snarayan21! :) I've tried this to no avail. I also downgraded the mosaicml and deepspeed versions. Let me know if you have any other suggestions. I'm using A100s.

Clement25 commented 2 weeks ago

> Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?

I tried that, but it didn't work.

Clement25 commented 1 week ago

> Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?

I solved it by setting the environment variable LOCAL_WORLD_SIZE=$NUM_GPU.
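For anyone else who hits this: the same thing can also be done from inside the training script, as long as the variable is set before the StreamingDataset is constructed. A minimal sketch, assuming 8 GPUs per node (adjust the value to your setup):

  import os

  # The deepspeed launcher may not export LOCAL_WORLD_SIZE the way torchrun
  # does. Streaming uses it to work out which ranks share a node, so if it is
  # missing every rank can try to create the same shared-memory block and hit
  # FileExistsError. Set it to the number of GPUs per node before the dataset
  # is built; the value 8 here is only an example.
  os.environ.setdefault('LOCAL_WORLD_SIZE', '8')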