Open Clement25 opened 3 weeks ago
Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory()
and see if that addresses the issue?
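For reference, a minimal sketch of what that call could look like at the top of a training script. The function path is the one quoted above; the wrapper name `clean_shared_memory_once` is hypothetical, and the import guard is only there so the snippet also runs where the streaming package is not installed.

```python
def clean_shared_memory_once() -> bool:
    """Try to clear stale shared memory; return True if the call ran."""
    try:
        # Function path as quoted in the comment above.
        from streaming.base.util import clean_stale_shared_memory
    except ImportError:
        # The streaming package is not available in this environment.
        return False
    clean_stale_shared_memory()
    return True


if __name__ == "__main__":
    # Call once at the very start of the job, before any dataset is created.
    cleaned = clean_shared_memory_once()
    print(f"stale shared memory cleaned: {cleaned}")
```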
Hey @snarayan21 ! :) I've tried this to no avail. I also downgraded mosaicml and deepspeed versions. Let me know if you have any other suggestion(s). I'm using A100s.
I tried but it didn't work.
I solved it by setting the environment variable LOCAL_WORLD_SIZE=$NUM_GPU.
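In case it helps anyone who launches from Python rather than a shell, here is a rough sketch of the same workaround. It assumes one training process per GPU on the node and that the variable has to be set before the dataset is created; the helper name `set_local_world_size` and the value 8 are illustrative only, standing in for $NUM_GPU.

```python
import os


def set_local_world_size(num_gpus: int) -> str:
    """Set LOCAL_WORLD_SIZE unless the launcher already provided it."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # setdefault keeps any value the launcher has already exported.
    return os.environ.setdefault("LOCAL_WORLD_SIZE", str(num_gpus))


if __name__ == "__main__":
    # 8 is a placeholder for the number of GPUs on this node ($NUM_GPU).
    value = set_local_world_size(8)
    print(f"LOCAL_WORLD_SIZE={value}")
```

Using `setdefault` rather than a plain assignment means a value exported by the launcher is not overwritten.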