[Open] HIT-cwh opened this issue 9 months ago
+1 on this issue
+1
@HIT-cwh @SolenoidWGT A fix would be making a customized `CacheManager`, registered by setting `os.environ["TRITON_CACHE_MANAGER"] = '...'`. Reference: https://github.com/openai/triton/blob/main/python/triton/runtime/cache.py. In this manager, we only write the files from rank 0 and add a barrier for all ranks, e.g.:
```python
import os
import random

# Assumption: get_rank()/barrier() come from torch.distributed,
# which must already be initialized when kernels are compiled.
from torch.distributed import barrier, get_rank
from triton.runtime.cache import FileCacheManager


class ModifiedCacheManager(FileCacheManager):

    def put(self, data, filename, binary=True) -> str:
        if not self.cache_dir:
            raise RuntimeError("Could not create or locate cache dir")
        binary = isinstance(data, bytes)
        if not binary:
            data = str(data)
        assert self.lock_path is not None
        filepath = self._make_path(filename)
        # Random ID to avoid any collisions
        rnd_id = random.randint(0, 1000000)
        # Use the PID in case there are a bunch of these around,
        # so we can see which PID made the file
        pid = os.getpid()
        # Use a temp file to be robust against program interruptions
        # *** Rank 0 only ***
        if get_rank() == 0:
            temp_path = f"{filepath}.tmp.pid_{pid}_{rnd_id}"
            mode = "wb" if binary else "w"
            with open(temp_path, mode) as f:
                f.write(data)
            # os.replace is guaranteed to be atomic on POSIX systems
            # if it succeeds, so filepath cannot see a partial write
            os.replace(temp_path, filepath)
        # *** Add a distributed barrier so other ranks wait until
        # rank 0 has published the file ***
        barrier()
        return filepath
```
In my case, it works fine (you must ensure the code path is the same on all ranks).
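To register it, point `TRITON_CACHE_MANAGER` at the class before the first kernel compiles. If I read `triton/runtime/cache.py` correctly, the value is parsed as `module_path:ClassName`; the module name below is a placeholder for wherever you define `ModifiedCacheManager`:

```python
import os

# Set before any Triton kernel is compiled; "my_cache_manager" is
# a placeholder module containing the ModifiedCacheManager above.
os.environ["TRITON_CACHE_MANAGER"] = "my_cache_manager:ModifiedCacheManager"
```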
UPD (2024.04.07): I think this problem is fixed by https://github.com/openai/triton/pull/3544, so perhaps you don't need a customized manager anymore.
@HIT-cwh Hi, I met the same issue and resolved it by setting `TRITON_CACHE_DIR` to local storage instead of shared storage. The root cause in my case is a multi-node cluster with shared storage: since the processes run on different machines, two of them coincidentally had the same `pid`, and because I set the same random seed everywhere, they also shared the same `rnd_id`. As a result, two processes tried to `os.replace` the same `temp_path` simultaneously, and the later one crashed because `temp_path` no longer existed.
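To make the collision concrete, here is a minimal sketch of the race (the path and PID are made up, and the two writers are simulated in one process for illustration):

```python
import os
import random

cache_dir = "/tmp/shared_cache_demo"      # stand-in for the shared FS path
os.makedirs(cache_dir, exist_ok=True)
filepath = os.path.join(cache_dir, "_kernel.ttir")

# Both nodes seed the RNG identically, so rnd_id matches ...
random.seed(42)
rnd_id = random.randint(0, 1000000)
pid = 15735                               # ... and the PIDs coincide too
temp_path = f"{filepath}.tmp.pid_{pid}_{rnd_id}"

# Both writers create the same temp file, then both try to publish it.
with open(temp_path, "w") as f:
    f.write("ttir contents")

os.replace(temp_path, filepath)           # process A: succeeds, temp_path is gone
os.replace(temp_path, filepath)           # process B: raises FileNotFoundError
```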
@LyricZhao Is there any plan to support multi-node clusters with shared storage, without a handcrafted `CacheManager` plugin? I suspect reverting the following PR would make it work: https://github.com/openai/triton/pull/1569.
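For reference, the workaround I used is a single line before any Triton kernel compiles (the local path below is a placeholder):

```python
import os

# Keep Triton's cache on node-local storage so ranks on different
# machines never write to the same files.
os.environ["TRITON_CACHE_DIR"] = "/tmp/triton_cache"
```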
Hi, I guess this PR will help: https://github.com/openai/triton/pull/3544. BTW, I think `FileLock` isn't well supported on some NFS setups, while the atomicity of `os.replace` is guaranteed on almost all filesystems. So making `pid` and `rnd_id` different is enough.
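For example, a host-independent way to build the temp suffix (my own sketch, not what the PR above does) would be:

```python
import os
import uuid

def unique_temp_path(filepath: str) -> str:
    # uuid4() is backed by os.urandom, so it ignores the process-wide
    # random seed and will not collide across hosts sharing a PID.
    return f"{filepath}.tmp.pid_{os.getpid()}_{uuid.uuid4().hex}"
```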
During distributed training, I encountered the following problem when compiling Triton kernels:
The above error only occurs during distributed training (multi-process), and both the `/mnt/petrelfs/caoweihan/.triton/cache/cff628804055ab05f902072733c9ab2d/_rms_norm_bwd_dx_fused.ttir.tmp.pid_15735_304289` and `/mnt/petrelfs/caoweihan/.triton/cache/cff628804055ab05f902072733c9ab2d/_rms_norm_bwd_dx_fused.ttir` files do exist.
Given that the intermediate results across different processes are identical, I attempted to replace:
with:
This tweak squashed the error, but it's a hack, not a proper fix.
I would appreciate it if anyone could explain why this issue arises. After all, `os.replace(temp_path, filepath)` should behave as an atomic operation. Here is my system environment: