Open GuanhuaWang opened 8 months ago
I also encountered the same problem. Has anyone found the cause or a solution? Thanks
I have hit this problem too, and it happens more often when we run inside containers.
Any fix?
For those of you who are still stuck at this issue, here is an easy fix.
The problem is that the index cache is only built on global rank 0, so processes on the other nodes cannot access the cache. An easy fix that works for me is to let every node build the same cache by changing the condition check on the following line to `if build_indices and int(os.getenv("LOCAL_RANK", "0")) == 0:`
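To see why that condition fires once per node rather than once per job: `LOCAL_RANK` is the process index within a single node, so comparing it to 0 selects one builder on every node, whereas a global-rank check selects only one builder in the whole job. A minimal sketch of the suggested condition (the `should_build_indices` helper is a hypothetical name, not the project's actual code):

```python
import os

def should_build_indices(build_indices: bool) -> bool:
    # LOCAL_RANK is the process index *within one node* (set by
    # launchers such as torchrun). Checking it against 0 means the
    # first process on every node builds the index cache, so each
    # node ends up with a local copy even without shared storage.
    return build_indices and int(os.getenv("LOCAL_RANK", "0")) == 0

# One process per node passes the check:
os.environ["LOCAL_RANK"] = "0"
print(should_build_indices(True))   # True on each node's local rank 0
os.environ["LOCAL_RANK"] = "1"
print(should_build_indices(True))   # False on all other local ranks
```

Note that each node then rebuilds an identical cache, which trades some redundant work for not needing shared storage.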
Is it okay to store the data processed by rank 0 in shared storage?
If you have shared storage, the original code (which processes only on global rank 0) is fine.
When running distributed training across multiple nodes, it reports an error in build_idx_mapping for the doc/sample/shuffle indices.
I suspect it is a wait/sync issue across nodes. We should add a sync barrier after the file-index build completes.
Detailed error as below: