microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

FileNotFoundError: [Errno 2] No such file or directory: 'dataset/index-cache/xxx_doc_idx.npy' #356

Open GuanhuaWang opened 8 months ago

GuanhuaWang commented 8 months ago

When running distributed training across multiple nodes, it reports an error in _build_index_mappings while building the doc/sample/shuffle indices.

I suspect it is a wait/sync issue across nodes. We should add a sync barrier after building the file index completes.

Detailed error below:
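The suspected race can be sketched with a stdlib-only toy (this is an illustration, not Megatron code: threads stand in for nodes, and threading.Barrier stands in for the proposed cross-node sync barrier, which in Megatron-DeepSpeed itself would be something like torch.distributed.barrier()):

```python
import os
import tempfile
import threading


def run_ranks(world_size: int) -> list:
    """Simulate ranks: rank 0 writes the index cache, everyone
    waits at a barrier, then all ranks try to read it."""
    cache_path = os.path.join(tempfile.mkdtemp(), "doc_idx.npy")
    barrier = threading.Barrier(world_size)
    seen = [False] * world_size

    def worker(rank: int):
        if rank == 0:
            # Rank 0 plays the role of the node that builds the cache.
            with open(cache_path, "w") as f:
                f.write("doc_idx")
        # Without this barrier, the other ranks can reach the read
        # below before rank 0 has written the file -> FileNotFoundError.
        barrier.wait()
        seen[rank] = os.path.exists(cache_path)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


print(run_ranks(4))  # → [True, True, True, True]
```

Note the caveat discussed further down in this thread: a barrier only helps if all nodes can actually see the file rank 0 wrote, i.e. the index-cache directory is on shared storage.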

c000001:     train_dataset = build_dataset(0, 'train')
c000001:   File "/work/guanhua/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 148, in build_dataset
c000001:     dataset = GPTDataset(name, data_prefix, documents, indexed_dataset,
c000001:   File "/work/guanhua/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 261, in __init__
c000001:     _build_index_mappings(self.name, data_prefix,
c000001:   File "/work/guanhua/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 495, in _build_index_mappings
c000001:     doc_idx = np.load(idx_path['doc'], allow_pickle=True, mmap_mode='r')
c000001:   File "/usr/lib/python3/dist-packages/numpy/lib/npyio.py", line 417, in load
c000001:     fid = stack.enter_context(open(os_fspath(file), "rb"))
c000001: FileNotFoundError: [Errno 2] No such file or directory: '/work/guanhua/dataset/index-cache/43d5bc1477867ae12d66da383c2b664b_doc_idx.npy'

c000002:     train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
c000002:   File "/work/guanhua/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 35, in build_train_valid_test_datasets
c000002:     return _build_train_valid_test_datasets(data_prefix[0],
c000002:   File "/work/guanhua/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 156, in _build_train_valid_test_datasets
c000002:     train_dataset = build_dataset(0, 'train')
c000002:   File "/work/guanhua/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 148, in build_dataset
c000002:     dataset = GPTDataset(name, data_prefix, documents, indexed_dataset,
c000002:   File "/work/guanhua/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 261, in __init__
c000002:     _build_index_mappings(self.name, data_prefix,
c000002:   File "/work/guanhua/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 495, in _build_index_mappings
c000002:     doc_idx = np.load(idx_path['doc'], allow_pickle=True, mmap_mode='r')
c000002:   File "/usr/lib/python3/dist-packages/numpy/lib/npyio.py", line 417, in load
c000002:     fid = stack.enter_context(open(os_fspath(file), "rb"))
c000002: FileNotFoundError: [Errno 2] No such file or directory: '/work/guanhua/dataset/index-cache/43d5bc1477867ae12d66da383c2b664b_doc_idx.npy'
carojr commented 8 months ago

I also encountered the same problem. Has anyone found the cause or a solution? Thanks

wuyingjun-lucky commented 8 months ago

I have met this problem too, and it happens more often when we run inside a container.

askiad commented 7 months ago

Any fix?

UniverseFly commented 5 months ago

For those of you who are still stuck on this issue, here is an easy fix.

The problem is that the index cache is only built by rank 0, so the other nodes cannot access the cache. An easy fix that works for me is to let every node build the same cache by simply changing the condition check in the following line to if build_indices and int(os.getenv("LOCAL_RANK", "0")) == 0:

https://github.com/microsoft/Megatron-DeepSpeed/blob/7eb36a11b3a9c48ed07b93692ccf22bfb5577f7e/megatron/data/gpt_dataset.py#L392
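The effect of the changed condition can be illustrated with a small stand-alone sketch (should_build_cache is a hypothetical helper mirroring the proposed check, not a function in the repo). With LOCAL_RANK instead of the global rank, the check passes for the first process on every node, so each node builds its own copy of the cache:

```python
import os


def should_build_cache(build_indices: bool) -> bool:
    """Mirror of the suggested condition: local rank 0 on *every* node
    builds the cache, instead of only the single global rank 0."""
    return build_indices and int(os.getenv("LOCAL_RANK", "0")) == 0


os.environ["LOCAL_RANK"] = "0"  # first process on each node
print(should_build_cache(True))  # → True

os.environ["LOCAL_RANK"] = "3"  # other processes on the node skip the build
print(should_build_cache(True))  # → False
```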

divisionblur commented 4 months ago

For those of you who are still stuck on this issue, here is an easy fix.

The problem is that the index cache is only built by rank 0, so the other nodes cannot access the cache. An easy fix that works for me is to let every node build the same cache by simply changing the condition check in the following line to if build_indices and int(os.getenv("LOCAL_RANK", "0")) == 0:

https://github.com/microsoft/Megatron-DeepSpeed/blob/7eb36a11b3a9c48ed07b93692ccf22bfb5577f7e/megatron/data/gpt_dataset.py#L392

Is it okay to store the data processed by rank0 in shared storage?

i4never commented 4 months ago

If you have shared storage, the original code (which builds only on global rank 0) is fine.
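The trade-off in the last two comments can be summarized in a small hypothetical helper (build_on_this_rank is not Megatron code, just an illustration of the decision):

```python
def build_on_this_rank(global_rank: int, local_rank: int,
                       cache_on_shared_fs: bool) -> bool:
    """Decide which rank(s) should build the index cache.

    With shared storage, a single writer (global rank 0) is enough,
    because every node reads the same files. With node-local storage,
    each node needs its own copy, so local rank 0 on every node builds.
    """
    if cache_on_shared_fs:
        return global_rank == 0
    return local_rank == 0


# Shared filesystem: only the single global rank 0 builds.
print(build_on_this_rank(global_rank=0, local_rank=0, cache_on_shared_fs=True))   # → True
print(build_on_this_rank(global_rank=8, local_rank=0, cache_on_shared_fs=True))   # → False

# Node-local storage (e.g. per-container paths): local rank 0 on every node builds.
print(build_on_this_rank(global_rank=8, local_rank=0, cache_on_shared_fs=False))  # → True
```

In both cases the other ranks still need to wait (e.g. at a barrier) until the build finishes before loading the .npy files.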