As the error message says, it may be caused by the metric's update method never being called.
[stderr]: ******************************
[metric][batch=0]: time/epoch: 0
[stderr]: /root/miniconda3/envs/shearing/lib/python3.10/site-packages/torchmetrics/utilities/prints.py:42: UserWarning: The ``compute`` method of metric LanguageCrossEntropy was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
[stderr]: warnings.warn(*args, **kwargs) # noqa: B028
[stderr]: /root/miniconda3/envs/shearing/lib/python3.10/site-packages/torchmetrics/utilities/prints.py:42: UserWarning: The ``compute`` method of metric LanguagePerplexity was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
[stderr]: warnings.warn(*args, **kwargs) # noqa: B028
[stderr]: /root/miniconda3/envs/shearing/lib/python3.10/site-packages/torchmetrics/utilities/prints.py:42: UserWarning: The ``compute`` method of metric DomainLanguageCrossEntropy was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
[stderr]: warnings.warn(*args, **kwargs) # noqa: B028
[stderr]: /root/miniconda3/envs/shearing/lib/python3.10/site-packages/torchmetrics/utilities/prints.py:42: UserWarning: The ``compute`` method of metric DomainCount was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
[stderr]: warnings.warn(*args, **kwargs) # noqa: B028
[metric][batch=0]: metrics/train/LanguageCrossEntropy: nan
[metric][batch=0]: metrics/train/Perplexity: nan
[metric][batch=0]: metrics/train/ArXiv_LanguageCrossEntropy: nan
[metric][batch=0]: metrics/train/ArXiv_count: 0
[metric][batch=0]: metrics/train/Books_LanguageCrossEntropy: nan
[metric][batch=0]: metrics/train/Books_count: 0
[metric][batch=0]: metrics/train/Wikipedia_LanguageCrossEntropy: nan
[metric][batch=0]: metrics/train/Wikipedia_count: 0
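For context, the torchmetrics warning above fires whenever a metric's compute() is called with no preceding update(), and the result is NaN because no state has been accumulated, which matches the nan metrics in the log. A minimal reproduction with a generic MeanMetric (not the exact metric classes from the log):

```python
import torchmetrics

# compute() before any update() emits the same UserWarning seen in the
# log above and returns NaN, since no values have been accumulated
metric = torchmetrics.MeanMetric()
print(metric.compute())  # UserWarning + tensor(nan)
```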
It could be that the dataset you've created does not include the set entry. The count of each domain and the corresponding Cross-Entropy (CE) values are computed based on each block's set name. Could you check your mds files and see if that is the case?
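To illustrate what that entry looks like, here is a minimal sketch using MosaicML streaming's MDSWriter; the 'tokens' field name and dtype are assumptions for illustration, not necessarily what your conversion script uses:

```python
import numpy as np
from streaming import MDSWriter

# every sample should carry a 'set' column naming its domain
columns = {'tokens': 'bytes', 'set': 'str'}
with MDSWriter(out='mds_redpajama/for_prune/Wikipedia', columns=columns) as writer:
    tokens = np.array([372, 29889, 1334], dtype=np.uint16).tobytes()
    writer.write({'tokens': tokens, 'set': 'Wikipedia'})
```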
Hi, I listed the names in mds. Actually, I changed the set names to the file names in the dataset:
ls mds_redpajama/for_prune/
ArXiv/ Books/ eval_merge/ train_small/ Wikipedia/
Do you have an entry of set for each data point in your mds files for each domain? And does it correspond to the set_names you pass to the script? This line here collects each data point's set entry and uses it as part of the input to the model.
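For reference, a minimal sketch of how such a set-name-to-id mapping typically works (the set_names list here is an assumption based on the folder listing above; the real order comes from what you pass to the script):

```python
import torch

set_names = ['ArXiv', 'Books', 'Wikipedia']  # assumed domain order
set_name_to_id = {name: i for i, name in enumerate(set_names)}

# the collator turns each example's 'set' string into an integer id
examples = [{'set': 'Books'}, {'set': 'Wikipedia'}]
print(torch.tensor([set_name_to_id[e['set']] for e in examples]))  # tensor([1, 2])
```

With that ordering, Books maps to 1 and Wikipedia to 2, which is consistent with the batch["set"] tensor printed further down.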
I looked at the index file of each domain, and it does include the set entry.
I inserted print statements directly at the line of code you pointed to; the output is as follows:
# print the collated batch and each example's 'set' entry
print('batch: ', batch)
print('examples: ', [example["set"] for example in examples])
# map each example's set name to its integer domain id
batch["set"] = torch.tensor(
    [self.set_name_to_id[example["set"]] for example in examples])
print('batch["set"]: ', batch["set"])
---
batch: {'input_ids': tensor([[ 372, 29889, 1334, ..., 540, 1497, 29889],
[ 372, 29889, 1334, ..., 540, 1497, 29889],
[ 372, 29889, 1334, ..., 540, 1497, 29889],
...,
[ 3431, 29889, 29871, ..., 29953, 29900, 29892],
[ 3431, 29889, 29871, ..., 29953, 29900, 29892],
[ 3431, 29889, 29871, ..., 29953, 29900, 29892]]), 'labels': tensor([[ 372, 29889, 1334, ..., 540, 1497, 29889],
[ 372, 29889, 1334, ..., 540, 1497, 29889],
[ 372, 29889, 1334, ..., 540, 1497, 29889],
...,
[ 3431, 29889, 29871, ..., 29953, 29900, 29892],
[ 3431, 29889, 29871, ..., 29953, 29900, 29892],
[ 3431, 29889, 29871, ..., 29953, 29900, 29892]])}
examples: ['Books', 'Books', 'Books', 'Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia']
batch["set"] tensor([1, 1, 1, 2, 2, 2, 2, 2])
This result suggests that each example correctly corresponds to the domain its data belongs to. Is this result normal?
I found that aligning the names of all my folders with the folder names the authors use solved the problem. Issue closed.
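In other words, the domain folder names on disk have to match the set names the script expects. A hedged sanity check one could run (paths and names taken from the listing earlier in this thread):

```python
import os

set_names = ['ArXiv', 'Books', 'Wikipedia']  # the names passed to the script
root = 'mds_redpajama/for_prune'

# domains listed here but missing on disk would explain count 0 / nan CE
missing = [name for name in set_names if name not in os.listdir(root)]
print('domains missing on disk:', missing or 'none')
```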
I have the same issue. How did you fix it?
When I ran the pruning experiment, I only configured the dataset and made no other changes. It seems the metrics are never updated, and the log repeatedly prints the loss as nan, as follows:
I set pdb breakpoints in the metric's update function and in ComposerLlama's update_metric, but these breakpoints were never hit. The input data seems to be intact. I tested the train loader and trainer.eval, and everything is normal. However, this problem always occurs in trainer.fit.
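One way to confirm whether update() is ever reached without pdb (a hypothetical debugging aid, not part of the codebase) is to wrap the metric's update method so it prints whenever it is called during trainer.fit:

```python
import functools

def trace_update(metric):
    # hypothetical helper: shadow the instance's update() with a wrapper
    # that logs every call, so silence during fit() confirms it is skipped
    original = metric.update

    @functools.wraps(original)
    def wrapped(*args, **kwargs):
        print(f'{type(metric).__name__}.update called')
        return original(*args, **kwargs)

    metric.update = wrapped
    return metric
```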
The pruning settings: