krystalan / ClidSum

EMNLP 2022: ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization
https://arxiv.org/abs/2202.05599

OverflowError: int too big to convert #2

Closed: 18445864529 closed this issue 2 years ago

18445864529 commented 2 years ago

Hi. When I followed the README and ran the evaluation, I encountered the following error. Could you give me some help? Thank you in advance.

(zbw) pikaia34:/net/papilio/storage2/bowenz/clidsum $ sh eval.sh 
Some weights of the model checkpoint at /net/papilio/storage2/bowenz/tools/chinese-bert-wwm-ext were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/bin/bert-score", line 8, in <module>
    sys.exit(main())
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/lib/python3.9/site-packages/bert_score_cli/score.py", line 59, in main
    all_preds, hash_code = bert_score.score(
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/lib/python3.9/site-packages/bert_score/score.py", line 131, in score
    all_preds = bert_cos_score_idf(
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/lib/python3.9/site-packages/bert_score/utils.py", line 528, in bert_cos_score_idf
    embs, masks, padded_idf = get_bert_embedding(
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/lib/python3.9/site-packages/bert_score/utils.py", line 399, in get_bert_embedding
    padded_sens, padded_idf, lens, mask = collate_idf(all_sens, tokenizer, idf_dict, device=device)
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/lib/python3.9/site-packages/bert_score/utils.py", line 371, in collate_idf
    arr = [sent_encode(tokenizer, a) for a in arr]
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/lib/python3.9/site-packages/bert_score/utils.py", line 371, in <listcomp>
    arr = [sent_encode(tokenizer, a) for a in arr]
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/lib/python3.9/site-packages/bert_score/utils.py", line 213, in sent_encode
    return tokenizer.encode(
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2218, in encode
    encoded_inputs = self.encode_plus(
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2548, in encode_plus
    return self._encode_plus(
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 498, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 417, in _batch_encode_plus
    self.set_truncation_and_padding(
  File "/net/papilio/storage2/bowenz/anaconda3/envs/zbw/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 373, in set_truncation_and_padding
    self._tokenizer.enable_truncation(**target)
OverflowError: int too big to convert
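
For context: this OverflowError usually means the checkpoint's tokenizer config does not set model_max_length, so transformers falls back to a huge "unset" sentinel value (int(1e30)), which then overflows the 64-bit max_length argument accepted by the fast tokenizer's Rust backend. A hypothetical minimal reproduction, assuming the local chinese-bert-wwm-ext checkpoint leaves model_max_length unset:

# Hypothetical reproduction; assumes the tokenizer config sets no model_max_length.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/net/papilio/storage2/bowenz/tools/chinese-bert-wwm-ext"
)
print(tokenizer.model_max_length)  # sentinel, e.g. 1000000000000000019884624838656

# bert-score passes this value down as the truncation length; the Rust
# backend cannot represent it as a 64-bit integer, hence the error:
tokenizer.encode("测试", truncation=True, max_length=tokenizer.model_max_length)
# OverflowError: int too big to convert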

Here is the shell script I used.

model_path=/net/papilio/storage2/bowenz/tools/chinese-bert-wwm-ext
gold_file_path=/net/papilio/storage2/bowenz/clidsum/model_output/mdialbart_zh
generate_file_path=/net/papilio/storage2/bowenz/clidsum/model_output/mdialbart_zh

for num in {0..20}; do
  bert-score -r $gold_file_path/gold_summary_$num -c $generate_file_path/generated_summary_$num --lang zh --model $model_path --num_layers 8
done
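
For reference, a hypothetical Python equivalent of that loop using bert-score's score() API instead of the CLI, assuming each gold/generated file contains one summary per line:

# Sketch only; the paths and one-summary-per-line file format are assumptions.
from bert_score import score

out_dir = "/net/papilio/storage2/bowenz/clidsum/model_output/mdialbart_zh"
model_path = "/net/papilio/storage2/bowenz/tools/chinese-bert-wwm-ext"

for num in range(21):
    with open(f"{out_dir}/generated_summary_{num}") as f:
        cands = [line.strip() for line in f]
    with open(f"{out_dir}/gold_summary_{num}") as f:
        refs = [line.strip() for line in f]
    # Same settings as the CLI call: Chinese, custom model, layer 8 embeddings.
    P, R, F1 = score(cands, refs, lang="zh", model_type=model_path, num_layers=8)
    print(f"file {num}: F1 = {F1.mean().item():.4f}")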
18445864529 commented 2 years ago

I made it work by editing the source code of bert-score to call tokenizer.encode(max_length=min(512, tokenizer.model_max_length)). The model does not support inputs longer than 512 tokens, but tokenizer.model_max_length can somehow exceed that value (I didn't explore further, but at least it worked).
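
In case it helps others, a minimal sketch of that edit in sent_encode (bert_score/utils.py, line 213 in the traceback above); the surrounding code varies by bert-score version, and clamping max_length is the only real change:

# Hypothetical patched sent_encode; only the max_length clamp is essential.
def sent_encode(tokenizer, sent):
    """Encode one sentence, capping the truncation length at 512 tokens."""
    return tokenizer.encode(
        sent.strip(),
        add_special_tokens=True,
        truncation=True,
        # BERT checkpoints only have 512 position embeddings, and
        # tokenizer.model_max_length may be an "unset" sentinel (~1e30)
        # that overflows the fast tokenizer's 64-bit truncation argument.
        max_length=min(512, tokenizer.model_max_length),
    )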