OverflowError: int too big to convert

lancopku / label-words-are-anchors

Repository for Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning

MIT License


OverflowError: int too big to convert #5

Closed UGUESS-lzx closed 9 months ago

UGUESS-lzx commented 9 months ago

Sorry to bother you. When I run attention_attr.py I get "OverflowError: int too big to convert". What could be causing this? The traceback points here:

Traceback (most recent call last):
  File "/home/abc/icl/label-words-are-anchors/attention_attr.py", line 91, in <module>
    demonstrations_contexted = prepare_analysis_dataset(args.seeds[0])
  File "/home/abc/icl/label-words-are-anchors/attention_attr.py", line 87, in prepare_analysis_dataset
    demonstrations_contexted = tokenize_dataset(demonstrations_contexted, tokenizer=tokenizer)
  File "/home/abc/icl/label-words-are-anchors/icl/utils/data_wrapper.py", line 105, in tokenize_dataset
    tokenized_datasets = dataset.map(tokenize_function, batched=True)

leanwang326 commented 9 months ago

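Sorry, I never ran into this when running the code myself, and I can't think of the cause offhand. Could you share more details, such as the dataset and model you are using? One possibility is that tokenizer.max_len_single_sentence, used by tokenize_dataset in ./icl/utils/data_wrapper, has a bad value. It is supposed to match the maximum length the model can accept, so something may be off there.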

UGUESS-lzx commented 9 months ago

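Hello, I am using the same datasets and model as the original repo, with the settings shown in the attached screenshot. I tried both sst2 and agnews and got this error each time. Is there a parameter I need to adjust?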

leanwang326 commented 9 months ago

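These parameters shouldn't need adjusting; running with the same settings works fine on my side. Sorry, I can't tell where the problem is for now. If you find out anything more, feel free to contact me.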

UGUESS-lzx commented 9 months ago

Do you set tokenizer.max_len_single_sentence anywhere else? With the original code, printing tokenizer.max_len_single_sentence gives 1000000000000000019884624838656. Is that abnormal? What could have gone wrong?
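A note on the number itself: it is exactly `int(1e30)`, which appears to be the placeholder the Hugging Face transformers library falls back to for `model_max_length` when the loaded tokenizer files specify no limit. The sketch below only checks that arithmetic and shows that such a value cannot be packed into a signed 64-bit integer. That a fixed-width conversion like this is what fails inside `dataset.map` is an assumption, not something the traceback confirms.

```python
# The value reported in the thread is exactly int(1e30).
SENTINEL = int(1e30)
INT64_MAX = 2**63 - 1


def fits_in_signed_64_bits(value: int) -> bool:
    """Return True if `value` can be stored in a signed 64-bit integer."""
    try:
        value.to_bytes(8, "little", signed=True)
        return True
    except OverflowError:
        # Python raises "OverflowError: int too big to convert" here.
        return False


print(SENTINEL == 1000000000000000019884624838656)
print(SENTINEL > INT64_MAX)
print(fits_in_signed_64_bits(1024))
print(fits_in_signed_64_bits(SENTINEL))
```

A length of 1024 passes the check, while the placeholder value does not.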

leanwang326 commented 9 months ago

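Ah, that must be the problem then. For gpt2 this value should be 1024. Our code shouldn't change it, so I'm not sure whether it comes from the tokenizer version used at load time or from your local tokenizer having the wrong value. You can set it to 1024, or to tokenizer.model_max_length (if that one is 1024). I'll also add a check to this part of the code later. Thank you!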

leanwang326 commented 9 months ago

If you run into any other problems, just let me know.

UGUESS-lzx commented 9 months ago

Sorry, how exactly should I change this parameter? I tried model = LoadClass.from_pretrained(folder_path, max_len_single_sentence=1024), but it still prints tokenizer.max_len_single_sentence:1000000000000000019884624838656 and tokenizer.model_max_length:1000000000000000019884624838656. Also, gpt2-xl on Hugging Face has no tokenizer_config.json, and I couldn't find a corresponding parameter in config.json either.
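One way to apply the suggested fix is to overwrite the limit on the tokenizer object after loading it, rather than passing a keyword when loading. On Hugging Face tokenizers, `model_max_length` is a settable attribute and `max_len_single_sentence` is, to my understanding, derived from it. The sketch below is a minimal illustration under those assumptions: `clamp_model_max_length` and `SANE_LIMIT` are hypothetical names, and a `SimpleNamespace` stands in for the real tokenizer so the example runs without transformers installed.

```python
from types import SimpleNamespace

GPT2_CONTEXT_LENGTH = 1024  # the value the maintainer expects for gpt2
SANE_LIMIT = 1_000_000      # hypothetical threshold for "obviously unset"


def clamp_model_max_length(tokenizer, fallback=GPT2_CONTEXT_LENGTH):
    """Replace an implausibly large model_max_length with `fallback`.

    `tokenizer` is any object exposing a writable `model_max_length`.
    """
    if tokenizer.model_max_length > SANE_LIMIT:
        tokenizer.model_max_length = fallback
    return tokenizer.model_max_length


# Stand-in for a tokenizer loaded without a configured maximum length.
fake_tokenizer = SimpleNamespace(model_max_length=int(1e30))
print(clamp_model_max_length(fake_tokenizer))
```

With a real tokenizer, the same function would be called once right after loading and before `tokenize_dataset`.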

leanwang326 commented 9 months ago

I've added a default_max_length_dict = {} to data_wrapper.py that lets you specify the value. You can add an entry "gpt2-xl":1024 to set it.

UGUESS-lzx commented 9 months ago

OK, thank you so much! (That said, I think the first max_length there should be removed, and when loading a local model the entry should be "model_path":1024.)
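The override described above can be sketched roughly as follows. This is not the repository's actual code: only the name `default_max_length_dict` and the `"gpt2-xl": 1024` entry come from the thread, while `get_max_length` and the local path are hypothetical and used purely for illustration.

```python
# Per-model overrides, keyed by whatever string is used to load the model.
default_max_length_dict = {
    "gpt2-xl": 1024,                 # hub model name
    "/path/to/local/gpt2-xl": 1024,  # hypothetical local model path
}


def get_max_length(model_name_or_path: str, tokenizer_reported: int) -> int:
    """Prefer an explicit override; otherwise trust the tokenizer's value."""
    if model_name_or_path in default_max_length_dict:
        return default_max_length_dict[model_name_or_path]
    return tokenizer_reported


print(get_max_length("gpt2-xl", int(1e30)))
print(get_max_length("some-other-model", 2048))
```

Because the lookup is by exact string, a model loaded from a local folder needs its own entry keyed by that folder path, which is the point raised in the comment above.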

leanwang326 commented 9 months ago

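Ah, you're right about local models; I'll add a note about that. As for max_length, it should be fine. Maybe an extra line crept in when you updated your copy; the code in the GitHub repository looks correct to me.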

UGUESS-lzx commented 9 months ago

Oh, OK. Thanks again!