renmada / t5-pegasus-pytorch

399 stars · 60 forks

Running the summarization example raises an error #31

Open saymyname77 opened 2 years ago

saymyname77 commented 2 years ago

Added at the end of the file:

```python
if __name__ == '__main__':
    generate('谷歌旗下的YouTube表示,自去年以来,已有13万个视频从其平台上删除,当时它禁止传播有关Covid疫苗的错误信息的内容。在一篇博客文章中,该公司表示,它已经看到有关Covid疫苗的虚假声明“蔓延到有关疫苗的错误信息中”。“我们正在扩大我们在YouTube上的医疗错误信息政策,对当前管理的疫苗进行新的指导方针,这些疫苗已被当地卫生当局和世界卫生组织批准并确认为安全有效,”该帖子说,指的是世界卫生组织。', max_length=40)
```

Error output:

```
Calling T5PegasusTokenizer.from_pretrained() with the path to a single file or url is deprecated
Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.391 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 771, in convert_to_tensors
    tensor = as_tensor(value)
RuntimeError: Could not infer dtype of NoneType

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_csl.py", line 251, in <module>
    generate(
  File "run_csl.py", line 171, in generate
    feature = tokenizer.encode(text, return_token_type_ids=True, return_tensors='pt',
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2104, in encode
    encoded_inputs = self.encode_plus(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2420, in encode_plus
    return self._encode_plus(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 444, in _encode_plus
    return self.prepare_for_model(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2881, in prepare_for_model
    batch_outputs = BatchEncoding(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 276, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 787, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
```

saymyname77 commented 2 years ago

@renmada

saymyname77 commented 2 years ago

In the `generate` function, I added `padding=True` or `truncation=True` (or both) to the line `feature = tokenizer.encode(text, return_token_type_ids=True, return_tensors='pt', max_length=512)`, but it still fails.
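[Editor's note: a hedged guess at why those flags don't help. The inner `RuntimeError: Could not infer dtype of NoneType` suggests one of the encoded fields is `None` — plausibly `token_type_ids`, since `return_token_type_ids=True` was requested from a tokenizer family that may never fill them in. Padding and truncation only reshape lists; they cannot repair a `None` value. A minimal stand-in sketch (not the real `transformers` code) of that failure mode:]

```python
# Sketch of the suspected failure: a None field among the encoded features
# cannot be converted to a tensor, regardless of padding/truncation settings.
# `convert_to_tensors` here only mimics BatchEncoding.convert_to_tensors.

def convert_to_tensors(encoded):
    """Every value must be tensor-like; None triggers the error."""
    tensors = {}
    for key, value in encoded.items():
        if value is None:  # torch raises "Could not infer dtype of NoneType"
            raise ValueError(
                f"Unable to create tensor for key {key!r}; "
                "padding/truncation cannot fix a None value"
            )
        tensors[key] = list(value)
    return tensors

# If return_token_type_ids=True is requested but the tokenizer never
# produces token type ids, the encoded dict can look like this:
encoded = {"input_ids": [101, 2769, 102], "token_type_ids": None}

try:
    convert_to_tensors(encoded)
except ValueError as e:
    print("failed:", e)

# Not requesting token_type_ids avoids the None field entirely:
encoded.pop("token_type_ids")
print(convert_to_tensors(encoded))
```

If this guess is right, dropping `return_token_type_ids=True` from the `tokenizer.encode(...)` call would be worth trying.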

renmada commented 2 years ago

You can't pass raw text to the model directly; it has to be converted to token ids first.

saymyname77 commented 2 years ago

> You can't pass raw text to the model directly; it has to be converted to token ids first.

Which method does the conversion?
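[Editor's note: conventionally this conversion is the tokenizer's `encode` (text → token ids) and `decode` (token ids → text) pair. A toy stand-in below illustrates the round trip; `ToyTokenizer` is purely illustrative and is not this repo's `T5PegasusTokenizer` API.]

```python
# Toy tokenizer illustrating the text -> token ids -> text round trip.
# Real code would use T5PegasusTokenizer.encode/decode from this repo.

class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.inv = {i: w for w, i in vocab.items()}

    def encode(self, text):
        # text -> list of token ids; unknown words map to id 0
        return [self.vocab.get(w, 0) for w in text.split()]

    def decode(self, ids):
        # list of token ids -> text
        return " ".join(self.inv.get(i, "[UNK]") for i in ids)

tok = ToyTokenizer({"[UNK]": 0, "hello": 1, "world": 2})
ids = tok.encode("hello world")
print(ids)              # [1, 2]
print(tok.decode(ids))  # hello world
```

The ids (not the string) are what the model's `generate` consumes; the model's output ids are then decoded back into text.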