How to fine-tune with T5-pegasus-small in place of uie-char-small on Chinese datasets

universal-ie / UIE

Unified Structure Generation for Universal Information Extraction


How to fine-tune with T5-pegasus-small in place of uie-char-small on Chinese datasets #33

Open ygxw0909 opened 2 years ago

ygxw0909 commented 2 years ago

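So far I have written the added tokens of uie-small-char into the unused-token slots of T5-pegasus-small's vocab.txt, and set prefix_max_len to 0 to disable the SSI, but I am running into all sorts of errors. Could the authors share how other pretrained models (preferably Chinese ones) can be used for the comparison experiments?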
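The vocabulary edit described above can be sketched as follows. This is a minimal illustration, not the repository's procedure: it assumes a BERT-style vocab.txt with one token per line and placeholder entries named `[unusedN]`, and the function name `fill_unused_slots` and the sample tokens are hypothetical.

```python
import re

# Assumption: reserved placeholders look like "[unused1]", "[unused2]", ...
UNUSED_PATTERN = re.compile(r"^\[unused\d+\]$")


def fill_unused_slots(vocab_lines, new_tokens):
    """Return a copy of the vocab with [unusedN] entries replaced by new_tokens, in order."""
    vocab = list(vocab_lines)
    existing = set(vocab)
    # Tokens already present in the vocab do not need a slot.
    pending = [tok for tok in new_tokens if tok not in existing]
    slots = [i for i, tok in enumerate(vocab) if UNUSED_PATTERN.match(tok)]
    if len(pending) > len(slots):
        raise ValueError(
            f"need {len(pending)} slots but only {len(slots)} unused entries exist"
        )
    for slot, tok in zip(slots, pending):
        vocab[slot] = tok
    return vocab


# Illustrative tokens only; use the added tokens of the checkpoint you are porting.
vocab = ["[PAD]", "[unused1]", "[unused2]", "[UNK]", "的"]
print(fill_unused_slots(vocab, ["<spot>", "<asoc>"]))
```

Because the placeholders are overwritten in place, the vocabulary size and every other token id stay the same, so the embedding matrix does not need to be resized. Whether the tokenizer then treats the new entries as single, unsplittable tokens depends on the tokenizer class and is not covered by this sketch.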

luyaojie commented 2 years ago

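Hello, our comparison experiments are based on the T5 model and use the same tokenizer.

Adapting to a different model mainly means switching to that model's tokenizer. However, pretrained models differ in how they were pretrained, so the adaptation may vary from model to model. It should also be possible to adapt to other tokenizers and generation models.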

ref https://github.com/luyaojie/Text2Event/issues/18

ygxw0909 commented 2 years ago

Thank you for the answer!

The error I am getting is: ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

The command I ran is: nohup bash -u run_uie_finetune.bash -v -d 2 -b 32 -k 1 --lr 1e-4 --warmup_ratio 0.06 -i entity_zh/data --epoch 30 --map_config config/offset_map/longer_first_offset_zh.yaml -m hf_models/t5-char-pegasus >> nohup.out &

Apart from swapping the model and setting max_prefix_len to 0, I made no other changes. The error occurs while train() is feeding data to the model, so it should not be a tokenizer or decoder problem. Do you have any idea what might be causing it?

ygxw0909 commented 2 years ago

I printed as much of the data as I could and found no problems with tensor dimensions or padding.
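For context, this ValueError is typically raised when the features in a batch cannot be stacked into a rectangular tensor, for example because one field was left unpadded. The sketch below is a hypothetical diagnostic, not code from the repository: `ragged_keys` and `pad_batch` are illustrative names, and the sample batch is made up. One possible cause worth checking (a guess, not confirmed in this thread) is that BERT-style tokenizers return an extra `token_type_ids` field by default, which a collator written for T5Tokenizer may not pad.

```python
def ragged_keys(features):
    """Return {key: sorted lengths} for list-valued keys whose lengths differ across features."""
    lengths = {}
    for feature in features:
        for key, value in feature.items():
            if isinstance(value, (list, tuple)):
                lengths.setdefault(key, set()).add(len(value))
    return {key: sorted(sizes) for key, sizes in lengths.items() if len(sizes) > 1}


def pad_batch(features, pad_values):
    """Right-pad every list-valued key to the longest length found in the batch."""
    max_len = {}
    for feature in features:
        for key, value in feature.items():
            if isinstance(value, list):
                max_len[key] = max(max_len.get(key, 0), len(value))
    padded = []
    for feature in features:
        row = {}
        for key, value in feature.items():
            if isinstance(value, list):
                # Keys without an explicit pad value fall back to 0.
                fill = pad_values.get(key, 0)
                row[key] = value + [fill] * (max_len[key] - len(value))
            else:
                row[key] = value
        padded.append(row)
    return padded


# Made-up batch: every field has a different length in the two examples.
batch = [
    {"input_ids": [101, 7, 8, 102], "token_type_ids": [0, 0, 0, 0], "labels": [5, 6]},
    {"input_ids": [101, 9, 102], "token_type_ids": [0, 0, 0], "labels": [5, 6, 7]},
]
print(ragged_keys(batch))
print(ragged_keys(pad_batch(batch, {"labels": -100})))
```

Running `ragged_keys` on the list of features just before it is handed to the collator shows which field, if any, is still uneven. The pad value of -100 for labels follows the common convention of ignoring those positions in the loss.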

luyaojie commented 2 years ago

Are you using the repository's default tokenizer or the one that ships with t5-char-pegasus?

ygxw0909 commented 2 years ago

I am using the t5-char-pegasus tokenizer. I also wrote the additional tokens the code requires into the reserved unused slots of t5-char-pegasus's vocab.txt.

luyaojie commented 2 years ago

I am not very familiar with the t5-char-pegasus model. It looks like the tokenizer's class methods do not line up with what the code expects.

ygxw0909 commented 2 years ago

OK, I will try again. Thank you for the answer!

xxllp commented 2 years ago

It would be great if Chinese T5 were supported directly.

zdgithub commented 2 years ago

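So far I have written the added tokens of uie-small-char into the unused-token slots of T5-pegasus-small's vocab.txt, and set prefix_max_len to 0 to disable the SSI, but I am running into all sorts of errors. Could the authors share how other pretrained models (preferably Chinese ones) can be used for the comparison experiments?

I found that BertTokenizer and T5Tokenizer behave completely differently: the former splits Chinese text into individual characters, whereas the latter segments it into words.

@luyaojie Hello, I wanted to reproduce the paper's T5-v1.1-base results on the CoNLL03 dataset, so I fine-tuned the Google/T5-v1_1-base model with epoch=200 and lr=1e-4, but the results were extremely poor. Could you share the hyperparameters you used to fine-tune T5-v1.1-base on this dataset?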
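The character-level behaviour mentioned above can be illustrated without loading any model. The sketch below only mimics the step in which a BERT-style tokenizer isolates each Chinese character. It is a simplification under stated assumptions: it checks only the basic CJK Unified Ideographs block, it skips the WordPiece step, and it does not reproduce T5Tokenizer, whose segmentation comes from a trained SentencePiece model. The function names are hypothetical.

```python
def is_cjk(char):
    """True for characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return 0x4E00 <= ord(char) <= 0x9FFF


def split_cjk_chars(text):
    """Put whitespace around every CJK character, then split on whitespace."""
    pieces = []
    for char in text:
        pieces.append(f" {char} " if is_cjk(char) else char)
    return "".join(pieces).split()


print(split_cjk_chars("我爱London"))
```

Each Chinese character becomes its own token while the Latin word stays whole, which is why offsets and target sequences built for one tokenizer family generally cannot be reused with the other.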

luyaojie commented 2 years ago

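Could you share the hyperparameters you used to fine-tune T5-v1.1-base on this dataset?

Hello, the hyperparameters for fine-tuning T5-v1.1-base are the same as those for UIE-base. They are listed in detail in Table 6 of the paper.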

zdgithub commented 2 years ago

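@luyaojie Thanks for the reply! For the NER task, when you fine-tuned T5-v1.1-base in the paper, did you concatenate the entities directly into a text string, as described in the mT5 paper? For example: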

input text: London is the hometown of Michael.
output text: LOC: London; PER: Michael

Is that how the NER task was organized?
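The target format proposed in the question can be sketched as a pair of helper functions. This only illustrates the "LABEL: span; LABEL: span" layout from the example above. The thread does not confirm that the authors used it, and the names `linearize_entities` and `parse_entities` are hypothetical.

```python
def linearize_entities(entities):
    """Join (label, span) pairs into 'LABEL: span; LABEL: span'."""
    return "; ".join(f"{label}: {span}" for label, span in entities)


def parse_entities(output_text):
    """Recover (label, span) pairs from a linearized string, skipping malformed chunks."""
    entities = []
    for chunk in output_text.split(";"):
        label, sep, span = chunk.strip().partition(":")
        if not sep or not label.strip() or not span.strip():
            continue  # no colon, or an empty label or span
        entities.append((label.strip(), span.strip()))
    return entities


target = linearize_entities([("LOC", "London"), ("PER", "Michael")])
print(target)
print(parse_entities(target))
```

A format this simple cannot represent spans that themselves contain ";" and it splits each chunk at its first ":", so a real setup would need escaping or different delimiters.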