jsksxs360 / How-to-use-Transformers

A quick-start tutorial for the Transformers library
https://transformers.run/
Apache License 2.0

Chapter 10 (translation task): changes to some of the transformers APIs it uses #27

Open PikaChyou opened 1 month ago

PikaChyou commented 1 month ago

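Chapter 10 (the translation task) switches the tokenizer away from its default encoding settings with the context manager as_target_tokenizer(), which is deprecated and scheduled for removal. The chapter currently reads:

By default the tokenizer encodes text with the source-language settings; to encode the target language you need the context manager as_target_tokenizer()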

 zh_sentence = train_data[0]["chinese"]
 en_sentence = train_data[0]["english"]

 inputs = tokenizer(zh_sentence)
 with tokenizer.as_target_tokenizer():
    targets = tokenizer(en_sentence)

With the current version of transformers, as_target_tokenizer() still runs correctly, but it emits a warning that the API will be removed in the next major version:

UserWarning: `as_target_tokenizer` is deprecated and will be removed in v5 of Transformers. You can tokenize your labels by using the argument `text_target` of the regular `__call__` method (either in the same call as your input texts if you use the same keyword arguments, or in a separate call.
  warnings.warn(
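
Because the old call still succeeds, the warning is easy to overlook. One way to catch remaining uses, for example in a test run, is to promote this specific warning to an error with Python's standard `warnings` module. This is a minimal sketch: `legacy_encode` is a hypothetical stand-in that emits an abbreviated copy of the message above, and it is not the real tokenizer.

```python
import warnings


def legacy_encode():
    # Hypothetical stand-in for code that still uses as_target_tokenizer();
    # it only reproduces (an abbreviated form of) the deprecation message.
    warnings.warn(
        "`as_target_tokenizer` is deprecated and will be removed in v5 of Transformers.",
        UserWarning,
    )
    return "encoded"


def uses_deprecated_api(func):
    """Return True if calling func triggers the as_target_tokenizer warning."""
    with warnings.catch_warnings():
        # Turn only this warning into an exception; other warnings are unaffected.
        warnings.filterwarnings(
            "error",
            message=".*as_target_tokenizer.*",
            category=UserWarning,
        )
        try:
            func()
        except UserWarning:
            return True
    return False


print(uses_deprecated_api(legacy_encode))
```

The filter is scoped to the `with` block, so the rest of the program keeps its normal warning behaviour.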

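The approach Hugging Face officially recommends is to encode with the text_target argument instead; the official documentation describes it in detail.

I therefore suggest changing the wording in the chapter to:

By default the tokenizer encodes text with the source-language settings; to encode the target language, pass the text through the text_target argument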

 zh_sentence = train_data[0]["chinese"]
 en_sentence = train_data[0]["english"]

 inputs = tokenizer(zh_sentence)
 targets = tokenizer(text_target=en_sentence)
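
As the warning message notes, the labels can also be tokenized in the same call as the inputs when both use the same keyword arguments. The sketch below assumes a Hugging Face tokenizer is passed in; the function name `preprocess_pairs` and the `max_length` default of 128 are illustrative and do not come from the tutorial.

```python
def preprocess_pairs(tokenizer, sources, targets, max_length=128):
    """Encode source and target sentences in a single tokenizer call.

    `tokenizer` is assumed to be a Hugging Face tokenizer whose __call__
    accepts `text_target`; the function name and the max_length default
    are illustrative assumptions.
    """
    # The positional argument is encoded with the source-language settings;
    # text_target is encoded with the target-language settings.
    return tokenizer(
        sources,
        text_target=targets,
        max_length=max_length,
        truncation=True,
    )
```

With the variables from the snippet above, `preprocess_pairs(tokenizer, zh_sentence, en_sentence)` should return the encoding of the Chinese sentence together with a `labels` entry holding the English token IDs, so no separate `targets` call is needed.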
jsksxs360 commented 1 month ago

Thank you very much! The tutorial and the code have been updated.