hankcs / HanLP

Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic textual similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified-traditional Chinese conversion, natural language processing
https://hanlp.hankcs.com/
Apache License 2.0

TransformerNamedEntityRecognizerTF does not respect the data's max_seq_length #1847

Closed xiajing10 closed 1 year ago

xiajing10 commented 1 year ago

Describe the bug: TransformerNamedEntityRecognizerTF fails to split the input data by max_seq_length; after reading the beginning, it reports that all subsequent data exceeds max_seq_length. Comparing with TransformerNamedEntityRecognizer in transformer_ner, I suspect the TF version is missing a build_dataset step with TSVTaggingDataset.

Code to reproduce the issue:

import hanlp
from hanlp.components.ner.ner_tf import TransformerNamedEntityRecognizerTF

recognizer = TransformerNamedEntityRecognizerTF()
save_dir = 'data/model/ner/finetune_ner_bert-base-chinese_msra'
CONLL03_RESUME_TRAIN = 'train.tsv'  # CoNLL-style token/tag TSV
CONLL03_RESUME_TEST = 'val.tsv'
recognizer.fit(CONLL03_RESUME_TRAIN, CONLL03_RESUME_TEST, save_dir,
               epochs=3,
               adam_epsilon=1e-6,
               warmup_steps=0.1,
               weight_decay=0.01,
               word_dropout=0.1,
               max_seq_len=512,
               char_level=True,
               hard_constraint=True,
               transformer='bert-base-chinese',
               finetune=hanlp.pretrained.ner.MSRA_NER_BERT_BASE_ZH,
               seed=0)

Switching the same data to TransformerNamedEntityRecognizer with MSRA_NER_ELECTRA_SMALL_ZH trains the model normally. Also, the parameter names max_seq_len and max_seq_length are easy to confuse: some fit signatures use the former and others the latter.

Describe the current behavior:
WARNING Input tokens [...] exceed the max sequence length of 126. The exceeded part will be truncated and ignored. You are recommended to split your long text into several sentences within 126 tokens beforehand.

After the data is read as TSV, it is not subsequently split by max_seq_length.
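The warning itself points at the workaround: split long inputs before training. A minimal pre-splitting sketch in plain Python (chunk_tokens is a hypothetical helper, not a HanLP API; fixed-width chunking can cut through an entity span, so splitting at sentence boundaries is safer in practice):

def chunk_tokens(tokens, tags, max_len=126):
    # Hypothetical helper: yield token/tag windows no longer than max_len.
    for start in range(0, len(tokens), max_len):
        yield tokens[start:start + max_len], tags[start:start + max_len]

# Example: a 300-token sentence becomes chunks of 126, 126 and 48 tokens.
tokens, tags = ['字'] * 300, ['O'] * 300
for tok_chunk, tag_chunk in chunk_tokens(tokens, tags):
    assert len(tok_chunk) <= 126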

Expected behavior: the TSV input should be split into segments of at most max_seq_length before being fed to the model, instead of everything past the first segment being truncated.

System information:
- OS Platform and Distribution: macOS
- Python version: 3.9
- HanLP version: 2.1.0b50

Other info / logs: none

hankcs commented 1 year ago

This is not a bug, and the naming is not confusing either.

  1. TransformerNamedEntityRecognizerTF's max_seq_length is used to compile a static Keras computation graph. Sequences that exceed max_seq_length are the user's responsibility: "You are recommended to split your long text into several sentences within 126 tokens beforehand."
  2. TransformerNamedEntityRecognizer is the PyTorch implementation; its TSVTaggingDataset is a subclass of torch.utils.data.dataset.Dataset, which of course has nothing to do with the TensorFlow implementation, and the TensorFlow implementation could not use it anyway.
  3. TensorFlow is a poor product, with bugs cropping up endlessly. It recently dropped Windows support, it has essentially lost the competition with torch, and Google has largely abandoned it. HanLP's TF implementation exists only for compatibility; we do not intend to spend our limited effort developing duplicate functionality. Since TransformerNamedEntityRecognizer works for you, why not just stick to it? (See the sketch below.)
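For reference, a minimal sketch of that PyTorch route, using the component named in this thread (the hyperparameters and the hfl/chinese-electra-180g-small-discriminator checkpoint are illustrative assumptions, not values confirmed in this issue):

import hanlp
from hanlp.components.ner.transformer_ner import TransformerNamedEntityRecognizer

recognizer = TransformerNamedEntityRecognizer()
recognizer.fit('train.tsv', 'val.tsv',  # same TSV data as in the report above
               'data/model/ner/finetune_ner_electra_small_msra',
               epochs=3,
               max_seq_len=512,  # assumed parameter name; the thread notes fit signatures vary
               transformer='hfl/chinese-electra-180g-small-discriminator',  # assumed backbone
               finetune=hanlp.pretrained.ner.MSRA_NER_ELECTRA_SMALL_ZH)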
xiajing10 commented 1 year ago

Thanks! Among the pretrained NER models supported by HanLP's finetune parameter, only MSRA_NER_ELECTRA_SMALL_ZH of the three Chinese models can be loaded as a torch model, but it still cannot be trained (all evaluation metrics are 0). Any suggestions?

hankcs commented 1 year ago

Your finetune data may simply be too small. Also, finetuning does not support new tags outside the existing vocab.
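One quick sanity check for the second point is to compare the tag set of the TSV files against the pretrained model's tag vocabulary. A minimal sketch in plain Python (tsv_tagset is a hypothetical helper assuming a CoNLL-style token<TAB>tag layout, not a HanLP API):

def tsv_tagset(path):
    # Hypothetical helper: collect the tag column of a token<TAB>tag TSV.
    tags = set()
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # blank lines separate sentences
                tags.add(line.split('\t')[-1])
    return tags

print(sorted(tsv_tagset('train.tsv') | tsv_tagset('val.tsv')))
# Every tag printed here must already exist in the pretrained model's vocab;
# any new tag would not be supported by finetuning.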