Closed — xiajing10 closed this issue 1 year ago
This is not a bug, and the naming is not confusing. TSVTaggingDataset is a subclass of torch.utils.data.dataset.Dataset, so naturally it has nothing to do with the TensorFlow implementation, and the TensorFlow implementation could not use it anyway.

Thanks! Among the pretrained NER models accepted by HanLP's finetune parameter, only one of the three Chinese models, MSRA_NER_ELECTRA_SMALL_ZH, can be loaded as a torch model, but even that one cannot be trained (evaluation scores are all 0). Any suggestions?
Your fine-tuning data may be too small. Also, fine-tuning does not support new tags outside the existing vocab.
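Given the constraint above, one way to catch the "new tags outside the vocab" problem before training is to compare the tag set of the fine-tuning TSV against the pretrained model's tag vocab. The sketch below is a stdlib-only assumption of that workflow: the sample pretrained tag set and the BMES scheme shown are illustrative, not the actual vocab of MSRA_NER_ELECTRA_SMALL_ZH.

```python
def read_tsv_tags(lines):
    """Collect the tag column (2nd field) from CoNLL-style TSV lines."""
    tags = set()
    for line in lines:
        line = line.strip()
        if line:  # blank lines separate sentences
            tags.add(line.split("\t")[1])
    return tags

# Hypothetical tag vocab of a pretrained MSRA model (illustrative BMES scheme).
pretrained_tags = {"O", "B-NR", "M-NR", "E-NR", "S-NR"}

sample = ["华\tB-NR", "明\tE-NR", "说\tO", "", "你\tO"]
new_tags = read_tsv_tags(sample) - pretrained_tags
print(new_tags)  # an empty set means no out-of-vocab tags block fine-tuning
```

If the difference is non-empty, either map the offending tags onto the pretrained scheme or train from scratch instead of fine-tuning.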
Describe the bug TransformerNamedEntityRecognizerTF fails to split the input data by max_seq_length: it reads the beginning and then warns that all subsequent data exceeds max_seq_length. Comparing it with TransformerNamedEntityRecognizer in transformer_ner, I suspect the cause is the missing build_dataset / TSVTaggingDataset logic.
Code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem.
Switching the same data to TransformerNamedEntityRecognizer with MSRA_NER_ELECTRA_SMALL_ZH trains a model normally. Also, the parameter names max_seq_len and max_seq_length are used inconsistently: some fit() methods take the former, others the latter.
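Until the naming inconsistency is resolved upstream, a caller-side workaround is to normalize both spellings to whichever name the target component expects before passing keyword arguments to its fit(). This is a hedged sketch; `normalize_seq_len_kwarg` is a hypothetical helper, and which spelling a given HanLP component accepts must be checked per component.

```python
def normalize_seq_len_kwarg(kwargs, expected="max_seq_length"):
    """Map either max_seq_len or max_seq_length onto the expected name."""
    aliases = {"max_seq_len", "max_seq_length"}
    values = {k: v for k, v in kwargs.items() if k in aliases}
    if len(set(values.values())) > 1:
        raise ValueError(f"conflicting sequence-length settings: {values}")
    cleaned = {k: v for k, v in kwargs.items() if k not in aliases}
    if values:
        cleaned[expected] = next(iter(values.values()))
    return cleaned

print(normalize_seq_len_kwarg({"max_seq_len": 126, "lr": 5e-5}))
# {'lr': 5e-05, 'max_seq_length': 126}
```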
Describe the current behavior WARNING Input tokens [...] exceed the max sequence length of 126. The exceeded part will be truncated and ignored. You are recommended to split your long text into several sentences within 126 tokens beforehand.
After the data is read as TSV, it is not subsequently split by max_seq_length.
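The warning itself recommends splitting long text into sentences within the length limit beforehand. A minimal stdlib sketch of that workaround follows; splitting on Chinese sentence-final punctuation and counting characters (rather than subword tokens) are simplifying assumptions, not HanLP's own tokenization, so in practice leave headroom below the model's real limit.

```python
import re

def split_for_max_len(text, max_len=126):
    """Greedily pack sentence fragments into chunks of at most max_len chars."""
    # Split after sentence-final punctuation, keeping the delimiter attached.
    parts = [p for p in re.split(r"(?<=[。！？；])", text) if p]
    chunks, current = [], ""
    for part in parts:
        if len(current) + len(part) <= max_len:
            current += part
        else:
            if current:
                chunks.append(current)
            # A single over-long fragment is hard-cut at max_len.
            while len(part) > max_len:
                chunks.append(part[:max_len])
                part = part[max_len:]
            current = part
    if current:
        chunks.append(current)
    return chunks

for chunk in split_for_max_len("今天天气很好。我们去公园。" * 20, max_len=20):
    print(len(chunk), chunk)
```

Each chunk can then be fed to the recognizer independently without triggering the truncation warning.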
Expected behavior A clear and concise description of what you expected to happen.
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS
Python version: 3.9
HanLP version: 2.1.0b50
Other info / logs: none