stanleylsx / entity_extractor_by_ner

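NER models built on TensorFlow 2.3, all following the CRF paradigm: Bilstm(IDCNN)-CRF, Bert-Bilstm(IDCNN)-CRF, and Bert-CRF. Pretrained models can be fine-tuned and adversarial training is supported. Intended for named entity recognition; runs directly once configured.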
390 stars 73 forks

Dataset size shown as 0 after importing self-annotated text #43

Closed LakersUpAma closed 2 years ago

LakersUpAma commented 2 years ago

When training on a self-built dataset, can there be more than three labels?

LakersUpAma commented 2 years ago

After defining custom labels, I get a KeyError.

Traceback (most recent call last):
  File "D:/PyCharm/PycharmProjects/entity_extractor_by_ner-master/main.py", line 72, in <module>
    train(configs, dataManager, logger)
  File "D:\PyCharm\PycharmProjects\entity_extractor_by_ner-master\engines\train.py", line 50, in train
    train_dataset, val_dataset = data_manager.get_training_set()
  File "D:\PyCharm\PycharmProjects\entity_extractor_by_ner-master\engines\data.py", line 249, in get_training_set
    df_train['label_id'] = df_train.label.map(lambda x: -1 if str(x) == str(np.nan) else self.label2id[x])
  File "D:\Anaconda\envs\tf2.3\lib\site-packages\pandas\core\series.py", line 3828, in map
    new_values = super()._map_values(arg, na_action=na_action)
  File "D:\Anaconda\envs\tf2.3\lib\site-packages\pandas\core\base.py", line 1300, in _map_values
    new_values = map_f(values, mapper)
  File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer
  File "D:\PyCharm\PycharmProjects\entity_extractor_by_ner-master\engines\data.py", line 249, in <lambda>
    df_train['label_id'] = df_train.label.map(lambda x: -1 if str(x) == str(np.nan) else self.label2id[x])
KeyError: 'B-se'

stanleylsx commented 2 years ago

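Did you configure the labels in the config file, or did you edit label2id yourself? If the latter, you do not need to change label2id; it is generated automatically from your dataset. If the former, there is probably a bug.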
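As a rough illustration of what "generated automatically from your dataset" can look like, here is a minimal sketch. It is not the repository's actual code: `build_label2id` is a hypothetical helper, and starting ids at 1 is an assumption made only for this example. Missing values are skipped, mirroring the NaN check visible in the traceback.

```python
import math


def build_label2id(labels):
    """Map each distinct label to an integer id, skipping missing values."""
    distinct = []
    for label in labels:
        # Skip missing values (None or NaN); they are not real labels.
        if label is None or (isinstance(label, float) and math.isnan(label)):
            continue
        if label not in distinct:
            distinct.append(label)
    # Assumption for this sketch: ids start at 1, in order of first appearance.
    return {label: idx for idx, label in enumerate(distinct, start=1)}


# Hypothetical label column with two entity types plus "O" and one missing value.
labels = ["B-se", "I-se", "O", float("nan"), "B-LOC", "I-LOC", "O"]
label2id = build_label2id(labels)
print(label2id)
```

In a sketch like this the mapping simply grows with the distinct labels found in the data, so nothing in it is tied to a fixed number of entity types.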

LakersUpAma commented 2 years ago

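Hello, I have not changed label2id. I annotated the labels myself in a txt file and then renamed the extension to produce the csv file, and that is when the KeyError appeared. However, when I copied part of the dataset you provide into my own data path, the KeyError still occurred even though the labels were still ORG, PER, and LOC; B-LOC could not be recognized. I am not sure whether this is related to renaming the txt file's extension.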

LakersUpAma commented 2 years ago

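Sorry, the KeyError is solved. The cause was that the label2id file generated by the first training run had not been deleted, so the labels no longer matched.

However, although the label2id file is now generated correctly and the token2id file is recognized, both the training set and the validation set are reported as size 0:

validating set is not exist, built...
training set size: 0, validating set size: 0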
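For anyone hitting the same KeyError, a quick way to confirm a stale mapping is to compare the labels in the data against the saved label2id before training. The sketch below is only an illustration: `find_unmapped_labels` is a hypothetical helper, and the mapping and label list are made-up values, not files from the repository.

```python
def find_unmapped_labels(dataset_labels, label2id):
    """Return the labels used in the data that the mapping does not contain."""
    return sorted({label for label in dataset_labels if label not in label2id})


# Hypothetical mapping left over from an earlier run on a different label set.
stale_label2id = {"O": 1, "B-ORG": 2, "I-ORG": 3, "B-PER": 4, "I-PER": 5}
# Hypothetical labels from the newly annotated data.
dataset_labels = ["B-se", "I-se", "O", "B-LOC", "I-LOC", "O"]

missing = find_unmapped_labels(dataset_labels, stale_label2id)
print(missing)
```

A non-empty result means the saved mapping predates the current data, and deleting the old label2id file so that it is regenerated is the fix that worked here.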

LakersUpAma commented 2 years ago

This problem is probably solved too; I think I simply had not annotated enough samples.

LakersUpAma commented 2 years ago

With 1,500 annotated samples I still get training set size: 0, validating set size: 0, yet training works with just 100 samples from the dataset you provide.

stanleylsx commented 2 years ago

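Hi, your dataset is too small and contains only a single block of text, so a validation set cannot be split off. Please follow the format of my dataset: provide more distinct texts for training, and separate different texts with a blank line.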
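To see why the blank lines matter, the sketch below counts how many samples a file yields when texts are separated by blank lines. It is an illustration under assumptions, not the repository's loader: `split_samples` is a hypothetical helper, and the "token label" layout with one token per line is assumed for the example.

```python
def split_samples(text):
    """Split token-per-line text into samples separated by blank lines."""
    samples, current = [], []
    for line in text.splitlines():
        if line.strip():
            # A non-blank line is one "token label" row of the current sample.
            current.append(line.strip())
        elif current:
            # A blank line closes the current sample.
            samples.append(current)
            current = []
    if current:
        samples.append(current)
    return samples


# The same four rows, once as a single block and once separated by a blank line.
one_block = "a B-LOC\nb I-LOC\nc O\nd O"
separated = "a B-LOC\nb I-LOC\n\nc O\nd O"

print(len(split_samples(one_block)))
print(len(split_samples(separated)))
```

However many tokens are annotated, a file with no blank lines counts as a single sample under this reading, which leaves nothing to split into training and validation sets.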