hankcs / HanLP

Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
https://hanlp.hankcs.com/en/
Apache License 2.0
33.73k stars 10.08k forks

mtl.evaluate raises AssertionError: No samples loaded #1704

Closed Yumeka999 closed 2 years ago

Yumeka999 commented 2 years ago

Describe the bug
mtl.evaluate raises AssertionError: No samples loaded

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

import os
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.common.transform import NormalizeCharacter
from hanlp.components.mtl.multi_task_learning import MultiTaskLearning
from hanlp.components.mtl.tasks.ner.tag_ner import TaggingNamedEntityRecognition
from hanlp.components.mtl.tasks.pos import TransformerTagging
from hanlp.components.mtl.tasks.tok.tag_tok import TaggingTokenization
from hanlp.layers.embeddings.contextual_word_embedding import ContextualWordEmbedding
from hanlp.layers.transformers.relative_transformer import RelativeTransformerEncoder
from hanlp.utils.lang.zh.char_table import HANLP_CHAR_TABLE_JSON
from hanlp.utils.log_util import cprint

root = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))

def cdroot():
    os.chdir(root)

F_LEARN_RATE = 1e-3
N_EPOCH = 1
N_BATCH_SIZE = 16
N_MAX_SEQ_LEN = 510

CTB8_CWS_TRAIN = r"/home/_tmp/data/ctb8_cn/tasks/cws/train.txt"
CTB8_CWS_DEV = r"/home/_tmp/data/ctb8_cn/tasks/cws/dev.txt"
CTB8_CWS_TEST = r"/home/_tmp/data/ctb8_cn/tasks/cws/test.txt"

CTB8_POS_TRAIN = r"/home/_tmp/data/ctb8_cn/tasks/pos/train.txt"
CTB8_POS_DEV = r"/home/_tmp/data/ctb8_cn/tasks/pos/dev.txt"
CTB8_POS_TEST = r"/home/_tmp/data/ctb8_cn/tasks/pos/test.txt"

S_NER_DATA_DIR = r"/home/_tmp/data/msra_ner_token_level_cn/"
MSRA_NER_TOKEN_LEVEL_SHORT_IOBES_TRAIN = os.path.join(S_NER_DATA_DIR, "word_level.train.tsv")
MSRA_NER_TOKEN_LEVEL_SHORT_IOBES_DEV = os.path.join(S_NER_DATA_DIR, "word_level.dev.tsv")
MSRA_NER_TOKEN_LEVEL_SHORT_IOBES_TEST = os.path.join(S_NER_DATA_DIR, "word_level.test.tsv")

S_SAV_MODEL_DIR = '/home/_tmp/data/model/mtl/open_tok_pos_ner_srl_dep_sdp_con_electra_small'
S_PERTRAINED_DIR = r"/home/data_sync/pretrain_model/hfl__chinese-electra-180g-small-discriminator"

print(CTB8_CWS_TRAIN)
print(CTB8_CWS_DEV)
print(CTB8_CWS_TEST)
print()
print(CTB8_POS_TRAIN)
print(CTB8_POS_DEV)
print(CTB8_POS_TEST)
print()
print(MSRA_NER_TOKEN_LEVEL_SHORT_IOBES_TRAIN)
print(MSRA_NER_TOKEN_LEVEL_SHORT_IOBES_DEV)
print(MSRA_NER_TOKEN_LEVEL_SHORT_IOBES_TEST)

tasks = {
    'tok': TaggingTokenization(  # tokenization
        CTB8_CWS_TRAIN,
        CTB8_CWS_DEV,
        CTB8_CWS_TEST,
        SortingSamplerBuilder(batch_size=N_BATCH_SIZE),
        max_seq_len=N_MAX_SEQ_LEN,
        hard_constraint=True,
        char_level=True,
        tagging_scheme='BMES',
        lr=F_LEARN_RATE,
        transform=NormalizeCharacter(HANLP_CHAR_TABLE_JSON, 'token'),
    ),
    # 'pos': TransformerTagging(  # POS tagging
    #     CTB8_POS_TRAIN,
    #     CTB8_POS_DEV,
    #     CTB8_POS_TEST,
    #     SortingSamplerBuilder(batch_size=N_BATCH_SIZE),
    #     hard_constraint=True,
    #     max_seq_len=N_MAX_SEQ_LEN,
    #     char_level=True,
    #     dependencies='tok',
    #     lr=1e-3,
    # ),
    'ner': TaggingNamedEntityRecognition(  # NER
        MSRA_NER_TOKEN_LEVEL_SHORT_IOBES_TRAIN,
        MSRA_NER_TOKEN_LEVEL_SHORT_IOBES_DEV,
        MSRA_NER_TOKEN_LEVEL_SHORT_IOBES_TEST,
        SortingSamplerBuilder(batch_size=N_BATCH_SIZE),
        lr=F_LEARN_RATE,
        # secondary_encoder=RelativeTransformerEncoder(768, k_as_x=True),  # base
        secondary_encoder=RelativeTransformerEncoder(256, k_as_x=True),   # small
        dependencies='tok',
    ),
    # 'dep': BiaffineDependencyParsing(
    #     CTB8_SD330_TRAIN,
    #     CTB8_SD330_DEV,
    #     CTB8_SD330_TEST,
    #     SortingSamplerBuilder(batch_size=32),
    #     lr=1e-3,
    #     tree=True,
    #     punct=True,
    #     dependencies='tok',
    # ),
    # 'con': CRFConstituencyParsing(
    #     CTB8_BRACKET_LINE_NOEC_TRAIN,
    #     CTB8_BRACKET_LINE_NOEC_DEV,
    #     CTB8_BRACKET_LINE_NOEC_TEST,
    #     SortingSamplerBuilder(batch_size=32),
    #     lr=1e-3,
    #     dependencies='tok',
    # )
}

mtl = MultiTaskLearning()

mtl.load(S_SAV_MODEL_DIR)
print(mtl('华纳音乐旗下的新垣结衣在12月21日于日本武道馆举办歌手出道活动'))
# NOTE: this loop is a no-op — it copies each task's splits onto itself.
for k, v in tasks.items():
    v.trn = tasks[k].trn
    v.dev = tasks[k].dev
    v.tst = tasks[k].tst
metric, *_ = mtl.evaluate(S_SAV_MODEL_DIR)
for k, v in tasks.items():
    print(metric[k], end=' ')
print()

Describe the current behavior
Running it fails with:

  File "/home/xy/miniconda3/envs/py364_xy/lib/python3.6/site-packages/hanlp/common/dataset.py", line 129, in __init__
    assert data, 'No samples loaded'
AssertionError: No samples loaded

Expected behavior
The program prints P/R/F scores on the dataset splits normally.

System information

Other info / logs /home/_tmp/data/ctb8_cn/tasks/cws/train.txt /home/_tmp/data/ctb8_cn/tasks/cws/dev.txt /home/_tmp/data/ctb8_cn/tasks/cws/test.txt

/home/_tmp/data/ctb8_cn/tasks/pos/train.txt /home/_tmp/data/ctb8_cn/tasks/pos/dev.txt /home/_tmp/data/ctb8_cn/tasks/pos/test.txt

/home/_tmp/data/msra_ner_token_level_cn/word_level.train.tsv /home/_tmp/data/msra_ner_token_level_cn/word_level.dev.tsv /home/_tmp/data/msra_ner_token_level_cn/word_level.test.tsv

{
  "tok": ["华纳", "音乐", "旗下", "的", "新垣结衣", "在", "12月", "21日", "于", "日本", "武道馆", "举办", "歌手", "出道", "活动"],
  "ner": [["华纳音乐", "ORGANIZATION", 0, 2], ["新垣结衣", "PERSON", 4, 5], ["12月", "DATE", 6, 7], ["21日", "DATE", 7, 8], ["日本", "LOCATION", 9, 10], ["武道馆", "LOCATION", 10, 11]]
}

1 / 2 Building tst dataset for ner ...
Traceback (most recent call last):
  File "hanlp_traincn.py", line 134, in <module>
    metric, *_ = mtl.evaluate(S_SAV_MODEL_DIR)
  File "/home/xy/miniconda3/envs/py364_xy/lib/python3.6/site-packages/hanlp/components/mtl/multi_task_learning.py", line 753, in evaluate
    rets = super().evaluate('tst', save_dir, logger, batch_size, output, **kwargs)
  File "/home/xy/miniconda3/envs/py364_xy/lib/python3.6/site-packages/hanlp/common/torch_component.py", line 469, in evaluate
    device=self.devices[0], logger=logger, overwrite=True))
  File "/home/xy/miniconda3/envs/py364_xy/lib/python3.6/site-packages/hanlp/components/mtl/multi_task_learning.py", line 156, in build_dataloader
    cache=isinstance(data, str), **config)
  File "/home/xy/miniconda3/envs/py364_xy/lib/python3.6/site-packages/hanlp/components/mtl/tasks/ner/tag_ner.py", line 123, in build_dataloader
    dataset = self.build_dataset(data, cache=cache, transform=transform, **args)
  File "/home/xy/miniconda3/envs/py364_xy/lib/python3.6/site-packages/hanlp/components/ner/transformer_ner.py", line 216, in build_dataset
    dataset = super().build_dataset(data, transform, **kwargs)
  File "/home/xy/miniconda3/envs/py364_xy/lib/python3.6/site-packages/hanlp/components/taggers/transformers/transformer_tagger.py", line 170, in build_dataset
    return TSVTaggingDataset(data, transform=transform, **kwargs)
  File "/home/xy/miniconda3/envs/py364_xy/lib/python3.6/site-packages/hanlp/datasets/ner/tsv.py", line 45, in __init__
    super().__init__(data, transform, cache, generate_idx)
  File "/home/xy/miniconda3/envs/py364_xy/lib/python3.6/site-packages/hanlp/common/dataset.py", line 129, in __init__
    assert data, 'No samples loaded'
AssertionError: No samples loaded
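For reference, the assertion fires when the dataset file yields zero sentences. A quick standalone way to check whether a CoNLL-style TSV file parses into any samples at all (this mirrors the blank-line-separated `token<TAB>tag` format the TSV NER datasets use, but the parser below is my own sketch, not HanLP code):

```python
import os
import tempfile

def load_tsv_sentences(path):
    """Parse a CoNLL-style TSV file: one `token<TAB>tag` pair per line,
    sentences separated by blank lines. Returns a list of sentences."""
    sentences, current = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:  # blank line ends the current sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            current.append(tuple(line.split('\t')))
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences

# Demo on a tiny synthetic file (stands in for word_level.test.tsv)
sample = "华纳\tB-ORG\n音乐\tE-ORG\n\n日本\tS-LOC\n"
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'word_level.test.tsv')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    sentences = load_tsv_sentences(path)
    assert sentences, 'No samples loaded'  # the check that fails in dataset.py
    print(len(sentences))  # → 2
```

If a check like this returns an empty list for the file being evaluated, the failure above is expected: the file is empty, uses a different delimiter, or is not in the blank-line-separated format.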

hankcs commented 2 years ago

Cannot reproduce. I suggest using the corpora provided by HanLP. Figure out the format first before using your own data.

Yumeka999 commented 2 years ago

> Cannot reproduce. I suggest using the corpora provided by HanLP. Figure out the format first before using your own data.

Teacher He, this data is from the corpora HanLP provides. It's just that CTB8_CWS_TRAIN was originally a URL, and I downloaded the file behind the URL to a local directory and extracted it.

The dataset files /home/_tmp/data/ctb8_cn/tasks/cws/train.txt, /home/_tmp/data/ctb8_cn/tasks/pos/train.txt, and /home/_tmp/data/msra_ner_token_level_cn/word_level.train.tsv were obtained by downloading and extracting the URLs behind MSRA_NER_TOKEN_LEVEL_SHORT_IOBES_TRAIN and CTB8_POS_TRAIN from

from hanlp.datasets.ner.msra import MSRA_NER_TOKEN_LEVEL_SHORT_IOBES_TRAIN
from hanlp.datasets.parsing.ctb8 import CTB8_POS_TRAIN

namely

https://wakespace.lib.wfu.edu/bitstream/handle/10339/39379/LDC2013T21.tgz#data/tasks/cws/train.txt
http://file.hankcs.com/corpus/msra_ner_token_level.zip#word_level.train.short.tsv

This is the data from hanlp_demo. The problem above occurs when running with the data from the hanlp_demo repo on GitHub.
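The constants quoted above are URLs of the form `archive#member`, where the fragment names the file expected inside the downloaded archive. A minimal sketch of splitting such a URL (the helper name `member_path` is mine, not a HanLP API):

```python
from urllib.parse import urlparse

def member_path(dataset_url):
    """Split a dataset URL of the form `archive_url#member` into the
    archive URL and the member file expected inside the archive."""
    parsed = urlparse(dataset_url)
    archive = parsed._replace(fragment='').geturl()
    return archive, parsed.fragment

url = ("http://file.hankcs.com/corpus/msra_ner_token_level.zip"
       "#word_level.train.short.tsv")
archive, member = member_path(url)
print(archive)  # → http://file.hankcs.com/corpus/msra_ner_token_level.zip
print(member)   # → word_level.train.short.tsv
```

Note that the MSRA constant's fragment names word_level.train.short.tsv, while the script above reads word_level.train.tsv; a mismatch between the extracted filename and the one the constant points to is one way a manually prepared path can end up loading zero samples.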

hankcs commented 2 years ago

Cannot reproduce. Do not download HanLP's data manually, because you don't know how these corpora need to be preprocessed. HanLP will download and preprocess them for you automatically. If you can't set up the environment, use Colab: https://colab.research.google.com/drive/1qE2OkSTluMWZjfrj01iWrDOO8HisAmBk?usp=sharing

Yumeka999 commented 2 years ago

OK, thanks a lot.