luopeixiang / named_entity_recognition

Chinese named entity recognition (with concrete implementations of several models: HMM, CRF, BiLSTM, BiLSTM+CRF)

potential fix in build_corpus #33

Open mikelty opened 3 years ago

mikelty commented 3 years ago

I changed an if-else block to a try-except block and it worked. Machine: Windows 10, Python 3.7. I also needed to install an additional sklearn package after installing requirements.txt. I think the failure comes from a syntactic difference between the BMES format and how the file is read on Windows, but I'm not sure.

from os.path import join  # build_map is a helper defined elsewhere in the repo


def build_corpus(split, make_vocab=True, data_dir="./ResumeNER"):
    """Read the data."""
    assert split in ['train', 'dev', 'test']

    word_lists = []
    tag_lists = []
    with open(join(data_dir, split+".char.bmes"), 'r', encoding='utf-8') as f:
        word_list = []
        tag_list = []
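        for line in f:
            try:
                # Each data line holds one character and its tag
                word, tag = line.strip('\n').split()
                word_list.append(word)
                tag_list.append(tag)
            except ValueError:
                # A line that does not split into exactly two fields
                # (e.g. a blank separator) ends the current sentence
                word_lists.append(word_list)
                tag_lists.append(tag_list)
                word_list = []
                tag_list = []
        if word_list:
            # Keep the last sentence if the file has no trailing blank line
            word_lists.append(word_list)
            tag_lists.append(tag_list)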

    # If make_vocab is True, also return word2id and tag2id
    if make_vocab:
        word2id = build_map(word_lists)
        tag2id = build_map(tag_lists)
        return word_lists, tag_lists, word2id, tag2id
    else:
        return word_lists, tag_lists
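
One possible explanation, not confirmed in this thread: if the data files end up with CRLF line endings on Windows and the reader does not translate newlines, a blank separator line arrives as '\r\n' rather than '\n'. Assuming the original if-else compared the line against '\n', that check would miss the separator and the two-value unpacking of an empty split would raise ValueError. The sketch below avoids both the comparison and the try-except by splitting first and counting fields. `read_bmes` is a hypothetical helper name, not part of the repo.

```python
def read_bmes(lines):
    """Hypothetical helper: group "char tag" lines into sentences.

    A line with no fields after whitespace splitting (including one that
    holds only a carriage return from a CRLF file) ends the current sentence.
    """
    word_lists, tag_lists = [], []
    word_list, tag_list = [], []
    for line in lines:
        parts = line.split()  # split() drops '\r', '\n' and other whitespace
        if len(parts) == 2:
            word_list.append(parts[0])
            tag_list.append(parts[1])
        elif not parts and word_list:
            # Blank separator: close the sentence, ignoring repeated blanks
            word_lists.append(word_list)
            tag_lists.append(tag_list)
            word_list, tag_list = [], []
        # Lines with one field or more than two are skipped (an assumption;
        # raising an error instead may be preferable for strict validation)
    if word_list:
        # Keep the final sentence when there is no trailing blank line
        word_lists.append(word_list)
        tag_lists.append(tag_list)
    return word_lists, tag_lists


# Usage with CRLF-terminated sample data
sample = "高 B-NAME\r\n勇 E-NAME\r\n\r\n男 O\r\n"
words, tags = read_bmes(sample.splitlines(keepends=True))
print(words)  # [['高', '勇'], ['男']]
print(tags)   # [['B-NAME', 'E-NAME'], ['O']]
```

Inside build_corpus, the same loop could be fed the open file object directly, since iterating a file yields one line at a time.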