microsoft / vert-papers

This repository contains code and datasets related to entity/knowledge papers from the VERT (Versatile Entity Recognition & disambiguation Toolkit) project, by the Knowledge Computing group at Microsoft Research Asia (MSRA).

CAN-NER data_process.py segmentation part #10

Closed: lvjiujin closed this issue 3 years ago

lvjiujin commented 3 years ago

```python
for r in sentences:
    lines = r.split('\n')
    for index, line in enumerate(lines):
        text = line.strip().split("\t")
        # text = line.strip().split(" ")

        word = text[0][0]
        seg = text[0][1]
        tag = text[1]
        word = normalize_word(word)
        if word not in word2id:
            # if word is not in word2id, map it to <UNK>; the id of <UNK> is 1
            word2id[word] = 1
        if tag not in tag2id:
            print(tag)
            tag2id[tag] = len(tag2id)

        if index == len(lines) - 1:
            # I don't understand the segmentation handling for the last line here?
            if seg == "0":
                seg = 4
            else:
                seg = 3
            sent.append([word2id[word], tag2id[tag], seg])
            ret = getDataFromSent_with_seg_test(sent)
            rs.append(ret)
            sent = []
        else:
            next_seg = lines[index + 1].split(" ")[0][1]
            if seg == "0":
                if next_seg == "0":
                    seg = 4
                else:
                    seg = 1
            else:
                if next_seg == "0":
                    seg = 3
                else:
                    seg = 2
            sent.append([word2id[word], tag2id[tag], seg])
```

I don't quite understand the word-segmentation handling in data_process.py, i.e. the code above: when we reach the last line of a sentence, seg == '0' is replaced with 4, otherwise with 3. If it is not the last line and seg is '0', the code looks at the next character's seg: if that is also '0', seg becomes 4, otherwise 1, and so on. What is the logic here? Could you explain it?

slb9712 commented 3 years ago

Friend, what format is the dataset you used when running this? With the dataset from before, I get an index-out-of-range error at `seg = text[0][1]`; my format seems to be wrong. If you managed to run it, I hope you can explain, or share your dataset.
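For reference, the parsing in the snippet above seems to expect each non-empty line to pack a single character and a one-digit segmentation flag into the first tab-separated field, with the NER tag in the second field; if that flag is missing, `text[0][1]` goes out of range. A hypothetical line (made up for illustration, not taken from any released dataset) would parse like this:

```python
# Hypothetical input line: character + one-digit segmentation flag in the
# first tab-separated field, NER tag in the second field.
line = "中0\tB-LOC"

text = line.strip().split("\t")
word = text[0][0]   # "中"    -> the character itself
seg = text[0][1]    # "0"     -> the word-segmentation flag (missing flag raises IndexError)
tag = text[1]       # "B-LOC" -> the NER label
print(word, seg, tag)
```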

wgx998877 commented 3 years ago

The seg info seems to be mapped to the "BIES" tags (begin, inside, end and single), and from this code section, 4 probably means S, i.e. a single-character word. You can double-check against the full implementation.
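If that reading is right, the raw flag ("0" vs. not "0") marks whether a character starts a new word, and the loop converts it into a BIES code (1 = B, 2 = I, 3 = E, 4 = S) by also peeking at the next character's flag. Below is a minimal, self-contained sketch of that conversion on made-up flags; the helper name and the example sentence are mine, not from the repo, so treat it as an illustration rather than the reference implementation.

```python
# Sketch of the seg conversion under the assumption that the raw flag is "0"
# when a character begins a new word and "1" otherwise (assumption, not confirmed).
def to_bies(flags):
    codes = []
    for i, f in enumerate(flags):
        last = (i == len(flags) - 1)
        if last:
            codes.append(4 if f == "0" else 3)             # single-char word, or end of the final word
        elif f == "0":
            codes.append(4 if flags[i + 1] == "0" else 1)  # single-char word, or begin of a word
        else:
            codes.append(3 if flags[i + 1] == "0" else 2)  # end of a word, or inside a word
    return codes

# Hypothetical sentence "北京 / 欢迎 / 你" -> raw flags 0,1,0,1,0
print(to_bies(["0", "1", "0", "1", "0"]))  # [1, 3, 1, 3, 4], i.e. B E B E S
```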