microsoft / vert-papers

This repository contains code and datasets related to entity/knowledge papers from the VERT (Versatile Entity Recognition & disambiguation Toolkit) project, by the Knowledge Computing group at Microsoft Research Asia (MSRA).

CAN-NER data_process.py segmentation part #10

Closed: lvjiujin closed this issue 3 years ago

lvjiujin commented 3 years ago

```python
for r in sentences:
    lines = r.split('\n')
    for index, line in enumerate(lines):
        text = line.strip().split("\t")
        # text = line.strip().split(" ")

        word = text[0][0]
        seg = text[0][1]
        tag = text[1]
        word = normalize_word(word)
        if word not in word2id:
            # if word is not in word2id, map it to <UNK>; the id of <UNK> is 1
            word2id[word] = 1
        if tag not in tag2id:
            print(tag)
            tag2id[tag] = len(tag2id)

        if index == len(lines) - 1:
            # I don't understand the segmentation handling for the last line here?
            if seg == "0":
                seg = 4
            else:
                seg = 3
            sent.append([word2id[word], tag2id[tag], seg])
            ret = getDataFromSent_with_seg_test(sent)
            rs.append(ret)
            sent = []
        else:
            next_seg = lines[index + 1].split(" ")[0][1]
            if seg == "0":
                if next_seg == "0":
                    seg = 4
                else:
                    seg = 1
            else:
                if next_seg == "0":
                    seg = 3
                else:
                    seg = 2
            sent.append([word2id[word], tag2id[tag], seg])
```

I don't quite understand the word-segmentation handling in data_process.py, i.e. the code above: when we reach the last line of a sentence, seg == '0' is replaced with 4, otherwise with 3. If it is not the last line and seg is '0', the code looks at the next character's seg: if that is also '0', seg becomes 4, otherwise 1, and so on. What is the logic here? Could you explain it?

slb9712 commented 3 years ago

Friend, what format is the dataset you used when running this? With the dataset from before, I get an index-out-of-range error at `seg = text[0][1]`; my format seems to be wrong. If you managed to run it, I hope you can explain, or share your dataset.
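For reference, the parsing in the snippet above seems to expect each non-empty line to pack a single character and a one-digit segmentation flag into the first tab-separated field, with the NER tag in the second field; if that flag is missing, `text[0][1]` goes out of range. A hypothetical line (made up for illustration, not taken from any released dataset) would parse like this:

```python
# Hypothetical input line: character + one-digit segmentation flag in the
# first tab-separated field, NER tag in the second field.
line = "中0\tB-LOC"

text = line.strip().split("\t")
word = text[0][0]   # "中"    -> the character itself
seg = text[0][1]    # "0"     -> the word-segmentation flag (missing flag raises IndexError)
tag = text[1]       # "B-LOC" -> the NER label
print(word, seg, tag)
```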

wgx998877 commented 3 years ago

The seg info seems to be mapped to the "BIES" tags (begin, inside, end and single), and from this code section, 4 probably means S, i.e. a single-character word. You can double-check against the full implementation.
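If that reading is right, the raw flag ("0" vs. not "0") marks whether a character starts a new word, and the loop converts it into a BIES code (1 = B, 2 = I, 3 = E, 4 = S) by also peeking at the next character's flag. Below is a minimal, self-contained sketch of that conversion on made-up flags; the helper name and the example sentence are mine, not from the repo, so treat it as an illustration rather than the reference implementation.

```python
# Sketch of the seg conversion under the assumption that the raw flag is "0"
# when a character begins a new word and "1" otherwise (assumption, not confirmed).
def to_bies(flags):
    codes = []
    for i, f in enumerate(flags):
        last = (i == len(flags) - 1)
        if last:
            codes.append(4 if f == "0" else 3)             # single-char word, or end of the final word
        elif f == "0":
            codes.append(4 if flags[i + 1] == "0" else 1)  # single-char word, or begin of a word
        else:
            codes.append(3 if flags[i + 1] == "0" else 2)  # end of a word, or inside a word
    return codes

# Hypothetical sentence "北京 / 欢迎 / 你" -> raw flags 0,1,0,1,0
print(to_bies(["0", "1", "0", "1", "0"]))  # [1, 3, 1, 3, 4], i.e. B E B E S
```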