tsujuifu / pytorch_graph-rel

A PyTorch implementation of GraphRel
MIT License
268 stars · 54 forks

code details #6

Open · niartnelis opened this issue 4 years ago

niartnelis commented 4 years ago

I didn't get the data in pkl format, so I want to ask about some details in the code:

  1. In the dataloader, `for idx, inp, pos, dep_fw, dep_bw, ans_ne, wgt_ne, ans_rel, wgt_rel in ld_ts` — what are the meanings of these variables?
  2. Is pos something pre-computed for the sentence? Is this part-of-speech tagging?
  3. The adjacency matrix in the GCN should capture the dependencies between related words. Is this the syntactic dependency obtained using spaCy?
tsujuifu commented 4 years ago

Hi, thanks for your interest in my work.

Sorry, the preprocessing code and data are missing. (I left my previous lab and forgot to back them up.) I will reproduce and update them later. (Since I just started my PhD and these days are quite busy, maybe after CVPR or ACL.)

Let me try to clarify your questions here:

  1. For `for idx, inp, pos, dep_fw, dep_bw, ans_ne, wgt_ne, ans_rel, wgt_rel in ld_ts`
     (BS: batch_size, SL: sentence_length, ED: embedding_dimension):
     idx: not so important, you can just ignore it
     inp: the input sentence ([BS, SL, ED])
     pos: the part-of-speech tag of each word ([BS, SL])
     dep_fw: the dependency adjacency matrix (forward edges) of each word pair ([BS, SL, SL])
     dep_bw: the dependency adjacency matrix (backward edges) of each word pair ([BS, SL, SL])
     ans_ne, ans_rel: the output named-entity tag of each word and the relation of each word pair ([BS, SL] and [BS, SL, SL])
     wgt_ne, wgt_rel: the loss weights for the named entity of each word and the relation of each word pair, 1 for positions containing a named entity or relation, otherwise 0 ([BS, SL] and [BS, SL, SL])

  2. pos is the part-of-speech tag of each word, and it comes from spaCy (https://spacy.io/usage/linguistic-features#pos-tagging).

  3. The dependency parsing also comes from spaCy (https://spacy.io/usage/linguistic-features#dependency-parse); please note that a dependency tree gives both forward and backward edges. See the sketch below for how such matrices can be built.
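
For illustration, here is a minimal sketch (not from the repository) of how per-sentence pos, dep_fw, and dep_bw with these shapes could be built with spaCy; the model name, the padding length SL, and the zero-padding scheme are my own assumptions:

```python
# Hypothetical sketch: build pos / dep_fw / dep_bw for one sentence with spaCy.
# Model name and padding length (SL) are assumptions, not from the original code.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed spaCy model

def parse_sentence(sentence, SL=120):
    doc = nlp(sentence)
    n = min(len(doc), SL)
    pos = np.zeros(SL, dtype=np.int64)            # [SL]
    dep_fw = np.zeros((SL, SL), dtype=np.int64)   # [SL, SL], forward edges
    dep_bw = np.zeros((SL, SL), dtype=np.int64)   # [SL, SL], backward edges
    for token in doc[:n]:
        pos[token.i] = token.pos                  # spaCy POS id (remap to small ints before embedding)
        head = token.head.i
        if head >= SL:
            continue                              # head was truncated away
        if token.i >= head:                       # edge pointing forward in the sentence
            dep_fw[token.i, head] = token.dep     # spaCy dependency-label id
        else:                                     # edge pointing backward
            dep_bw[token.i, head] = token.dep
    return pos, dep_fw, dep_bw

# Stacking the per-sentence arrays over a batch gives pos [BS, SL] and
# dep_fw / dep_bw [BS, SL, SL], matching the shapes described above.
```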

niartnelis commented 4 years ago

Thank you for your reply!!!


akashicMarga commented 4 years ago

@niartnelis did you figure out the data preprocessing part and the dataset format?

LuoXukun commented 4 years ago

Thank you very much! I think I can implement it now.

LuoXukun commented 4 years ago

Can you tell me what -1 means in wgt_ne and wgt_rel?

zhhhzhang commented 4 years ago

> Can you tell me what -1 means in wgt_ne and wgt_rel?

Hi, could you share the preprocessing code and data? Thanks!

zhhhzhang commented 4 years ago


Hi, have you reproduced the preprocessing code and data yet? Could you share it? Thanks!

zhhhzhang commented 4 years ago

Thanks for your reply! My email is zhhhzhang@foxmail.com

> [from LuoXukun's reply] I am sorry that I haven't evaluated whether my reproduction is correct, and I will share it with you after I finish it.

nttmac commented 4 years ago


Hi, have you reproduced it? I have a question: in the second phase there are num_rel GCNs, and summing them up makes the values very large. Won't the training blow up?

zhhhzhang commented 4 years ago

Not yet, I still haven't gotten the data [facepalm]. Do you have it? Could you send me a copy?


LuoXukun commented 4 years ago

I have reproduced a dataset in JSON format with the following code, which is not guaranteed to be the same as the author's implementation.

```python
# These functions are methods of a preprocessing class that was not shared in full;
# self.input_path and self.output_path are attributes of that class (see the notes below).
import json
import os
import unicodedata

import spacy


def tag_graphrel(self, relation_id_path, isTrain=True):
    """Tag the source NYT json for GraphRel.

    Args:
        isTrain:            Train is True and test is False. Please ignore it.
        relation_id_path:   The relation-id dict file path.
    Return:
        Writes one json object per sentence to the output file:
        [{
            text:       [seq_length].               The word list of the sentence.
            pos:        [seq_length].               The part-of-speech tag of each word.
            dep_fw:     [seq_length, seq_length].   The dependency adjacency matrix (forward edges) of each word pair.
            dep_bw:     [seq_length, seq_length].   The dependency adjacency matrix (backward edges) of each word pair.
            ans_ne:     [seq_length].               The output named-entity tag of each word.
            ans_rel:    [seq_length, seq_length].   The output relation of each word pair.
            # wgt_ne:   [seq_length].               The loss weight of the named entity of each word, 1 for positions containing a named entity or relation, otherwise 0 (not generated; see the notes below).
            # wgt_rel:  [seq_length, seq_length].   The loss weight of the relation of each word pair, 1 for positions containing a named entity or relation, otherwise 0 (not generated; see the notes below).
            relationMentions:                       The gold relational triples. [{"label", "label_id", "em1Text", "em2Text"}]
        }]
    """
    # spaCy model
    nlp = spacy.load('en')

    datas = []
    EpochCount = 0
    relation_ids = {"PAD": 0, "None": 1}  # 0 for padding and 1 for the None relation

    # Load the data from the source json.
    with open(self.input_path, "r", encoding="utf-8") as fr:
        for line in fr.readlines():
            line = json.loads(line)
            datas.append(line)

    # Get the relation-id dict.
    if relation_id_path is None:
        print("Please provide the relation_id file path!")
        exit()
    if os.path.exists(relation_id_path):
        print("The relation_id file already exists, let's use it!")
        with open(relation_id_path, mode="r", encoding="utf-8") as f:
            for line_id, line in enumerate(f):
                relation_ids = json.loads(line)
    else:
        print("There is no relation_id file, let's build it from the dataset!")
        for data in datas:
            for relation in data["relationMentions"]:
                if self.normalize_text(relation["label"]) != "None":
                    if relation["label"] not in relation_ids.keys():
                        relation_ids[relation["label"]] = len(relation_ids)
        with open(relation_id_path, mode="w", encoding="utf-8") as f:
            relation_ids_str = json.dumps(relation_ids, ensure_ascii=False)
            f.write(relation_ids_str + "\n")

    print("The number of relations: ", len(relation_ids))
    print("Relations to id: ", relation_ids)

    fw = open(self.output_path, "w+", encoding="utf-8")

    for data in datas:
        EpochCount += 1
        text_tag = {}

        sentText = self.normalize_text(data["sentText"]).rstrip('\n').rstrip("\r")
        sentDoc = nlp(sentText)
        # text: [seq_length]. The word list of the sentence.
        sentWords = [token.text for token in sentDoc]
        text_tag["text"] = sentWords

        text_tag["pos"] = []
        text_tag["dep_fw"] = [[-1] * len(sentWords) for i in range(len(sentWords))]
        text_tag["dep_bw"] = [[-1] * len(sentWords) for i in range(len(sentWords))]
        for token in sentDoc:
            # pos: [seq_length]. The part-of-speech tag of each word.
            text_tag["pos"].append(token.pos)
            # dep_fw / dep_bw: [seq_length, seq_length]. The dependency adjacency
            # matrices (forward / backward edges) of each word pair.
            if token.i >= token.head.i:
                text_tag["dep_fw"][token.i][token.head.i] = token.dep
            else:
                text_tag["dep_bw"][token.i][token.head.i] = token.dep

        # ans_ne: [seq_length]. The output named-entity tag of each word.
        text_tag["ans_ne"] = ["O"] * len(sentWords)
        for entity in data["entityMentions"]:
            entity_doc = nlp(self.normalize_text(entity["text"]))
            entity_list = [token.text for token in entity_doc]
            entity_idxs = self.find_all_index(sentWords, entity_list)
            for index in entity_idxs:
                if index[1] - index[0] == 1:
                    text_tag["ans_ne"][index[0]] = "S-" + entity["label"]
                elif index[1] - index[0] == 2:
                    text_tag["ans_ne"][index[0]] = "B-" + entity["label"]
                    text_tag["ans_ne"][index[1] - 1] = "E-" + entity["label"]
                elif index[1] - index[0] > 2:
                    for i in range(index[0], index[1]):
                        text_tag["ans_ne"][i] = "I-" + entity["label"]
                    text_tag["ans_ne"][index[0]] = "B-" + entity["label"]
                    text_tag["ans_ne"][index[1] - 1] = "E-" + entity["label"]

        # ans_rel: [seq_length, seq_length]. The output relation of each word pair.
        # relationMentions: The gold relational triples.
        text_tag["ans_rel"] = [[1] * len(sentWords) for i in range(len(sentWords))]
        text_tag["relationMentions"] = []
        for relation in data["relationMentions"]:
            entity1_list = [token.text for token in nlp(self.normalize_text(relation["em1Text"]))]
            entity2_list = [token.text for token in nlp(self.normalize_text(relation["em2Text"]))]
            entity1_idxs = self.find_all_index(sentWords, entity1_list)
            entity2_idxs = self.find_all_index(sentWords, entity2_list)

            for en1_idx in entity1_idxs:
                for en2_idx in entity2_idxs:
                    for i in range(en1_idx[0], en1_idx[1]):
                        for j in range(en2_idx[0], en2_idx[1]):
                            text_tag["ans_rel"][i][j] = relation_ids[relation["label"]]

            relation_item = {}
            if self.normalize_text(relation["label"]) != "None":
                relation_item["label"] = relation["label"]
                relation_item["label_id"] = relation_ids[relation["label"]]
                relation_item["em1Text"] = entity1_list
                relation_item["em2Text"] = entity2_list
                text_tag["relationMentions"].append(relation_item)

        if EpochCount % 10000 == 0:
            print("Epoch ", EpochCount)

        fw.write(json.dumps(text_tag, ensure_ascii=False) + '\n')

    fw.close()
    print("Successfully transferred the file!\n")
    return


def normalize_text(self, text):
    """Normalize a unicode string to plain ASCII.

    Args:
        text: unicode string
    """
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')


def find_all_index(self, sen_split, word_split):
    """Find all locations of an entity in the sentence.

    Args:
        sen_split:  the sentence word list.
        word_split: the entity word list.
    Return:
        index_list: the list of (start, end) index pairs.
    """
    start, end, offset = -1, -1, 0
    index_list = []
    while True:
        if len(index_list) != 0:
            offset = index_list[-1][1]
        start, end = self.find_index(sen_split[offset:], word_split)
        if start == -1 and end == -1:
            break
        if end <= start:
            break
        start += offset
        end += offset
        index_list.append((start, end))
    return index_list


def find_index(self, sen_split, word_split):
    """Find the first location of an entity in the sentence.

    Args:
        sen_split:  the sentence word list.
        word_split: the entity word list.
    Return:
        index1: start index
        index2: end index (exclusive)
    """
    index1 = -1
    index2 = -1
    for i in range(len(sen_split)):
        if str(sen_split[i]) == str(word_split[0]):
            flag = True
            k = i
            for j in range(len(word_split)):
                if word_split[j] != sen_split[k]:
                    flag = False
                if k < len(sen_split) - 1:
                    k += 1
            if flag:
                index1 = i
                index2 = i + len(word_split)
                break
    return index1, index2
```
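
These functions take self, but the surrounding class was not shared. A minimal hypothetical wrapper (the class name NYTPreprocessor and the file names below are my own assumptions, not part of LuoXukun's code) could look like this:

```python
# Hypothetical wrapper class -- not part of the original code. It only supplies
# self.input_path / self.output_path and binds the four methods defined above.
class NYTPreprocessor:
    def __init__(self, input_path, output_path):
        self.input_path = input_path    # original NYT json, one sample per line
        self.output_path = output_path  # where the tagged json lines are written

    tag_graphrel = tag_graphrel
    normalize_text = normalize_text
    find_all_index = find_all_index
    find_index = find_index


# Example usage (file names are placeholders):
# pre = NYTPreprocessor("nyt_train.json", "train_graphrel.json")
# pre.tag_graphrel("relation2id.json")
```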

Note that:
(1) I do not generate wgt_ne and wgt_rel, since you can handle this through the weight argument of the loss function nn.CrossEntropyLoss(); you can check the documentation yourself.
(2) You need to re-index dep_fw and dep_bw, because spaCy generates some very large values; map them to small integers starting from 0 for the sake of the next training step (see the sketch below).
(3) self.input_path is the original NYT dataset; self.output_path is your output file path.
(4) You should install the libraries you need.
(5) The original NYT dataset is available here.
(6) If you find this useful, please give it a like.
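
To illustrate notes (1) and (2), here is a rough sketch; the file name, the number of relation classes, and the 0.1 weight for the "None" class are assumptions for illustration, not part of the shared code:

```python
# Sketch of notes (1) and (2) above -- my own illustration, not LuoXukun's code.
import json
import torch
import torch.nn as nn

# (2) Remap spaCy's large dep IDs to small contiguous integers,
#     mapping both padding (0) and "no edge" (-1) to 0.
def build_id_map(values):
    mapping = {0: 0, -1: 0}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping

def remap_matrix(mat, mapping):
    return [[mapping.get(v, 0) for v in row] for row in mat]

# Collect all dep IDs from the generated json lines, then remap them.
samples = [json.loads(l) for l in open("train_graphrel.json", encoding="utf-8")]  # assumed output path
dep_ids = set()
for s in samples:
    for row in s["dep_fw"] + s["dep_bw"]:
        dep_ids.update(row)
dep_map = build_id_map(sorted(dep_ids))
for s in samples:
    s["dep_fw"] = remap_matrix(s["dep_fw"], dep_map)
    s["dep_bw"] = remap_matrix(s["dep_bw"], dep_map)

# (1) Instead of wgt_ne / wgt_rel, down-weight the "None" relation (id 1) and
#     ignore padding (id 0) directly in the loss function.
num_rel = 25                                            # assumed number of relation classes
weight = torch.ones(num_rel)
weight[1] = 0.1                                         # assumed weight for the "None" relation
criterion = nn.CrossEntropyLoss(weight=weight, ignore_index=0)
```

If needed, the same remapping idea can be applied to the spaCy pos IDs, which are also non-contiguous.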

niuweicai commented 4 years ago


Could you send the complete preprocessing code? I see that these functions are part of a class. My email is 605851710@qq.com. Thanks a lot!

rsanshierli commented 4 years ago


Can you share the complete code with me? I run into bugs when using your code and don't understand them. If possible, please send it to 470294527@qq.com. Thank you.

JerAex commented 4 years ago


Could you send the complete preprocessing code? My email is 1979453046@qq.com. Thank you very much!

LuoXukun commented 3 years ago

That is all of my preprocessing code. I have verified that it runs; if it doesn't work for you, please debug it yourself.

niuweicai commented 2 years ago

Do you still need it?


fancy999 commented 2 years ago


Yes, please.