sujitpal / ner-re-with-transformers-odsc2022

Building NER and RE components using HuggingFace Transformers
Apache License 2.0

Questions #2

Open evangeliazve opened 1 year ago

evangeliazve commented 1 year ago

Hi again,

Thank you again for this solution. I would be happy to share your work in any way you wish. Please let me know.

I have the following questions:

Best Regards, Evangelia Zve

sujitpal commented 1 year ago

Answers to your questions:

evangeliazve commented 1 year ago

Hello,

Thanks for your quick reply. It's clear. I have one more question: do you think it is possible to save the Relation Extraction model you proposed to the Hugging Face Hub?

Best, Evangelia Zve

sujitpal commented 1 year ago

I haven't tried it myself, but it should be possible with push_to_hub, as detailed on this page -- https://huggingface.co/docs/transformers/v4.15.0/model_sharing
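
Untested, but a minimal sketch might look like the following, assuming BertForRelationExtraction subclasses PreTrainedModel and you are logged in via huggingface-cli login (the repository name is a placeholder):

model.push_to_hub("your-username/nyt-relation-extraction")  # placeholder repo name
tokenizer.push_to_hub("your-username/nyt-relation-extraction")  # also share the tokenizer with its entity marker tokens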

evangeliazve commented 1 year ago

Thank you very much for your help

evangeliazve commented 1 year ago

Hello,

"Not directly, but you can adapt the Evaluation code (for example cell 29 in 05a-nyt-re-bert.ipynb) and the Preprocessing code to take a single sentence with entities, embed the PHRASE spans and encode them using the tokenizer into a batch size of 1."

Regarding this, I cannot handle the preprocessing part, as I need to define a relationship in order to create the span_idxs. Should I create span ids for every possible relationship and then predict whether each one is actually a relationship or not?

Thanks again

darebfh commented 1 year ago

Hello! I am also very interested in performing NER and relation extraction with 🤗 Transformers. I managed to adapt the code of 05-nyt-re-bert for inference (see below). You can then retrieve the name of the relation class using id2label() with the predicted output. @evangeliazve the span_idxs array does not contain relationships, but simply the positions of the two spans containing the named entities.

import os

import numpy as np
import torch

# tokenizer, MODEL_DIR, epoch, valid_relations and the
# BertForRelationExtraction class are defined in 05-nyt-re-bert.
model = BertForRelationExtraction.from_pretrained(
    os.path.join(MODEL_DIR, "ckpt-{:d}".format(epoch)), len(valid_relations))

input_object = {
    "tokens": ["But", "that", "spasm", "of", "irritation", "by", "a", "master", "intimidator", "was", "minor", "compared", "with", "what", "<S:PER>", "Bobby", "Fischer", "</S:PER>", ",", "the", "erratic", "former", "world", "chess", "champion", ",", "dished", "out", "in", "March", "at", "a", "news", "conference", "in", "Reykjavik", ",", "<O:LOC>", "Iceland", "</O:LOC>", "."]
}

def encode_data_inference(examples):
  tokenized_inputs = tokenizer(examples["tokens"],
                               is_split_into_words=True,
                               truncation=True,
                               return_tensors="pt") # needed here because during training the conversion to tensors is performed by the DataLoader
  span_idxs = []
  for input_id in tokenized_inputs.input_ids:
    tokens = tokenizer.convert_ids_to_tokens(input_id)
    print(tokens)
    # record the positions of the subject and object entity markers
    span_idxs.append([
      [idx for idx, token in enumerate(tokens) if token.startswith("<S:")][0],
      [idx for idx, token in enumerate(tokens) if token.startswith("</S:")][0],
      [idx for idx, token in enumerate(tokens) if token.startswith("<O:")][0],
      [idx for idx, token in enumerate(tokens) if token.startswith("</O:")][0]
    ])
  tokenized_inputs["span_idxs"] = torch.from_numpy(np.array(span_idxs)) # manually create a tensor containing the span positions
  return tokenized_inputs

inputs = encode_data_inference(input_object)

with torch.no_grad():
    logits = model(**inputs).logits
    print(logits)
    predictions = torch.argmax(logits, dim=-1).cpu().numpy()
print(predictions)

Output:

['[CLS]', 'But', 'that', 'spa', '##sm', 'of', 'irritation', 'by', 'a', 'master', 'in', '##ti', '##mi', '##da', '##tor', 'was', 'minor', 'compared', 'with', 'what', '<S:PER>', 'Bobby', 'Fischer', '</S:PER>', ',', 'the', 'erratic', 'former', 'world', 'chess', 'champion', ',', 'dish', '##ed', 'out', 'in', 'March', 'at', 'a', 'news', 'conference', 'in', 'Rey', '##k', '##ja', '##vik', ',', '<O:LOC>', 'Iceland', '</O:LOC>', '.', '[SEP]']
tensor([[-4.2859, -0.3964, -0.4866, -1.7542,  6.7569, -5.2384,  0.4867,  2.8524,  2.5765]])
[6]

sujitpal commented 1 year ago

Sorry for the delay in responding; it looks like I missed this comment. And thanks for the nice example @darebfh! It looks like it predicted an incorrect relationship id 6, which is location/neighborhood/neighborhood_of, but given that there does not seem to be anything specifically defined for (person, ?, location), maybe this is the best it could do.
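
For completeness, a minimal sketch of the id2label lookup @darebfh mentions, assuming the mapping is built from the notebook's valid_relations list (the exact construction in the notebook may differ):

id2label = {idx: label for idx, label in enumerate(valid_relations)}  # assumed construction
print(id2label[int(predictions[0])])  # e.g. location/neighborhood/neighborhood_of for id 6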