sujitpal / ner-re-with-transformers-odsc2022

Building NER and RE components using HuggingFace Transformers
Apache License 2.0

Questions #2

Open evangeliazve opened 1 year ago

evangeliazve commented 1 year ago

Hi again,

Thank you again for this solution. I would be happy to share your work in any way you wish. Please let me know.

I have the following questions:

Best Regards, Evangelia Zve

sujitpal commented 1 year ago

Answers to your questions:

evangeliazve commented 1 year ago

Hello,

Thanks for your quick reply. It's clear. I have one more question: do you think it is possible to save the Relation Extraction model you proposed to the Hugging Face Hub?

Best, Evangelia Zve

sujitpal commented 1 year ago

I haven't tried it myself, but it should be possible with push_to_hub, as detailed on this page -- https://huggingface.co/docs/transformers/v4.15.0/model_sharing
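
Untested, but a minimal sketch might look like the following, assuming BertForRelationExtraction subclasses PreTrainedModel and you are logged in via huggingface-cli login (the repository name is a placeholder):

model.push_to_hub("your-username/nyt-relation-extraction")  # placeholder repo name
tokenizer.push_to_hub("your-username/nyt-relation-extraction")  # also share the tokenizer with its entity marker tokens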

evangeliazve commented 1 year ago

Thank you very much for your help

evangeliazve commented 1 year ago

Hello,

"Not directly, but you can adapt the Evaluation code (for example cell 29 in 05a-nyt-re-bert.ipynb) and the Preprocessing code to take a single sentence with entities, embed the PHRASE spans and encode them using the tokenizer into a batch size of 1."

Regarding this, I cannot handle the preprocessing part, as I need to define a relationship in order to create the span_idxs. Should I create span ids for every possible relationship and then predict whether each one is actually a relationship or not?

Thanks again

darebfh commented 1 year ago

Hello! I am also very interested in performing NER and relation extraction with 🤗 Transformers. I managed to adapt the code of 05-nyt-re-bert for inference (see below). You can then retrieve the name of the relation class using id2label() with the predicted output. @evangeliazve the span_idxs array does not contain relationships, but simply the positions of the two spans containing the named entities.

import os

import numpy as np
import torch

# tokenizer, MODEL_DIR, epoch, valid_relations and the
# BertForRelationExtraction class are defined in 05-nyt-re-bert.
model = BertForRelationExtraction.from_pretrained(
    os.path.join(MODEL_DIR, "ckpt-{:d}".format(epoch)), len(valid_relations))

input_object = {
    "tokens": ["But", "that", "spasm", "of", "irritation", "by", "a", "master", "intimidator", "was", "minor", "compared", "with", "what", "<S:PER>", "Bobby", "Fischer", "</S:PER>", ",", "the", "erratic", "former", "world", "chess", "champion", ",", "dished", "out", "in", "March", "at", "a", "news", "conference", "in", "Reykjavik", ",", "<O:LOC>", "Iceland", "</O:LOC>", "."]
}

def encode_data_inference(examples):
  tokenized_inputs = tokenizer(examples["tokens"],
                               is_split_into_words=True,
                               truncation=True,
                               return_tensors="pt") # needed here because during training the conversion to tensors is performed by the DataLoader
  span_idxs = []
  for input_id in tokenized_inputs.input_ids:
    tokens = tokenizer.convert_ids_to_tokens(input_id)
    print(tokens)
    # record the positions of the subject and object entity markers
    span_idxs.append([
      [idx for idx, token in enumerate(tokens) if token.startswith("<S:")][0],
      [idx for idx, token in enumerate(tokens) if token.startswith("</S:")][0],
      [idx for idx, token in enumerate(tokens) if token.startswith("<O:")][0],
      [idx for idx, token in enumerate(tokens) if token.startswith("</O:")][0]
    ])
  tokenized_inputs["span_idxs"] = torch.from_numpy(np.array(span_idxs)) # manually create a tensor containing the span positions
  return tokenized_inputs

inputs = encode_data_inference(input_object)

with torch.no_grad():
    logits = model(**inputs).logits
    print(logits)
    predictions = torch.argmax(logits, dim=-1).cpu().numpy()
print(predictions)

Output:

['[CLS]', 'But', 'that', 'spa', '##sm', 'of', 'irritation', 'by', 'a', 'master', 'in', '##ti', '##mi', '##da', '##tor', 'was', 'minor', 'compared', 'with', 'what', '<S:PER>', 'Bobby', 'Fischer', '</S:PER>', ',', 'the', 'erratic', 'former', 'world', 'chess', 'champion', ',', 'dish', '##ed', 'out', 'in', 'March', 'at', 'a', 'news', 'conference', 'in', 'Rey', '##k', '##ja', '##vik', ',', '<O:LOC>', 'Iceland', '</O:LOC>', '.', '[SEP]']
tensor([[-4.2859, -0.3964, -0.4866, -1.7542,  6.7569, -5.2384,  0.4867,  2.8524,  2.5765]])
[6]

sujitpal commented 1 year ago

Sorry for the delay in responding; it looks like I missed this comment. And thanks for the nice example @darebfh! It looks like it predicted an incorrect relationship id 6, which is location/neighborhood/neighborhood_of, but given that there does not seem to be anything specifically defined for (person, ?, location), maybe this is the best it could do.
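
For completeness, a minimal sketch of the id2label lookup @darebfh mentions, assuming the mapping is built from the notebook's valid_relations list (the exact construction in the notebook may differ):

id2label = {idx: label for idx, label in enumerate(valid_relations)}  # assumed construction
print(id2label[int(predictions[0])])  # e.g. location/neighborhood/neighborhood_of for id 6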