studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0

Fine-tuning Custom Relations on Non-Wikipedia text #136

Closed bvnagaraju closed 2 years ago

bvnagaraju commented 2 years ago

Hello,

I would like to use the fine-tuning procedure (relation classification) to achieve the tasks below.

  1. Add new relation classes as part of multi-class classification
  2. Fine-tune on a custom dataset, say "MyEnglishNewsDataset"
  3. Add new entity types as part of NER

Are there any recommended steps one should follow for fine-tuning on custom classes and text? Documentation on this would also be beneficial.

Is the current fine-tuning procedure targeted only for TACRED, KBP37, and RELX?

Will there be any issues with entity names (the entity vocabulary) if the entity names in my dataset differ from the current entity vocabulary, which appears to be derived from Wikipedia?

ryokan0123 commented 2 years ago

Hi, @bvnagaraju.

Are there any recommended steps one should follow for fine-tuning on custom classes and text? Documentation on this would also be beneficial.

To adapt our code to new datasets or add new classes, you need to modify the DatasetReader class: https://github.com/studio-ousia/luke/blob/master/examples/relation_classification/reader.py https://github.com/studio-ousia/luke/blob/master/examples/ner/reader.py

This guide to the allennlp library should help you write allennlp-based code. https://guide.allennlp.org/
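To make the kind of change involved more concrete, here is a minimal sketch of the reading logic you would adapt. The file format, field names, and label list below are assumptions for illustration, not LUKE's actual reader code; in the real repo you would subclass allennlp's DatasetReader in the files linked above.

```python
import json

# Hypothetical label set: extend this list with your new relation classes.
MY_RELATION_LABELS = ["no_relation", "works_for", "founded_by"]

def read_examples(jsonl_text):
    """Parse one JSON object per line into the fields a relation
    classification model needs: tokens, head/tail spans, and a label id."""
    examples = []
    for line in jsonl_text.splitlines():
        record = json.loads(line)
        label = record["relation"]
        if label not in MY_RELATION_LABELS:
            raise ValueError(f"unknown relation label: {label}")
        examples.append({
            "tokens": record["tokens"],
            "head_span": tuple(record["head_span"]),
            "tail_span": tuple(record["tail_span"]),
            "label": MY_RELATION_LABELS.index(label),
        })
    return examples
```

In an allennlp DatasetReader, the same logic would live in `_read`, which yields `Instance` objects instead of plain dicts.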

Is the current fine-tuning procedure targeted only for TACRED, KBP37, and RELX?

The current code only supports only these datasets, but should be able to handle other datasets by modifying the DatasetReader code.

Will there be any issues with entity names (the entity vocabulary) if the entity names in my dataset differ from the current entity vocabulary, which appears to be derived from Wikipedia?

If you want to use entities, you should check carefully whether they exist in our entity vocabulary. If a name differs from the ones in the entity vocabulary, the model cannot use that entity correctly. You may need some way to match the entity names.
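A small sketch of what that matching step could look like. The vocabulary contents and the normalization rule here are assumptions for illustration, not LUKE's actual vocabulary or API.

```python
# Hypothetical fragment of a Wikipedia-derived entity vocabulary.
ENTITY_VOCAB = {"United States": 3, "Barack Obama": 4}

def normalize(name):
    # A simple matching heuristic; real data may need alias tables,
    # Wikipedia redirects, or fuzzy matching instead.
    return name.strip().title()

def lookup_entity(name):
    """Return the entity id, or None if the name is out of vocabulary."""
    return ENTITY_VOCAB.get(normalize(name))
```

Out-of-vocabulary names return `None`, which your DatasetReader then has to handle explicitly (e.g. by dropping the entity).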

bvnagaraju commented 2 years ago

Hi @Ryou0634

Does the model rely on other features (like word embeddings) if the entity name doesn't match the entity vocabulary? How might the model behave if the entity name is not in the entity vocabulary? Is there a way to update the entity vocabulary or build embeddings/features for new entities from custom (non-Wikipedia) text?

ryokan0123 commented 2 years ago

does the model rely on other features(like word embeddings) if the entity name doesn't match in the entity vocabulary?

Which model do you have in mind?

You can just use the LUKE encoder with word inputs, which does not require entities. With this model, if the entity name doesn't have a match in the entity vocabulary, it just uses word features. With the entity-based model, you should explicitly handle out-of-vocabulary entities (see below).

how might the model behave if the entity name is not in the entity vocabulary?

If you want to use entity features, you should explicitly detect entities and convert them to entity ids in DatasetReader. If the entity name is not found in the entity vocabulary, you just cannot feed the entity ids to the model.
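As a sketch of that DatasetReader step, out-of-vocabulary mentions can simply be dropped so the model falls back to word features alone for those spans. The vocabulary and data structures below are assumptions for illustration.

```python
# Hypothetical fragment of the entity vocabulary.
ENTITY_VOCAB = {"Tokyo": 10, "Japan": 11}

def entities_to_ids(mentions):
    """Convert detected (name, token_span) mentions to entity ids,
    keeping only mentions found in the entity vocabulary.

    Dropped mentions get no entity features; the model then relies on
    word features alone for those spans."""
    ids, spans = [], []
    for name, span in mentions:
        entity_id = ENTITY_VOCAB.get(name)
        if entity_id is not None:
            ids.append(entity_id)
            spans.append(span)
    return ids, spans
```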

For clarification, our model for relation extraction does not use any entity embeddings except for [MASK] tokens (head and tail tokens).
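In other words, for relation extraction the entity input consists of just two entity [MASK] embeddings placed on the head and tail mention spans. A minimal sketch, assuming a hypothetical id for the entity [MASK] token:

```python
# Hypothetical id of the [MASK] entry in the entity vocabulary;
# the real id depends on the pretrained model's vocabulary.
ENTITY_MASK_ID = 1

def build_entity_inputs(head_span, tail_span):
    """Build the entity-side inputs for one relation classification
    example: one [MASK] entity per mention, aligned to its token span."""
    return {
        "entity_ids": [ENTITY_MASK_ID, ENTITY_MASK_ID],
        "entity_spans": [head_span, tail_span],
    }
```

Because only the [MASK] entity embedding is used here, the head and tail names never need to be looked up in the entity vocabulary for this task.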