studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings

Pretraining for a Different Language and Testing Pretrained Model for Entity Disambiguation #126

Closed fatihbeyhan closed 2 years ago

fatihbeyhan commented 2 years ago

Hi,

I have pretrained an entity disambiguation model following the recent LUKE pretraining instructions. Using the instructions shared here and in issue #115, I was able to perform the two-stage pretraining for Turkish.

I am sharing the commands and config files I used so you can check that nothing is off. For the first stage, I ran the following command with the configuration below. Command:

deepspeed \
--num_gpus=6 luke/pretraining/train.py \
--output-dir=training_on_turkish/luke-bert-base-turkish-first-stage \
--deepspeed-config-file=pretraining_config/luke_base_stage1.json \
--dataset-dir=training_on_turkish/tr_pretraining_dataset \
--bert-model-name=dbmdz/bert-base-turkish-uncased  \
--num-epochs=1 \
--fix-bert-weights \
--masked-entity-prob=0.30 \
--masked-lm-prob=0

Config:

{
  "train_batch_size": 24,
  "train_micro_batch_size_per_gpu": 4,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-6,
      "weight_decay": 0.01,
      "bias_correction": false
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 5e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 192796,
      "warmup_type": "linear"
    }
  },
  "gradient_clipping": 10000.0
}
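
As a quick sanity check on these values: DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu × number of GPUs × gradient accumulation steps. A minimal check of the numbers above (the variable names are mine):

# DeepSpeed consistency check:
# train_batch_size == train_micro_batch_size_per_gpu * num_gpus * grad_accum_steps
num_gpus = 6              # from --num_gpus
micro_batch_per_gpu = 4   # train_micro_batch_size_per_gpu
train_batch_size = 24     # train_batch_size

grad_accum_steps = train_batch_size // (micro_batch_per_gpu * num_gpus)
assert micro_batch_per_gpu * num_gpus * grad_accum_steps == train_batch_size
print(grad_accum_steps)   # -> 1, i.e. no gradient accumulation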

The second-stage command and its configuration are below. Command:

deepspeed \
--num_gpus=6 luke/pretraining/train.py \
--output-dir=training_on_turkish/luke-bert-base-turkish-second-stage \
--deepspeed-config-file=pretraining_config/luke_base_stage2.json \
--dataset-dir=training_on_turkish/tr_pretraining_dataset/ \
--bert-model-name=dbmdz/bert-base-turkish-uncased \
--num-epochs=5 \
--reset-optimization-states \
--resume-checkpoint-id=training_on_turkish/luke-bert-base-turkish-first-stage/checkpoints/epoch1/ \
--masked-entity-prob=0.30 \
--masked-lm-prob=0

Config:

{
  "train_batch_size": 24,
  "train_micro_batch_size_per_gpu": 4,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-6,
      "weight_decay": 0.01,
      "bias_correction": false
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-5,
      "warmup_num_steps": 2500,
      "total_num_steps": 98168,
      "warmup_type": "linear"
    }
  },
  "gradient_clipping": 10000.0
}

I assume I now have a model that can perform entity disambiguation? The problem is that I cannot find any clear example of running or evaluating a pretrained model on entity disambiguation. How should the data be formatted? How should I call the model to make predictions?

Thank you for your time.

ikuyamada commented 2 years ago

Hi @fatihbeyhan,

We are sorry, but the entity disambiguation example is currently incompatible with the pretraining instructions. We are now developing new entity disambiguation code and will release it in the near future.

fatihbeyhan commented 2 years ago

Hi @ikuyamada,

Thank you for your reply. I am working on implementing the LUKE entity disambiguation system for Turkish as part of my thesis. I have seen an example of building this system in a previous issue, #115. Would that path still work? If not, can you tell me approximately when the ED instructions will be ready?

Thank you for this great work.

ikuyamada commented 2 years ago

The pretraining code should work correctly. However, the entity disambiguation example is not currently compatible with the LUKE model in the transformers library. Therefore, you need to create a model archive file using the create-model-archive function and pass it to the --model-file argument.
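
For illustration, the archive step would look roughly like this (a sketch only: I am assuming here that create-model-archive is exposed through luke/cli.py, and the checkpoint path and output name are placeholders; please check the CLI help for the exact flags):

python luke/cli.py create-model-archive \
--model-file=<path-to-pretrained-checkpoint> \
--out-file=luke-turkish-ed.tar.gz \
--compress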

The dataset that we used in our experiments is equivalent to the one published here. You need to convert your dataset to the format of that original dataset.

fatihbeyhan commented 2 years ago

Hi @ikuyamada,

Thank you for your response. As I understand it, I should use the create-model-archive function to create a .tar.gz file that is compatible with the entity disambiguation example code. The function takes three arguments: model_file, out_file, and compress. The model_file should come from the pretraining, right? But I am not sure which code you meant by "The pretraining code should work correctly." The one you recently shared for LUKE pretraining, or the one in issue #115?

fatihbeyhan commented 2 years ago

Hi, do you have any updates on the instructions for entity disambiguation pretraining for a different language?

ikuyamada commented 2 years ago

Hi @fatihbeyhan, I am sorry for the delayed reply! I am working on the new entity disambiguation code and pretraining instructions in this branch. The branch is still a work in progress, so I will notify you when the work is complete.

ikuyamada commented 2 years ago

@fatihbeyhan I have completed the work, and the new pretraining instructions for entity disambiguation are available here.

fatihbeyhan commented 2 years ago

@ikuyamada Thank you for your time and work! As I've said, I am trying to follow the instructions to train LUKE (entity disambiguation) for Turkish. However, I believe there is an error. When I run the following command:

python luke/cli.py build-dump-db trwiki-latest-pages-articles.xml.bz2 trwiki.db

It parses the Turkish Wikipedia dump without an error. Then when I try to run the next step:

python examples/entity_disambiguation/scripts/create_candidate_data.py \
--db-file=examples/entity_disambiguation/trwiki-20221001.db \
--dataset-dir=examples/entity_disambiguation/turkish-entity-disambiguation-dataset/ \
--output-file=candidates.txt

I get this error:

Traceback (most recent call last):
  File "examples/entity_disambiguation/scripts/create_candidate_data.py", line 33, in <module>
    create_candidate_data()
  File "/home/fatihbeyhan/anaconda3/envs/luke-0.2.0/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/fatihbeyhan/anaconda3/envs/luke-0.2.0/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/fatihbeyhan/anaconda3/envs/luke-0.2.0/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/fatihbeyhan/anaconda3/envs/luke-0.2.0/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "examples/entity_disambiguation/scripts/create_candidate_data.py", line 18, in create_candidate_data
    dataset = load_dataset(dataset_dir)
  File "/home/fatihbeyhan/Research/Thesis/Modelling/luke/examples/entity_disambiguation/scripts/dataset.py", line 90, in load_dataset
    with open(os.path.join(dataset_dir, "persons.txt")) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'examples/entity_disambiguation/turkish-entity-disambiguation-dataset/persons.txt'

From this I understand that the previous step (build-dump-db) should have produced some files. However, I cannot see persons.txt or any other output file besides the .db file.

Is this a bug or am I missing something? Thank you for your time!

ikuyamada commented 2 years ago

Hi @fatihbeyhan,

Thanks for your reply! I am looking forward to the Turkish version of LUKE!!😍

create_candidate_data.py generates a text file that contains a list of entities included as candidates in the English entity disambiguation dataset. The entity disambiguation dataset is based on the one proposed in Ganea and Hofmann, 2017, and persons.txt is the file that contains English names used in the heuristic coreference resolution step.

In our entity disambiguation experiments, we built the entity vocabulary using only the entity candidates contained in the dataset, so this list of candidates is what the entity vocabulary is built from.

The output of the script is simply a text file containing entity titles separated by newlines, so you can easily create a vocabulary of Turkish entities to suit your needs.
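
For example, you could assemble a Turkish candidate list in that format directly from the dump database. A minimal sketch, assuming the DumpDB API from wikipedia2vec (which build-dump-db is built on); the file names and the 2000-entity cutoff are placeholders:

from collections import Counter

from wikipedia2vec.dump_db import DumpDB

# Count how often each page is linked to across the dump and keep the
# most frequent titles as the candidate list (one title per line).
db = DumpDB("trwiki.db")  # the output of build-dump-db
counter = Counter()
for title in db.titles():
    for paragraph in db.get_paragraphs(title):
        for link in paragraph.wiki_links:
            counter[db.resolve_redirect(link.title)] += 1

with open("candidates.txt", "w") as f:
    for entity_title, _ in counter.most_common(2000):
        f.write(entity_title + "\n")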

fatihbeyhan commented 2 years ago

Thank you again for your help. I am trying to evaluate the model I trained (I am not sure I did it correctly, since I made many modifications to get it working) on a custom test set built from the Wikipedia dump I used for pretraining. However, the evaluation script for entity disambiguation seems to be designed for the test sets used in your paper. Any recommendations?

I have been looking for a setting that would let me supply the train, validation, and test sets I created, in order to have a proper comparison, but I believe there is no such option. Am I wrong?

By the way, I am trying to pretrain LUKE with only the 2,000 most frequent entities in Turkish Wikipedia.

ikuyamada commented 2 years ago

Thanks for your reply!

> However, the evaluation script for entity disambiguation seems to be designed for the test sets used in your paper.

The entity disambiguation code is written only for the datasets used in our experiments. However, if I remember correctly, everything except the dataset-specific parts (e.g., the EntityDisambiguationDataset class) can be used for other datasets.
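
For instance, a custom loader only needs to produce documents in the shape the rest of the code consumes. A hypothetical sketch (these dataclasses are illustrative stand-ins, not the actual classes in examples/entity_disambiguation/scripts/dataset.py):

from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins for the dataset-specific structures.
@dataclass
class Mention:
    text: str                # surface form in the document
    title: str               # gold Wikipedia title
    candidates: List[str]    # candidate entity titles for this mention

@dataclass
class Document:
    id: str
    words: List[str]
    mentions: List[Mention] = field(default_factory=list)

def load_turkish_dataset(dataset_dir: str) -> List[Document]:
    # Parse your own annotation format into the structures above,
    # mirroring what load_dataset() does for the English datasets.
    raise NotImplementedError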

ikuyamada commented 2 years ago

I am closing this issue as there has been no recent activity.

fatihbeyhan commented 1 year ago

This issue was about pretraining for a different language and testing 'LUKE for ED'. However, the code provided is designed for a handful of English datasets. Let's keep this issue open unless there is an option to start training from a Wikipedia dump in any language and to test on that same dump. Thank you for your help!