Closed fatihbeyhan closed 2 years ago
Hi @fatihbeyhan,
We are sorry, but the entity disambiguation example is currently incompatible with the pretraining instructions. We are now working on developing new entity disambiguation code and will release it in the near future.
Hi @ikuyamada,
Thank you for your reply. I am working on implementing the LUKE entity disambiguation system for Turkish as part of my thesis. I have seen an example of building this system in one of the previous issues (#115). Would that path still work? If not, can you tell me approximately when the ED instructions will be ready?
Thank you for this great work.
The pretraining code should work correctly. However, the entity disambiguation example is not currently compatible with the LUKE model in the transformers library. Therefore, you need to create a model archive file using the create-model-archive function and pass it to the --model-file argument.
The dataset that we used in our experiments is equivalent to the one published here. You need to convert your dataset to the format of the original dataset.
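For anyone attempting the conversion: a minimal sketch of turning simple mention annotations into one tab-separated line per mention. The field layout, file names, and the "|" candidate separator here are illustrative assumptions only, not the actual layout of the original dataset files, which you should inspect directly.

```python
# Hypothetical converter sketch: each record holds a document id, mention
# surface form, gold entity title, and candidate titles. The target layout
# below is an ASSUMPTION for illustration; match the real dataset's format.

mentions = [
    {"doc": "doc_001", "text": "Ankara", "gold": "Ankara",
     "candidates": ["Ankara", "Ankara Province"]},
    {"doc": "doc_001", "text": "Atatürk", "gold": "Mustafa Kemal Atatürk",
     "candidates": ["Mustafa Kemal Atatürk", "Atatürk Dam"]},
]

def to_line(m):
    # One mention per line; candidates joined by "|" (an illustrative choice).
    return "\t".join([m["doc"], m["text"], m["gold"], "|".join(m["candidates"])])

lines = [to_line(m) for m in mentions]
for line in lines:
    print(line)
```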
Hi @ikuyamada,
Thank you for your response. As I understand it, I should use the create-model-archive function to create a .tar.gz file that will be compatible with the entity disambiguation example code. This function takes three arguments: model_file, out_file, and compress. The model_file should come from the pretraining, right? But I am not sure which code you meant by "The pretraining code should work correctly." The one you recently shared for LUKE pretraining, or the one in issue #115?
Hi, do you have any updates on instructions for Entity Disambiguation pretraining for a different language?
Hi @fatihbeyhan, I am sorry for the delayed reply! I am working on the new entity disambiguation code and pretraining instructions in this branch. This branch is work-in-progress, so I will notify you when I complete the work.
@fatihbeyhan I have completed the work, and the new pretraining instructions for entity disambiguation are available here.
@ikuyamada Thank you for your time and work! As I've said, I am trying to follow the instructions to train LUKE (entity disambiguation) for Turkish. However, I believe there is an error. When I run the following command:
python luke/cli.py build-dump-db trwiki-latest-pages-articles.xml.bz2 trwiki.db
It parses the Turkish Wikipedia dump without an error. Then when I try to run the next step:
python examples/entity_disambiguation/scripts/create_candidate_data.py --db-file=examples/entity_disambiguation/trwiki-20221001.db --dataset-dir=examples/entity_disambiguation/turkish-entity-disambiguation-dataset/ --output-file=candidates.txt
I get this error:
Traceback (most recent call last):
File "examples/entity_disambiguation/scripts/create_candidate_data.py", line 33, in <module>
create_candidate_data()
File "/home/fatihbeyhan/anaconda3/envs/luke-0.2.0/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/fatihbeyhan/anaconda3/envs/luke-0.2.0/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/fatihbeyhan/anaconda3/envs/luke-0.2.0/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/fatihbeyhan/anaconda3/envs/luke-0.2.0/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "examples/entity_disambiguation/scripts/create_candidate_data.py", line 18, in create_candidate_data
dataset = load_dataset(dataset_dir)
File "/home/fatihbeyhan/Research/Thesis/Modelling/luke/examples/entity_disambiguation/scripts/dataset.py", line 90, in load_dataset
with open(os.path.join(dataset_dir, "persons.txt")) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'examples/entity_disambiguation/turkish-entity-disambiguation-dataset/persons.txt'
From this I understand that the previous step (build-dump-db) should have output some files, but I cannot see persons.txt or any other output file except the .db file.
Is this a bug or am I missing something? Thank you for your time!
Hi @fatihbeyhan,
Thanks for your reply! I am looking forward to the Turkish version of LUKE!!😍
create_candidate_data.py generates a text file containing the list of entities included as candidates in the English entity disambiguation dataset. The entity disambiguation dataset is based on the one proposed in Ganea and Hofmann, 2017, and persons.txt is the file that contains English names used in the heuristic coreference resolution step.
In our entity disambiguation experiments, we built the entity vocabulary using only entity candidates contained in the dataset, so the list of the entity candidates is used to build the entity vocabulary.
The output of the script is a text file simply containing entity titles separated by new lines, so you can easily create the vocabulary of Turkish entities based on your needs.
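Since the output is just one entity title per line, building a Turkish entity vocabulary from it can be as simple as the following sketch. The special entries ([PAD], [UNK], [MASK]) are an assumption about what the vocabulary needs, not something stated above.

```python
# Minimal sketch: build an entity vocabulary (title -> id) from lines of
# entity titles, one per line, as produced by create_candidate_data.py.
# The special tokens below are an ASSUMPTION for illustration.

def build_entity_vocab(lines):
    vocab = {"[PAD]": 0, "[UNK]": 1, "[MASK]": 2}
    for line in lines:
        title = line.strip()
        if title and title not in vocab:  # skip blanks and duplicates
            vocab[title] = len(vocab)
    return vocab

titles = ["Ankara", "İstanbul", "Ankara", ""]  # e.g. open("candidates.txt")
vocab = build_entity_vocab(titles)
```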
Thank you again for your help. I am trying to evaluate the model I trained (not sure if I did it correctly, since I made many modifications to get it to work) on a custom test set from the Wikipedia dump I used for pretraining. However, the evaluation script for entity disambiguation seems to be designed for the test sets used in your paper. Any recommendations?
I have been looking for a setting to supply the train, validation, and test sets that I created in order to have a proper comparison, but I believe there is no such option. Am I wrong?
By the way, I am trying to pretrain Luke for only the 2000 most frequent entities in Turkish Wikipedia.
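For the frequency cutoff, something like the following could select the top entities, assuming entity occurrences (e.g. link targets extracted from the dump) can be iterated as a list of titles; the toy data here is a stand-in.

```python
from collections import Counter

# Sketch: restrict the entity vocabulary to the N most frequent entities.
# "occurrences" would come from the dump (e.g. all link targets); the list
# below is a toy stand-in, and top_n would be 2000 in my actual setup.

occurrences = ["Ankara", "İstanbul", "Ankara", "İzmir", "Ankara", "İstanbul"]
top_n = 2
most_frequent = [title for title, _ in Counter(occurrences).most_common(top_n)]
```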
Thanks for your reply!
However, the evaluation script for the entity disambiguation seems to be designed for the test sets used in your paper.
The entity disambiguation code is written only for the datasets used in our experiments. However, if I remember correctly, the code except for the dataset-specific parts (e.g., the EntityDisambiguationDataset class) can be used for other datasets.
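In other words, replacing the dataset-specific loader might look roughly like the sketch below. All class names, field names, and the input layout here are assumptions for illustration; the real interface expected by the rest of the code (e.g. what EntityDisambiguationDataset actually exposes) must be checked against the repository.

```python
# Hedged sketch of a custom loader standing in for the dataset-specific
# class. The record layout and attribute names are ASSUMPTIONS, not the
# repository's actual API.

class CustomEDDocument:
    def __init__(self, doc_id, words, mentions):
        self.id = doc_id
        self.words = words        # tokenized document text
        self.mentions = mentions  # list of (start, end, gold_title, candidates)

def load_custom_dataset(records):
    # records: iterable of dicts in a simple hypothetical JSON-like layout
    return [
        CustomEDDocument(
            r["id"],
            r["words"],
            [(m["start"], m["end"], m["gold"], m["candidates"])
             for m in r["mentions"]],
        )
        for r in records
    ]

docs = load_custom_dataset([{
    "id": "tr_doc_1",
    "words": ["Ankara", ",", "Türkiye'nin", "başkentidir", "."],
    "mentions": [{"start": 0, "end": 1, "gold": "Ankara",
                  "candidates": ["Ankara", "Ankara Province"]}],
}])
```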
I am closing this issue as there has been no recent activity.
This issue was about pretraining for a different language and testing LUKE for ED. However, the provided code is designed for a few English datasets. Let's keep this issue open until we have the option to start training from a Wikipedia dump in any language and to test on that same dump. Thank you for your help!
Hi,
I have pretrained an entity disambiguation model with the recent LUKE pretraining instructions. From the instructions shared here and in another issue (#115), I was able to perform the two-stage pretraining for Turkish.
I am sharing the commands and config files I used to make sure nothing is illogical. I ran the following command for the first stage with the configuration setup below. Command:
Config:
And for the second stage, the command and configuration setup are below. Command:
Config:
I guess I now have a model that can perform entity disambiguation? The problem is that I cannot find a clear example of running or evaluating a pretrained model on entity disambiguation. How should the data be formatted? How should I call the model to make predictions?
Thank you for your time.