huggingface / neuralcoref

✨Fast Coreference Resolution in spaCy with Neural Networks
https://huggingface.co/coref/
MIT License

Training Neuralcoref for Dutch does not work #255

Closed EricLe-dev closed 2 years ago

EricLe-dev commented 4 years ago

Dear guys,

Firstly, thank you so much for this interesting work. I'm training the neuralcoref model for Dutch using the SoNaR corpus. At first, I used this script to convert the MMAX format to CoNLL format. After that, I trained a w2v model to prepare the static_word_embedding files. I have a few questions that I could not answer myself and could not find answered anywhere else.
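
For illustration, here is a minimal sketch of how such static embedding files can be prepared with gensim (the vector size of 320 and the output file names are taken from later messages in this thread; the exact format neuralcoref expects may differ):

```python
# Sketch only: gensim 4.x; file names follow the ones used later in this thread.
import numpy as np
from gensim.models import Word2Vec

# Toy stand-in for the tokenized SoNaR sentences.
sentences = [["de", "kat", "zit", "op", "de", "mat"],
             ["hij", "slaapt", "de", "hele", "dag"]]

w2v = Word2Vec(sentences, vector_size=320, min_count=1, workers=4)

# Dump the embedding matrix and the matching vocabulary, one word per line.
np.save("static_word_embeddings.npy", w2v.wv.vectors)
with open("static_word_vocabulary.txt", "w", encoding="utf8") as f:
    f.write("\n".join(w2v.wv.index_to_key))
```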

I came across many topics and posted questions on several threads, but I still have not received any help or guidance. Thank you so much for any help any of you can provide.

With best regards, Eric

chieter commented 4 years ago

Hey there, I think I can help you with some of your questions, or at least tell you what I did.

I also found that using the model after training requires some further work, as the model loaded from the neuralcoref cache folder is in a different format. I'm currently investigating how to get this to work. You should also keep in mind that your results rely heavily on your mention extraction (as noted in the training instructions).

EricLe-dev commented 4 years ago

@chieter Thank you so much for sharing the details of your work. I'm following your advice, and it fixed the problem regarding the Perl scorer. I am also adding blank vectors for the MISSING and UNK tokens.
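
For illustration, adding those blank vectors could look like the sketch below (the special-token strings and their position at the top of the vocabulary are assumptions for illustration, not neuralcoref's documented format):

```python
import numpy as np

emb = np.load("static_word_embeddings.npy")
blank = np.zeros((1, emb.shape[1]), dtype=emb.dtype)

# Prepend one all-zero row per special token (hypothetical names and order).
emb = np.vstack([blank, blank, emb])
np.save("static_word_embeddings.npy", emb)

# The vocabulary must get matching entries at the same positions.
with open("static_word_vocabulary.txt", encoding="utf8") as f:
    words = f.read().split("\n")
with open("static_word_vocabulary.txt", "w", encoding="utf8") as f:
    f.write("\n".join(["*MISSING*", "*UNK*"] + words))
```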

You are right that using the model after training requires further work; I have already seen people posting that question on StackOverflow without getting a proper answer. I will share what I find here when I reach that stage.

chieter commented 4 years ago

To get your work started, here is what I've been doing:

  1. Install the alpha version of thinc, as this version includes the PyTorchWrapper.
  2. Create an instance of the model in a Python shell. I basically used the same instantiation that is used in train/dataset.py: `model = Model(len(voc), SIZE_EMBEDDING, args.h1, args.h2, args.h3, SIZE_PAIR_IN, SIZE_SINGLE_IN)`. Adjust the parameters according to the ones you used during training.
  3. Load the model checkpoint you want to use: `model.load_state_dict(torch.load(path_to_checkpoint_file))`. You can find this in train/learn.py.
  4. If everything worked so far, you should see a message that all keys could be matched, and if you print the model you should see the different parts of the net labeled (word_embeds), (drop), (pair_top) and (single_top).
  5. As far as I could tell, the (pair_top) and (single_top) parts are the ones used in the model loaded from the neuralcoref cache. To avoid exporting the dropout layers used in training, switch the model to evaluation mode with `model.eval()`.
  6. You can now use the PyTorchWrapper from the thinc alpha: `from thinc.api import PyTorchWrapper`, then `pair_top_model = PyTorchWrapper(model.pair_top)` and `single_top_model = PyTorchWrapper(model.single_top)`.
  7. These models can now be written to disk: `f = open('trained_pair_top_model', 'wb'); f.write(pair_top_model.to_bytes()); f.close()`. You can save the single_top model in the same manner. (A consolidated sketch of these steps follows below.)

I hope this is helpful to you, @EricLe-dev. There are some unsolved problems, however: the model loaded from the cache includes two folders, static_vectors and tuned_vectors, and I'm not yet sure whether these can be extracted from the model loaded in step 3. Maybe @thomwolf or @svlandeg could elaborate on that. It would also be useful to know whether this is even the right approach. Take care everyone, and thanks for your work <3
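
Putting steps 1-7 together, a minimal end-to-end sketch (the checkpoint path is a placeholder, and voc, args and the SIZE_* constants are assumed to be exactly the ones from your training run):

```python
# Consolidated sketch of steps 1-7; voc, args and the SIZE_* constants must be
# the same objects/values that were used during training.
import torch
from thinc.api import PyTorchWrapper  # needs the thinc alpha that ships PyTorchWrapper
from neuralcoref.train.model import Model

path_to_checkpoint_file = "path/to/checkpoint"  # placeholder

model = Model(len(voc), SIZE_EMBEDDING, args.h1, args.h2, args.h3,
              SIZE_PAIR_IN, SIZE_SINGLE_IN)
model.load_state_dict(torch.load(path_to_checkpoint_file))
model.eval()  # deactivate dropout before exporting

pair_top_model = PyTorchWrapper(model.pair_top)
single_top_model = PyTorchWrapper(model.single_top)

with open("trained_pair_top_model", "wb") as f:
    f.write(pair_top_model.to_bytes())
with open("trained_single_top_model", "wb") as f:
    f.write(single_top_model.to_bytes())
```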

EricLe-dev commented 4 years ago

@chieter Thank you so much for your detailed write-up. In point 2 you mentioned that you initialized the model with `model = Model(len(voc), SIZE_EMBEDDING, args.h1, args.h2, args.h3, SIZE_PAIR_IN, SIZE_SINGLE_IN)`. That makes sense, and thanks to the fix you proposed I was able to get the Perl scorer working. However, I got this error: [screenshot of the size-mismatch error]

I guess the initialized model was not properly aligned with my input data. Here is my configuration: `args.h1 = 1000`, `args.h2 = 500`, `args.h3 = 500`, `SIZE_EMBEDDING = 320`.

The other SIZE parameters were left at their defaults in utils.py. Am I doing anything wrong here? I guess I miscalculated those values. How did you calculate SIZE_PAIR_IN and SIZE_SINGLE_IN? In addition, I wrote an email to @thomwolf and luckily got his reply; as he stated, @svlandeg is the main maintainer of this project.


chieter commented 4 years ago

The parameters SIZE_PAIR_IN and SIZE_SINGLE_IN are calculated according to the formulas in utils.py. They also depend on SIZE_EMBEDDING, so if you changed that, you have to recalculate them for your embedding size.
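
For reference, the derivation in train/utils.py looks roughly like this (default values shown; the two feature sizes match the numbers that come up later in this thread):

```python
# Derived input sizes, roughly as defined in train/utils.py (defaults shown).
SIZE_SPAN = 250        # span vector: 5 averaged embeddings (5 * 50 by default)
SIZE_WORD = 8          # words per mention fed as tuned embeddings
SIZE_EMBEDDING = 50    # word embedding size (320 in Eric's setup)
SIZE_FP = 70           # number of features for a pair of mentions
SIZE_FS = 24           # number of features for a single mention

SIZE_MENTION_EMBEDDING = SIZE_SPAN + SIZE_WORD * SIZE_EMBEDDING  # span + word vectors
SIZE_PAIR_IN = 2 * SIZE_MENTION_EMBEDDING + SIZE_FP    # input to the pairs network
SIZE_SINGLE_IN = SIZE_MENTION_EMBEDDING + SIZE_FS      # input to the single-mention network
```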

EricLe-dev commented 4 years ago

My SIZE_EMBEDDING was 320, so I modified it. For the other parameters, like SIZE_SPAN, SIZE_WORD, SIZE_FP, etc., I just used the default values. Should they be modified accordingly?

chieter commented 4 years ago

You can leave them as they are, but every parameter needed for model instantiation must have the value that was used during training.

EricLe-dev commented 4 years ago

According to the screenshot of the size-mismatch error, my input data size was [860x4184] while the model expected something like [674x1000]. After some debugging, I figured the first number (860) was the SIZE_SINGLE_IN, and for this reason I guessed the second number (4184) was the SIZE_PAIR_IN, so I directly initialized my model like this: `model = Model(len(voc), SIZE_EMBEDDING, args.h1, args.h2, args.h3, 860, 4814)`. I'm running the model again to see if it works.


EricLe-dev commented 4 years ago

@chieter It does not work; I guess I still miscalculated somewhere: [screenshot of my configuration]. As you can see, I modified SIZE_EMBEDDING = 320 and left the rest at their defaults. With that configuration, SIZE_PAIR_IN comes out to 5690 and SIZE_SINGLE_IN to 2834. My vocabulary length is 626711.

But still, I got this exception: [screenshot of the exception]

Can you please help me point out where I went wrong? Thank you so much!

chieter commented 4 years ago

During which step did you get this error? I'm asking because my error message looks a bit different if the sizes of the imported weights and instantiated model do not match up.

EricLe-dev commented 4 years ago

Yes, I got it during the training step. I run `python learn.py --train ./data`; it runs for a few minutes and then raises that exception. Here is the full stack trace (including some debugging prints that I put in manually):

```
..\torch\csrc\utils\tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
SCORING_SCRIPT: c:\users\administrator\desktop\neural_coref\neuralcoref\neuralcoref\train\scorer_wrapper.pl
Namespace(all_pairs_epoch=200, all_pairs_l2=1e-06, all_pairs_lr=0.0002, batchsize=10000, checkpoint_file=None, conll_eval_interval=10, conll_train_interval=20, costfl=0.4, costfn=0.8, costs={'FN': 0.8, 'FL': 0.4, 'WL': 1.0}, costwl=1.0, cuda=False, eval='C:\\Users\\Administrator\\Desktop\\neural_coref\\neuralcoref\\neuralcoref\\train/data//numpy/', evalkey='C:\\Users\\Administrator\\Desktop\\neural_coref\\neuralcoref\\neuralcoref\\train/data//key.txt', h1=1000, h2=500, h3=500, lazy=True, log_interval=10, min_lr=2e-08, numworkers=0, on_eval_decrease='nothing', patience=3, ranking_epoch=200, ranking_l2=1e-05, ranking_lr=2e-06, save_path='C:\\Users\\Administrator\\Desktop\\neural_coref\\neuralcoref\\neuralcoref\\train\\checkpoints\\Jun02_17-43-36_vm05_', seed=1111, startstage=None, startstep=None, top_pairs_epoch=200, top_pairs_l2=1e-05, top_pairs_lr=0.0002, train='./data/numpy/', trainkey='./data/key.txt', weights=None)
Training for 200 200 200 epochs
./data/numpy/
loading ./data/numpy/tuned_word_embeddings.npy torch.Size([626711, 320])
loading ./data/numpy/tuned_word_vocabulary.txt
Loading Dataset at ./data/numpy/
Reading mentions_features.npy, mentions_labels.npy, mentions_pairs_length.npy, mentions_pairs_start_index.npy, mentions_spans.npy, mentions_words.npy, pairs_ant_index.npy, pairs_features.npy, pairs_labels.npy, static_word_embeddings.npy, tuned_word_embeddings.npy,
Loading Dataset at C:\Users\Administrator\Desktop\neural_coref\neuralcoref\neuralcoref\train/data//numpy/
Reading mentions_features.npy, mentions_labels.npy, mentions_pairs_length.npy, mentions_pairs_start_index.npy, mentions_spans.npy, mentions_words.npy, pairs_ant_index.npy, pairs_features.npy, pairs_labels.npy, static_word_embeddings.npy, tuned_word_embeddings.npy,
Vocabulary LENGTH HERE: 626711
Build model
SIZE_EMBEDDING: 320
SIZE_PAIR_IN: 5690
SIZE_SINGLE_IN: 2834
Loading conll evaluator
Preparing batches
Dataset has: 14205 batches, 289061 mentions, 150790638 pairs
Reading conll_tokens.bin, doc.bin, locations.bin, spacy_lookup.bin, Done
Preparing batches
Dataset has: 14205 batches, 289061 mentions, 150790638 pairs
Reading conll_tokens.bin, doc.bin, locations.bin, spacy_lookup.bin, Done
Testing evaluator and getting first eval score
Test evaluator / print all mentions
Building test file
Construct test file
Writing in c:\users\administrator\desktop\neural_coref\neuralcoref\neuralcoref\train\test_mentions.txt
Computing score
Mention identification recall 0 <= Detected mentions 0.0 True mentions 0.0
Scores {'muc': (0, 0, 0), 'bcub': (0, 0, 0), 'ceafe': (0, 0, 0)}
F1_conll 0.0
Building test file
Build coreference clusters
LENGTH INPUT FORWARD: 3
SINGLE INPUT SHAPE: torch.Size([860, 4184])
Traceback (most recent call last):
  File "learn.py", line 572, in <module>
    run_model(args)
  File "learn.py", line 181, in run_model
    eval_evaluator.build_test_file()
  File "c:\users\administrator\desktop\neural_coref\neuralcoref\neuralcoref\train\evaluator.py", line 200, in build_test_file
    scores, max_i = self.get_max_score(sample_batched)
  File "c:\users\administrator\desktop\neural_coref\neuralcoref\neuralcoref\train\evaluator.py", line 169, in get_max_score
    scores = self.model(inputs, concat_axis=1)
  File "C:\Users\Administrator\Desktop\neural_coref\venv\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "c:\users\administrator\desktop\neural_coref\neuralcoref\neuralcoref\train\model.py", line 104, in forward
    single_scores = self.single_top(single_input)
  File "C:\Users\Administrator\Desktop\neural_coref\venv\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Administrator\Desktop\neural_coref\venv\lib\site-packages\torch\nn\modules\container.py", line 100, in forward
    input = module(input)
  File "C:\Users\Administrator\Desktop\neural_coref\venv\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Administrator\Desktop\neural_coref\venv\lib\site-packages\torch\nn\modules\linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\Users\Administrator\Desktop\neural_coref\venv\lib\site-packages\torch\nn\functional.py", line 1610, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [860 x 4184], m2: [2834 x 1000] at C:\w\b\windows\pytorch\aten\src\TH/generic/THTensorMath.cpp:41
```

chieter commented 4 years ago

OK, maybe I should have made this clearer: all my instructions were meant to be executed after training, when you already have a checkpoint file. For training my model, the only values I changed were SIZE_EMBEDDING and SIZE_SPAN in utils.py. According to this post, the span vector is built from 5 averaged word embeddings, so SIZE_SPAN should be 5 times the size of your embeddings. You may also need to change this assertion in dataset.py.
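
That matches the traceback exactly. A quick arithmetic check (using the default feature sizes of 24 single-mention and 70 pair features from utils.py):

```python
SIZE_EMBEDDING = 320
SIZE_SPAN = 5 * SIZE_EMBEDDING                             # 1600, i.e. 5x the embedding size

SIZE_MENTION_EMBEDDING = SIZE_SPAN + 8 * SIZE_EMBEDDING    # 1600 + 2560 = 4160
SIZE_SINGLE_IN = SIZE_MENTION_EMBEDDING + 24               # 4184 == width of m1 in the error

# With SIZE_SPAN left at its default of 250, the instantiated model instead expects
# 250 + 2560 + 24 = 2834 inputs == the 2834 in m2 of the error message.
```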

EricLe-dev commented 4 years ago

@chieter Thanks to your information, I have been able to run the training for my model. It just seems to take extremely long to finish one epoch.

EricLe-dev commented 4 years ago

Also, at this line I can see that the add_to_pipe function makes a call to load a file named vocab.txt; I guess it is our static_word_vocabulary.txt.

In addition, this line specifies the path the model is loaded from. I guess we can manually modify it and then do a `pip install -e .` to rebuild NeuralCoref from source.
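
For what it's worth, the call site itself would stay the documented entry point after such a rebuild; a sketch (the Dutch pipeline name is just an example, and this assumes the patched path points at the Dutch model):

```python
# Hypothetical usage after patching the model path and running `pip install -e .`
import spacy
import neuralcoref

nlp = spacy.load("nl_core_news_sm")   # example Dutch spaCy pipeline
neuralcoref.add_to_pipe(nlp)          # would now pick up the locally built model
doc = nlp("Jan zei dat hij morgen komt.")
print(doc._.coref_clusters)
```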

EricLe-dev commented 4 years ago

@chieter Hi mate, have you ever encountered this error during training? [screenshot of the error] I reduced the number of workers to 3 and used a smaller batch size, but I still got the error. I also activated the --lazy flag during training. Here is my environment: Windows 10, Python 3.8, 64 GB RAM, 8 GB GPU.

Thank you so much!

EricLe-dev commented 4 years ago

I fixed it by turning off pin_memory.
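
For anyone hitting the same thing: pin_memory is a standard torch DataLoader flag, so the change goes wherever learn.py builds its loaders. A sketch (the surrounding arguments are illustrative, not the exact ones in learn.py):

```python
from torch.utils.data import DataLoader

# Illustrative only: pass pin_memory=False where learn.py constructs its DataLoaders.
loader = DataLoader(dataset,            # the training dataset object used by learn.py
                    num_workers=3,
                    pin_memory=False,   # disabling this avoided the crash above
                    shuffle=False)
```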

EricLe-dev commented 4 years ago

@chieter Can you please tell me how to construct the key2row file in tuned_vectors? Thank you so much <3
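
(One possible approach, under the assumption that tuned_vectors is a serialized spaCy Vectors table: spaCy writes key2row itself when a Vectors object is saved to disk. A sketch with the spaCy 2.x API:)

```python
# Assumption: tuned_vectors is a spaCy Vectors directory, so rebuilding it through
# spaCy should produce key2row automatically (spaCy 2.x API).
import numpy as np
from spacy.strings import StringStore
from spacy.vectors import Vectors

data = np.load("tuned_word_embeddings.npy")
with open("tuned_word_vocabulary.txt", encoding="utf8") as f:
    words = f.read().split("\n")

strings = StringStore()
keys = [strings.add(w) for w in words]   # hash each word the way spaCy does

vectors = Vectors(data=data, keys=keys)
vectors.to_disk("tuned_vectors")         # writes the vectors table plus key2row
```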

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.