nyu-mll / jiant-v1-legacy

The jiant toolkit for general-purpose text understanding models
MIT License

[CLOSED] Unable to Use XLM as an input_module #1079


jeswan commented 4 years ago

Issue by tejasvi96 · Tuesday Apr 28, 2020 at 07:34 GMT
Originally opened as https://github.com/nyu-mll/jiant/issues/1079


I was trying to use the XLM model as an input_module for a language modelling task, and I am getting this error:

```
04/28 12:46:07 PM: Fatal error in main():
Traceback (most recent call last):
  File "main.py", line 16, in <module>
    main(sys.argv[1:])
  File "D:\My\jiant\jiant\__main__.py", line 588, in main
    phase="pretrain",
  File "D:\My\jiant\jiant\trainer.py", line 579, in train
    output_dict = self._forward(batch, task=task)
  File "D:\My\jiant\jiant\trainer.py", line 1043, in _forward
    model_out = self._model.forward(task, batch)
  File "D:\My\jiant\jiant\models.py", line 865, in forward
    out = self._single_sentence_forward(batch, task, predict)
  File "D:\My\jiant\jiant\models.py", line 937, in _single_sentence_forward
    word_embs_in_context, sent_mask = self.sent_encoder(batch["input1"], task)
  File "C:\Users\Tejasvi\Anaconda3\envs\jiant\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\My\jiant\jiant\modules\sentence_encoder.py", line 93, in forward
    word_embs_in_context = self._highway_layer(self._text_field_embedder(sent))
  File "C:\Users\Tejasvi\Anaconda3\envs\jiant\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\My\jiant\jiant\huggingface_transformers_interface\modules.py", line 719, in forward
    ids, input_mask = self.correct_sent_indexing(sent)
  File "D:\My\jiant\jiant\huggingface_transformers_interface\modules.py", line 107, in correct_sent_indexing
    ), "transformers cannot find correcpondingly tokenized input"
AssertionError: transformers cannot find correcpondingly tokenized input

Traceback (most recent call last):
  File "main.py", line 27, in <module>
    raise e  # re-raise exception, in case debugger is attached.
  File "main.py", line 16, in <module>
    main(sys.argv[1:])
  [... same stack as above ...]
AssertionError: transformers cannot find correcpondingly tokenized input
```

On printing the value of `sent` from here (GitHub link), with `self.tokenizer_required = XLM_en`, I get:

```python
sent = {'words': tensor([[  2,  19,   5, 119,   6,   1,  10, 137,  48,   3],
                         [  2,  18,   5, 387, 908, 162,   8,  87,  13,   3],
                         [  2,  74, 245, 141,   5,   1, 886,  44,   1,   3],
                         [  2,  20,   1,   1,   7,   1, 435,  24,  14,   3]])}
```
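For anyone parsing the error, here is a minimal sketch, paraphrased from the traceback above (not the exact jiant source), of the check that fires in `correct_sent_indexing`: the indexed batch must carry ids under the model's tokenizer name, e.g. `xlm_en`.

```python
import torch

# Sketch of the failing check (paraphrased, assumed shape): the batch dict
# produced during preprocessing must be keyed by the tokenizer's name.
def correct_sent_indexing(sent, tokenizer_required="xlm_en"):
    assert (
        tokenizer_required in sent
    ), "transformers cannot find correcpondingly tokenized input"  # message as logged
    return sent[tokenizer_required]

# The batch printed above only has the default word-level "words" key,
# so the assertion fires:
batch = {"words": torch.tensor([[2, 19, 5, 119, 6, 1, 10, 137, 48, 3]])}
try:
    correct_sent_indexing(batch)
except AssertionError as e:
    print(e)  # transformers cannot find correcpondingly tokenized input
```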

I have used these settings in the tutorial.conf file:

```
exp_name = jiant-demo
run_name = mtl-sst-mrpc

random_seed = 42

load_model = 0
reload_tasks = 0
reload_indexing = 0
reload_vocab = 0

pretrain_tasks = "sst"
target_tasks = "sts-b"
classifier = log_reg
classifier_hid_dim = 32
max_seq_len = 33
max_word_v_size = 8000
pair_attn = 0

input_module = xlm-mlm-en-2048
d_word = 300
```

What could be the possible reason for this?

jeswan commented 4 years ago

Comment by sleepinyourhat Wednesday Apr 29, 2020 at 14:43 GMT


@zphang, any guesses (since you were recently working on XLM-R)?

FWIW, it seems odd that the log is showing the tokenizer name "XLM_en"; the name that appears in our code is the lowercase "xlm_en":

https://github.com/nyu-mll/jiant/search?q=XLM_en&unscoped_q=XLM_en
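For context, a minimal sketch of the naming expectation (hypothetical structure; the real lookup lives in jiant's huggingface_transformers_interface):

```python
# Hypothetical sketch: jiant maps each transformers input_module string to a
# lowercase tokenizer name, and the indexed batch must use that exact key.
TOKENIZER_NAME = {
    "xlm-mlm-en-2048": "xlm_en",  # the name searched for in the linked query
}

def required_key(input_module: str) -> str:
    return TOKENIZER_NAME[input_module]

assert required_key("xlm-mlm-en-2048") == "xlm_en"  # lowercase, not "XLM_en"
```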

jeswan commented 4 years ago

Comment by tejasvi96 Wednesday Apr 29, 2020 at 15:08 GMT


Thanks for replying, team, and my apologies: the output was in lowercase after all, xlm_en.

jeswan commented 4 years ago

Comment by tejasvi96 Saturday May 09, 2020 at 06:14 GMT


Hi team, the issue got resolved; it was a configuration issue on my part.
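For anyone who lands here with the same assertion: a plausible culprit (an assumption on my part; the thread never says which setting was wrong) is that the reload flags in the config above were all 0, so jiant reused task data indexed before input_module was switched to XLM, leaving only word-level "words" ids in the cache. A minimal sketch of settings that force re-preprocessing:

```
// Sketch only (assumed fix, not confirmed in this thread): after changing
// input_module, force jiant to rebuild and re-index the cached task data
// so the batches carry xlm_en ids instead of word-level ids.
input_module = xlm-mlm-en-2048
reload_tasks = 1
reload_indexing = 1
reload_vocab = 1
```

All of these option names appear in the user's own tutorial.conf above; only the values differ.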