neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

Trying to train "ixa-ehu/ixambert-base-cased" model #48

Open jmurua14 opened 2 years ago

jmurua14 commented 2 years ago

Hi!

You have done a great job!! I have been training two different models: the one mentioned in the title ("ixa-ehu/ixambert-base-cased") and multibert_cased. With the multilingual BERT model I didn't have any problems during training; however, when I try to train the other model I get a shape mismatch related to the vocabulary size.

In the config file of the "ixa-ehu/ixambert-base-cased" model the vocabulary size is the following one:

```
08/18/2022 09:41:28 - INFO - awesome_align.configuration_utils - Model config BertConfig {
  "architectures": null,
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": null,
  "do_sample": false,
  "eos_token_ids": null,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": null,
  "repetition_penalty": 1.0,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 119099
}
```

When I begin the training I get this error:

```
Iteration:   0%|          | 0/40000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/datuak/virtualenvs/transformers/bin/awesome-train", line 8, in <module>
    sys.exit(main())
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/awesome_align/run_train.py", line 848, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/awesome_align/run_train.py", line 370, in train
    loss = model(inputs_src=inputs_src, labels_src=labels_src)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/awesome_align/modeling.py", line 660, in forward
    masked_lm_loss = loss_fct(prediction_scores_src.view(-1, self.config.vocab_size), labels_src.view(-1))
RuntimeError: shape '[-1, 119101]' is invalid for input of size 5716752
```

As you can see, the vocab_size has increased by 2, from 119099 to 119101. This is due to the CLS and SEP tokens; however, I don't know why I get this error. I have tried manually decreasing the vocab_size in the code, but that leads to other errors when I compute the alignments.
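As a side note, a minimal check along these lines (sketched with the standard Hugging Face `transformers` API rather than awesome-align's bundled modules, so treat it only as a diagnostic) should show where the two numbers come from, by comparing the tokenizer size, the configured `vocab_size`, and the actual embedding matrix:

```python
# Diagnostic sketch (plain `transformers`, not awesome-align's vendored code):
# compare the tokenizer size with the model's configured vocab_size and the
# number of rows in the input embedding matrix. A mismatch such as
# 119099 vs 119101 would explain the reshape error above.
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "ixa-ehu/ixambert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

print("len(tokenizer):    ", len(tokenizer))                                  # tokens the tokenizer can emit
print("config.vocab_size: ", model.config.vocab_size)                         # size used in the loss reshape
print("embedding rows:    ", model.get_input_embeddings().weight.shape[0])    # actual embedding matrix

# If extra special tokens made the tokenizer larger than the embedding matrix,
# resizing keeps the two in sync:
if len(tokenizer) != model.get_input_embeddings().weight.shape[0]:
    model.resize_token_embeddings(len(tokenizer))
```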

I leave you here the awesome-train command I have used for training:

```
CUDA_VISIBLE_DEVICES=1 awesome-train \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=ixa-ehu/ixambert-base-cased \
    --extraction 'softmax' \
    --do_train \
    --train_mlm \
    --train_tlm \
    --train_tlm_full \
    --train_so \
    --train_psi \
    --train_co \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 10000 \
    --max_steps 40000
```

Could you please help me solve this issue?

Thanks!

zdou0830 commented 2 years ago

Hi, right now the repo only supports mBERT and XLM-R. You can check this commit to see how to incorporate a new model.
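Roughly speaking, "incorporating a new model" means registering its config, tokenizer, and model classes so the training script can instantiate them by model type. The sketch below is only an illustration of that idea using standard `transformers` classes; it is not the actual code from the linked commit, and the `"ixambert"` entry and `load_backbone` helper are hypothetical:

```python
# Illustrative sketch only -- the real changes are in the commit linked above.
# A BERT-style aligner typically maps a model-type string to the classes it
# needs, so adding a backbone means adding one entry to this registry.
from transformers import BertConfig, BertForMaskedLM, BertTokenizer
from transformers import XLMRobertaConfig, XLMRobertaForMaskedLM, XLMRobertaTokenizer

MODEL_CLASSES = {
    "bert": (BertConfig, BertForMaskedLM, BertTokenizer),
    "xlmr": (XLMRobertaConfig, XLMRobertaForMaskedLM, XLMRobertaTokenizer),
    # a new entry would go here, e.g. "ixambert": (config, model, tokenizer classes)
}

def load_backbone(model_type: str, name_or_path: str):
    """Hypothetical helper: build config, tokenizer, and model for a registered type."""
    config_cls, model_cls, tok_cls = MODEL_CLASSES[model_type]
    config = config_cls.from_pretrained(name_or_path)
    tokenizer = tok_cls.from_pretrained(name_or_path)
    model = model_cls.from_pretrained(name_or_path, config=config)
    return config, tokenizer, model
```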