microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Question generation using multi-lingual minilm #170

Open Aniruddha-JU opened 4 years ago

Aniruddha-JU commented 4 years ago

I am using multilingual MiniLM for the QG task. I formatted the dataset the same way as the English data, but I am getting the following error at `if isinstance(example["src"], list)`: KeyError: 'src'

wenhui0924 commented 4 years ago

Hi @Aniruddha-JU,

Could you provide some samples of your processed data?

Thanks

Aniruddha-JU commented 4 years ago

This is my data 1) {"src": ["তারপর", "উক্ত", "৮জনের", "সাথে", "আরো", "কয়েকজনকে", "একত্র", "করে", "২০", "জনের", "একটি", "গেরিলা", "দল", "গঠন", "করে", "তাদের", "ভারতে", "বিশেষ", "ট্রেনিং", "দেয়া", "হয়।", "তারপর", "তারা", "দেশে", "আসলে", "তাদের", "সাথে", "কর্নেল", "ওসমানীর", "দেখা", "করানো", "হয়।", "তখন", "ওসমানী", "নৌ-কমান্ডো", "বাহিনী", "গঠনের", "সিদ্ধান্ত", "নেন।", "[SEP]", "ওসমানী"], "tgt": ["অপারেশন", "জ্যাকপটের", "কর্নেল", "কে", "ছিলেন", "?"]}

2) This is my command:
python run_seq2seq.py \
  --train_file ../../../bert_new/unilm_data_model/bengali_train/train.json \
  --output_dir ../../../bert_new/unilm_data_model/bert_save/ \
  --model_type minilm \
  --model_name_or_path ../../../bert_new/unilm_data_model/Multilingual-MiniLM-L12-H384/multilingual-minilm-l12-h384.bin \
  --tokenizer_name ../../../bert_new/unilm_data_model/Multilingual-MiniLM-L12-H384/vocab.txt \
  --config_name ../../../bert_new/unilm_data_model/Multilingual-MiniLM-L12-H384/multilingual-minilm-l12-h384-config.json \
  --max_source_seq_length 464 --max_target_seq_length 48 \
  --per_gpu_train_batch_size 16 --gradient_accumulation_steps 1 \
  --learning_rate 7e-5 --num_warmup_steps 500 --num_training_steps 32000 \
  --cache_dir ../../../bert_new/unilm_data_model/

wenhui0924 commented 4 years ago

It seems that the data format is correct, but I think the error comes from L138 in utils.py. Could you print more information at that point to help debug the problem?
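For example, here is a minimal standalone check of the train file (a sketch only: it assumes one JSON object per line, uses the path from your command, and verifies just the "src"/"tgt" keys that the s2s-ft loader looks up):

```python
import json

# Hypothetical standalone check of the training file; the path is the one
# passed to --train_file in the command above.
train_file = "../../../bert_new/unilm_data_model/bengali_train/train.json"

with open(train_file, encoding="utf-8") as f:
    for i, line in enumerate(f):
        line = line.strip()
        if not line:
            continue  # blank lines would also break json.loads
        example = json.loads(line)
        missing = [k for k in ("src", "tgt") if k not in example]
        if missing:
            print(f"line {i}: missing keys {missing}: {line[:80]}")
```

Any line reported here would explain the KeyError: 'src'.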

othman-zennaki commented 4 years ago

Hi @WenhuiWang0824, I am using multilingual MiniLM on FQuAD (the French Question Answering Dataset).

This is my command for fine-tuning:
python run_seq2seq.py \
  --train_file ${TRAIN_FILE} --output_dir ${OUTPUT_DIR} \
  --model_type minilm \
  --model_name_or_path ${MODEL_PATH}/multilingual-minilm-l12-h384.bin \
  --tokenizer_name ${MODEL_PATH}/vocab.txt \
  --config_name ${MODEL_PATH}/multilingual-minilm-l12-h384-config.json \
  --max_source_seq_length 464 --max_target_seq_length 48 \
  --per_gpu_train_batch_size 8 --gradient_accumulation_steps 1 \
  --learning_rate 7e-5 --num_warmup_steps 500 --num_training_steps 32000 \
  --cache_dir ${CACHE_DIR}

My question is: when I decode, the generated questions contain only [UNK] tokens (see the examples below). How can I deal with that?
[UNK] s ' appelle le [UNK] de la [UNK] [UNK] de [UNK] ?
[UNK] est le [UNK] de la [UNK] [UNK] de [UNK] ?
[UNK] est - ce que le [UNK] ' s [UNK] a [UNK] le 7 [UNK] 2016 ?
[UNK] est le nom de la [UNK] [UNK] par [UNK] ?

Thanks in advance, Othman
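A quick sanity check for whether the vocabulary passed to --tokenizer_name covers French at all is to tokenize a source sentence directly. The sketch below uses transformers' BertTokenizer as a stand-in for the BERT-style MiniLM tokenizer, a placeholder vocab path, and a made-up French sentence:

```python
from transformers import BertTokenizer

# Stand-in for the BERT-style MiniLM tokenizer selected by --model_type minilm;
# replace the path with the file passed to --tokenizer_name (${MODEL_PATH}/vocab.txt).
tok = BertTokenizer(vocab_file="path/to/vocab.txt", do_lower_case=True)

text = "Quand est-ce que le championnat a commencé ?"  # made-up French sample
pieces = tok.tokenize(text)
ids = tok.convert_tokens_to_ids(pieces)
unk_ratio = sum(i == tok.unk_token_id for i in ids) / max(len(ids), 1)

print(pieces)
print(f"[UNK] ratio: {unk_ratio:.0%}")
```

A high [UNK] ratio here means the vocabulary simply does not cover the language, so the model can only learn to emit [UNK] at decoding time.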

Neuronys commented 4 years ago

Hi @WenhuiWang0824 @othman-zennaki, I'm trying to do the same on FQuAD and I'm facing exactly the same problem when decoding, with tons of [UNK]. It looks like we are not using the right tokenizer and/or vocab file. Could you please give us some clues? Thanks in advance, Philippe

addf400 commented 4 years ago

@Neuronys @othman-zennaki Can you provide your decoding script? Note that appending '--do_lower_case' converts the text to lowercase and should only be used with an uncased model.

wenhui0924 commented 4 years ago

Hi @othman-zennaki and @Neuronys,

I agree with @Neuronys that this could be caused by the tokenizer and/or vocab file. Could you provide some processed samples of your training data and the final training loss?

Neuronys commented 4 years ago

Hi @addf400 @WenhuiWang0824 I did some investigation and it seems we already have a problem at training time.

As requested, here is a sample of the training data:
{"src": "La composition de la surface de Cérès est largement similaire, mais pas identique, à celle des astéroïdes de type C. Le spectre infrarouge de Cérès fait apparaître des matériaux hydratés qui indiquent la présence de quantités significatives d'eau à l'intérieur de l'objet. Parmi les autres possibles constituants de la surface, il y aurait de l'argile riche en fer (cronstedtite) et des composés carbonatés (dolomite et sidérite), minéraux courants dans les météorites chondrites carbonées. Les caractéristiques spectrales des carbonates et de l'argile sont généralement absentes du spectre des autres astéroïdes de type C. Cérès est parfois classifié comme un astéroïde de type G. [SEP] carbonates et de l'argile", "tgt": "Que possède Cérès dans sa composition que les autres astéroïdes de type C ne possèdent pas ?"}
{"src": "La composition de la surface de Cérès est largement similaire, mais pas identique, à celle des astéroïdes de type C. Le spectre infrarouge de Cérès fait apparaître des matériaux hydratés qui indiquent la présence de quantités significatives d'eau à l'intérieur de l'objet. Parmi les autres possibles constituants de la surface, il y aurait de l'argile riche en fer (cronstedtite) et des composés carbonatés (dolomite et sidérite), minéraux courants dans les météorites chondrites carbonées. Les caractéristiques spectrales des carbonates et de l'argile sont généralement absentes du spectre des autres astéroïdes de type C. Cérès est parfois classifié comme un astéroïde de type G. [SEP] astéroïde de type G", "tgt": "A quel groupe appartient Cérès ?"}
{"src": "La composition de la surface de Cérès est largement similaire, mais pas identique, à celle des astéroïdes de type C. Le spectre infrarouge de Cérès fait apparaître des matériaux hydratés qui indiquent la présence de quantités significatives d'eau à l'intérieur de l'objet. Parmi les autres possibles constituants de la surface, il y aurait de l'argile riche en fer (cronstedtite) et des composés carbonatés (dolomite et sidérite), minéraux courants dans les météorites chondrites carbonées. Les caractéristiques spectrales des carbonates et de l'argile sont généralement absentes du spectre des autres astéroïdes de type C. Cérès est parfois classifié comme un astéroïde de type G. [SEP] astéroïde de type G", "tgt": "A quel groupe appartient Cérès ?"}
{"
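For reference, here is a sketch of how such lines could be produced from a SQuAD-style file (FQuAD is assumed to follow the SQuAD v1.1 JSON layout; the field names below belong to that layout, and the input path is a placeholder):

```python
import json

# Hypothetical converter: SQuAD/FQuAD-style JSON -> one {"src", "tgt"} object
# per line, with the answer appended after [SEP] as in the samples above.
def convert(squad_json_path, out_jsonl_path):
    with open(squad_json_path, encoding="utf-8") as f:
        squad = json.load(f)
    with open(out_jsonl_path, "w", encoding="utf-8") as out:
        for article in squad["data"]:
            for paragraph in article["paragraphs"]:
                context = paragraph["context"]
                for qa in paragraph["qas"]:
                    if not qa["answers"]:
                        continue
                    answer = qa["answers"][0]["text"]
                    example = {"src": context + " [SEP] " + answer,
                               "tgt": qa["question"]}
                    out.write(json.dumps(example, ensure_ascii=False) + "\n")

convert("fquad_train.json", "data/questionGeneration_FSquad_train_sentence.jsonl")
```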

Here is my training command:
TRAIN_FILE=data/questionGeneration_FSquad_train_sentence.jsonl
MODEL_NAME=Multilingual-MiniLM-L12-H384/multilingual-minilm-l12-h384.bin
CONFIG_NAME=Multilingual-MiniLM-L12-H384/multilingual-minilm-l12-h384-config.json
VOCAB_NAME=Multilingual-MiniLM-L12-H384/vocab.txt
OUTPUT_DIR=checkpoints
CACHE_DIR=cache
CUDA_VISIBLE_DEVICES=0 python s2s-ft/run_seq2seq.py \
  --train_file ${TRAIN_FILE} --output_dir ${OUTPUT_DIR} \
  --model_type minilm \
  --model_name_or_path ${MODEL_NAME} \
  --tokenizer_name ${VOCAB_NAME} \
  --config_name ${CONFIG_NAME} \
  --do_lower_case --fp16 --fp16_opt_level O2 --max_source_seq_length 464 --max_target_seq_length 48 \
  --per_gpu_train_batch_size 6 --gradient_accumulation_steps 1 \
  --learning_rate 5e-5 --num_warmup_steps 500 --num_training_epochs 15 --cache_dir ${CACHE_DIR}

Here are the initial logs I got:
06/10/2020 13:20:28 - INFO - transformers.tokenization_utils - Model name 'Multilingual-MiniLM-L12-H384/vocab.txt' not found in model shortcut name list (minilm-l12-h384-uncased). Assuming 'Multilingual-MiniLM-L12-H384/vocab.txt' is a path, a model identifier, or url to a directory containing tokenizer files.
06/10/2020 13:20:28 - WARNING - transformers.tokenization_utils - Calling MinilmTokenizer.from_pretrained() with the path to a single file or url is deprecated
06/10/2020 13:20:28 - INFO - transformers.tokenization_utils - loading file Multilingual-MiniLM-L12-H384/vocab.txt
06/10/2020 13:20:28 - INFO - transformers.modeling_utils - loading weights file Multilingual-MiniLM-L12-H384/multilingual-minilm-l12-h384.bin
06/10/2020 13:20:29 - INFO - transformers.modeling_utils - Weights of BertForSequenceToSequence not initialized from pretrained model: ['cls.predictions.decoder_weight', 'crit_mask_lm_smoothed.one_hot']
06/10/2020 13:20:29 - INFO - s2s_ft.utils - Loading features from cached file checkpoints/cached_features_for_training.pt
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
06/10/2020 13:20:31 - INFO - __main__ - Check dataset:
06/10/2020 13:20:31 - INFO - __main__ - Instance-0
06/10/2020 13:20:31 - INFO - __main__ - Source tokens = [CLS] [UNK] avoir accepte la [UNK] de [UNK] [UNK] long [UNK] , [UNK] a du faire face en effet , par ordre [UNK] , a la [UNK] des [UNK] [UNK] ( a [UNK] de fin mai 1941 ) , au [UNK] en [UNK] [UNK] ( a [UNK] de mi - [UNK] 1941 ) et a l ' [UNK] en [UNK] des [UNK] - [UNK] . on [UNK] [UNK] [UNK] de son [UNK] [UNK] la [UNK] de [UNK] ( 1942 ) [UNK] « il [UNK] [UNK] au succes » et « [UNK] la [UNK] [UNK] plus [UNK] en [UNK] de la [UNK] a [UNK] de [UNK] [UNK] des [UNK] » . [SEP] [UNK] [SEP] [PAD] [PAD] [PAD] ... [PAD]
06/10/2020 13:20:31 - INFO - __main__ - Target tokens = qui a accepte la [UNK] de [UNK] [UNK] long [UNK] ? [SEP] [PAD] [PAD] [PAD] ... [PAD]
06/10/2020 13:20:31 - INFO - __main__ - Instance-1

A lot of [UNK] already :-( It looks like the wrong tokenizer is being used!

Decoding seems to have the same problem in the logs:
(.venv) (base) neuronys@neuronys-nlp:~/DEV-2020/uquiz-training$ bash uquiz-decoding.gpu.xminilm.sh
06/10/2020 13:15:59 - INFO - transformers.tokenization_utils - Model name 'Multilingual-MiniLM-L12-H384' not found in model shortcut name list (minilm-l12-h384-uncased). Assuming 'Multilingual-MiniLM-L12-H384' is a path, a model identifier, or url to a directory containing tokenizer files.
06/10/2020 13:15:59 - INFO - transformers.tokenization_utils - Didn't find file Multilingual-MiniLM-L12-H384/added_tokens.json. We won't load it.
06/10/2020 13:15:59 - INFO - transformers.tokenization_utils - Didn't find file Multilingual-MiniLM-L12-H384/special_tokens_map.json. We won't load it.
06/10/2020 13:15:59 - INFO - transformers.tokenization_utils - Didn't find file Multilingual-MiniLM-L12-H384/tokenizer_config.json. We won't load it.
06/10/2020 13:15:59 - INFO - transformers.tokenization_utils - loading file Multilingual-MiniLM-L12-H384/vocab.txt
06/10/2020 13:15:59 - INFO - transformers.tokenization_utils - loading file None
06/10/2020 13:15:59 - INFO - transformers.tokenization_utils - loading file None
06/10/2020 13:15:59 - INFO - transformers.tokenization_utils - loading file None
06/10/2020 13:15:59 - INFO - __main__ - Read decoding config from: models/fr.xminilm.QG.fquad.51827/config.json models/fr.xminilm.QG.fquad.51827
06/10/2020 13:15:59 - INFO - __main__ - ***** Recover model: models/fr.xminilm.QG.fquad.51827 *****

It seems that the model Multilingual-MiniLM-L12-H384 is not supported yet in the code! So no surprise that we get crap out ;-)

For information, my decode command is:
MODEL_PATH=models/fr.xminilm.QG.fquad.51827
SPLIT=dev
INPUT_JSON=data/questionGeneration_FSquad_valid_sentence.jsonl
CONFIG_NAME=models/fr.xminilm.QG.fquad.51827/config.json
VOCAB_NAME=Multilingual-MiniLM-L12-H384
export CUDA_VISIBLE_DEVICES=1
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python s2s-ft/decode_seq2seq.py \
  --model_type minilm --fp16 --tokenizer_name ${VOCAB_NAME} --input_file ${INPUT_JSON} --split $SPLIT --do_lower_case \
  --model_path ${MODEL_PATH} --config_path ${CONFIG_NAME} --max_seq_length 512 --max_tgt_length 48 --batch_size 12 --beam_size 5 \
  --length_penalty 0 --forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "." --need_score_traces

But, given the [UNK] already at training time, I'm pretty sure I'm doing something wrong when fine-tuning the model. Could you explain how you are using it on your side? Thanks, Philippe

othman-zennaki commented 4 years ago

@Neuronys I have observed the same thing. Even in English:
"src": ["the", "league", "held", "its", "first", "season", "in", "1992", "\u2013", "93", "and", "was", "originally", "composed", "of", "22", "clubs", ...]
Source tokens = [CLS] the [UNK] held its [UNK] [UNK] in 1992 \u2013 93 and was [UNK] [UNK] of 22 [UNK] ...

Neuronys commented 4 years ago

The modified run_xnli.py used to train cross-lingual Natural Language Inference contains this line:
"minilm": (BertConfig, BertForSequenceClassification, XLMRobertaTokenizer),
So using --model_type minilm there will use the desired/expected XLMRobertaTokenizer.

In run_seq2seq.py we have:
'minilm': (MinilmConfig, MinilmTokenizer),
'xlm-roberta': (XLMRobertaConfig, XLMRobertaTokenizer),
but nothing forces the mix of BERT and XLM-RoBERTa that the multilingual model expects. That may explain the wrong tokenization at training time, and hence at decoding time.
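For illustration, here is a minimal sketch of the kind of mapping that would pair the multilingual checkpoint with XLM-R's tokenizer (the "minilm-multilingual" key and the use of plain transformers classes are assumptions, not the shipped s2s-ft code):

```python
from transformers import BertConfig, BertTokenizer, XLMRobertaTokenizer

# Sketch only: plain transformers classes stand in for s2s-ft's MinilmConfig /
# MinilmTokenizer, and "minilm-multilingual" is a hypothetical key.
TOKENIZER_CLASSES = {
    # English MiniLM reuses a BERT-style wordpiece vocab:
    "minilm": (BertConfig, BertTokenizer),
    # Multilingual MiniLM was trained with XLM-R's SentencePiece vocab,
    # so it needs XLM-R's tokenizer instead:
    "minilm-multilingual": (BertConfig, XLMRobertaTokenizer),
}
```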

@WenhuiWang0824: is the repo up to date for handling cross-lingual MiniLM? Cheers, Philippe

wenhui0924 commented 4 years ago

Thanks @Neuronys! I agree with you that the examples are not processed properly because of the tokenizer.

@othman-zennaki, your command uses the tokenizer of the English MiniLM (the same tokenizer as BERT). But the multilingual MiniLM uses XLM-R's tokenizer, which is based on SentencePiece. You need to add the XLM-R tokenizer to the s2s-ft package to run multilingual MiniLM.
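To see the difference, a quick sketch (xlm-roberta-base is only a stand-in checkpoint that ships XLM-R's SentencePiece files; the French sentence and vocab path are taken from the messages above):

```python
from transformers import BertTokenizer, XLMRobertaTokenizer

text = "A quel groupe appartient Cérès ?"  # target from the FQuAD sample above

# BERT-style wordpiece tokenizer loaded from a vocab.txt, as --tokenizer_name does now;
# with a vocabulary that does not cover the language, pieces fall back to [UNK].
bert_tok = BertTokenizer(vocab_file="Multilingual-MiniLM-L12-H384/vocab.txt",
                         do_lower_case=True)
print(bert_tok.tokenize(text))

# XLM-R's SentencePiece tokenizer, which multilingual MiniLM was trained with;
# it segments unseen words into subword pieces instead of producing <unk>.
xlmr_tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
print(xlmr_tok.tokenize(text))
```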

The current s2s-ft package does not support generation tasks with multilingual MiniLM. We are working on it and will release it later.

Thanks

Neuronys commented 4 years ago

Thanks for your answer @WenhuiWang0824. When will you release an update?

@othman-zennaki: could you let me know if you update the s2s-ft package on your own? Thanks in advance, Philippe

Neuronys commented 4 years ago

Any update on this? Thanks

donglixp commented 4 years ago

@addf400 will help on this issue.

Neuronys commented 4 years ago

Thanks @donglixp. @addf400, could you please give me some clues to move forward on this subject? Cheers

Neuronys commented 3 years ago

@addf400 any update on it? Thanks

Neuronys commented 3 years ago

@addf400 @donglixp @WenhuiWang0824 Sorry for the harassment, but I'm struggling to make it work on a French dataset. Would you be so kind as to give me some clues? Should I move to the InfoXLM model instead? Where can I download it? Thanks in advance, Philippe

donglixp commented 3 years ago

The class XLMRobertaTokenizer (https://github.com/microsoft/unilm/blob/master/s2s-ft/run_seq2seq.py#L42) can be used at https://github.com/microsoft/unilm/blob/master/s2s-ft/run_seq2seq.py#L383, so that the tokenizer matches the multilingual MiniLM. @WenhuiWang0824 @addf400 Please correct me if I'm wrong.
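In other words, a sketch of the swap (assuming multilingual MiniLM shares XLM-R's SentencePiece vocabulary; xlm-roberta-base is a stand-in for a local directory containing sentencepiece.bpe.model):

```python
from transformers import XLMRobertaTokenizer

# Load XLM-R's SentencePiece tokenizer instead of a BERT-style vocab.txt.
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# Round-trip a French target from the thread: nothing should map to <unk>.
tgt = "Que possède Cérès dans sa composition que les autres astéroïdes de type C ne possèdent pas ?"
ids = tokenizer.encode(tgt)
assert tokenizer.unk_token_id not in ids
print(tokenizer.decode(ids, skip_special_tokens=True))
```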