microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Output of the decode seems perplexing #126

Closed johnyoonh closed 4 years ago

johnyoonh commented 4 years ago

When running decoding, I expected the output to resemble the target field in the input file. I created input.json as shown in the README.md in s2s-ft:

{"src": "Messages posted on social media claimed the user planned to `` kill as many people as possible ''", "tgt": "Threats to kill pupils in a shooting at a Blackpool school are being investigated by Lancashire police ."}
{"src": "Media playback is unsupported on your device", "tgt": "A slide running the entire length of one of the steepest city centre streets in Europe has been turned into a massive three-lane water adventure ."}
{"src": "Chris Erskine crossed low for Kris Doolan to tap home and give the Jags an early lead .", "tgt": "Partick Thistle will finish in the Scottish Premiership 's top six for the first time after beating Motherwell"}

I ran the following inside the unzipped pre-trained model folder so that the file names are as expected:

mv minilm-l12-h384-uncased-config.json config.json
mv minilm-l12-h384-uncased.bin pytorch_model.bin
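After the renames, the folder should look like this (folder name from my setup; the tokenizer vocab is fetched from the URL shown in the log below, so only these two files are needed locally):

$ ls MiniLM-L12-H384-uncased
config.json  pytorch_model.bin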

I also passed map_location='cpu' to the torch.load call at modeling_decoding.py:784, since I am decoding on CPU:

state_dict = torch.load(weights_path, map_location='cpu')

Then I ran the command below. Note that I removed --fp16 because I am running on CPU.

MODEL_PATH=MiniLM-L12-H384-uncased
VOCAB_PATH=MiniLM-L12-H384-uncased
SPLIT=validation
INPUT_JSON=input.json

export CUDA_VISIBLE_DEVICES=0
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python s2s-ft/decode_seq2seq.py \
  --model_type minilm --tokenizer_name minilm-l12-h384-uncased \
  --input_file ${INPUT_JSON} --split dev --do_lower_case \
  --model_path ${MODEL_PATH} --max_seq_length 512 --max_tgt_length 48 --batch_size 32 --beam_size 5 \
  --length_penalty 0 --forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "." --need_score_traces

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
05/02/2020 23:43:03 - INFO - transformers.tokenization_utils - loading file https://unilm.blob.core.windows.net/ckpt/minilm-l12-h384-uncased-vocab.txt from cache at /home/john/.cache/torch/transformers/c6a0d170b6fcc6d023a402d9c81e5526a82901ffed3eb6021fb0ec17cfd24711.0af242a3765cd96e2c6ad669a38c22d99d583824740a9a2b36fe3ed5a07d0503
05/02/2020 23:43:03 - INFO - main - Read decoding config from: MiniLM-L12-H384-uncased/config.json
MiniLM-L12-H384-uncased
05/02/2020 23:43:03 - INFO - main - Recover model: MiniLM-L12-H384-uncased
05/02/2020 23:43:03 - INFO - s2s_ft.modeling_decoding - Model config {
  "attention_probs_dropout_prob": 0.1,
  "ffn_type": 0,
  "fp32_embedding": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "label_smoothing": null,
  "max_position_embeddings": 512,
  "new_pos_ids": false,
  "no_segment_embedding": false,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_qkv": 0,
  "relax_projection": 0,
  "seg_emb": false,
  "source_type_id": 0,
  "target_type_id": 1,
  "task_idx": null,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

05/02/2020 23:43:04 - INFO - s2s_ft.utils - Creating features from dataset file at input.json
100%|██████████| 3/3 [00:00<00:00, 1357.53it/s]
0%|          | 0/1 [00:00<?, ?it/s]
05/02/2020 23:43:04 - INFO - s2s_ft.s2s_loader - Input src = [CLS] messages posted on social media claimed the user planned to kill as many people as possible ' ' [SEP]
05/02/2020 23:43:04 - INFO - s2s_ft.s2s_loader - Input src = [CLS] chris erskine crossed low for kris doo ##lan to tap home and give the ja ##gs an early lead . [SEP]
05/02/2020 23:43:04 - INFO - s2s_ft.s2s_loader - Input src = [CLS] media playback is un ##su ##pp ##orted on your device [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
05/02/2020 23:43:12 - INFO - main - 0 = scroll scroll scroll hedge scroll logic logic inclined logic thoughts sides table table table punt table punt punt table dive punt dive punt punt punt self self self financial self self investor self selfiti hedge self self sentiment hedge relations hedge friends self metro metroulul
05/02/2020 23:43:12 - INFO - main - 2 = operationsace sides sides sidesaceaceace sides header header headerhab header headerchai header headersus headerchaiitiitiaceace hedgeves financial self selface analysts admitted admitted admitteditihelace hedge self self self himself himself self metro self metro
05/02/2020 23:43:12 - INFO - main - 1 = scroll scroll shouldn plymouth scroll scrollgamgam scrollgam scroll scroll scroll briefs flourish ground volleyball should shouldnow logic logic tacticsaceaceace thought mirdehel profile profile counterhelhel profile bar portfolio counter portfolio portfolio portfolio def portfolio portfolio hedge portfolio portfolio


As the last few lines of the output show, the decoded text is gibberish that doesn't look anything like a sentence. Is there a parameter that I missed or specified erroneously? When I tried UniLM v1 a while back, the input format was different, but the output was decent, with some attributes of a summary.

donglixp commented 4 years ago

Hi @johnyoonh, please correct me if I misunderstand your issue: did you directly use the MiniLM checkpoint for decoding (without any fine-tuning)? The example JSON is just meant to show the format. You can go through the instructions at https://github.com/microsoft/unilm/tree/master/s2s-ft#example-xsum-with-minilm-l12-h384-uncased to try MiniLM for seq2seq learning.

johnyoonh commented 4 years ago

I directly used the model checkpoint minilm-l12-h384-uncased.bin without any fine-tuning. Before I start fine-tuning, I wanted to make sure the minilm was working as expected.

donglixp commented 4 years ago

> I directly used the model checkpoint minilm-l12-h384-uncased.bin without any fine-tuning. Before I start fine-tuning, I wanted to make sure the minilm was working as expected.

I see. The model is expected to be fine-tuned on the end task before decoding. The example on XSum can be found at https://github.com/microsoft/unilm/tree/master/s2s-ft#example-xsum-with-minilm-l12-h384-uncased .

johnyoonh commented 4 years ago

I fine-tuned on an AWS g4dn.12xlarge (4x T4) under conda for 17 hours.

I did not end up using Docker because I was facing some issues; instead, I installed the packages used by the pytorch/pytorch:1.2-cuda10.0-cudnn7-devel image:

conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
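For completeness, the fine-tuning run followed the XSum example in the s2s-ft README. This is a reconstruction from the training options echoed in the config dump below, so the exact invocation may have differed:

# hypothetical reconstruction of the fine-tuning command; flag values
# are taken from the training options recorded in first/config.json
python -m torch.distributed.launch --nproc_per_node=4 s2s-ft/run_seq2seq.py \
  --train_file ./xsum.train.json --output_dir first \
  --model_type minilm --model_name_or_path minilm-l12-h384-uncased \
  --do_lower_case --fp16 --fp16_opt_level O2 \
  --max_source_seq_length 464 --max_target_seq_length 48 \
  --per_gpu_train_batch_size 16 --gradient_accumulation_steps 1 \
  --learning_rate 1e-4 --num_warmup_steps 500 --num_training_steps 108000 \
  --save_steps 1500 --cache_dir ./cache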

This is the output I got:

MODEL_PATH=first
SPLIT=validation
INPUT_JSON=./xsum.validation.json
export CUDA_VISIBLE_DEVICES=0
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4

python s2s-ft/decode_seq2seq.py \
  --fp16 --model_type minilm --tokenizer_name minilm-l12-h384-uncased --input_file ${INPUT_JSON} --split $SPLIT --do_lower_case \
  --model_path ${MODEL_PATH} --max_seq_length 512 --max_tgt_length 48 --batch_size 32 --beam_size 5 \
  --length_penalty 0 --forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "."
05/06/2020 14:48:32 - INFO - transformers.tokenization_utils -   loading file https://unilm.blob.core.windows.net/ckpt/minilm-l12-h384-uncased-vocab.txt from cache at /home/ubuntu/.cache/torch/transformers/c6a0d170b6fcc6d023a402d9c81e5526a82901ffed3eb6021fb0ec17cfd24711.0af242a3765cd96e2c6ad669a38c22d99d583824740a9a2b36fe3ed5a07d0503
05/06/2020 14:48:32 - INFO - __main__ -   Read decoding config from: first/config.json
first
05/06/2020 14:48:32 - INFO - __main__ -   ***** Recover model: first *****
05/06/2020 14:48:32 - INFO - s2s_ft.modeling_decoding -   Model config {
  "adam_epsilon": 1e-08,
  "attention_probs_dropout_prob": 0.1,
  "cache_dir": "./cache",
  "cached_train_features_file": null,
  "config_name": null,
  "do_lower_case": true,
  "ffn_type": 0,
  "fp16": true,
  "fp16_opt_level": "O2",
  "fp32_embedding": false,
  "gradient_accumulation_steps": 1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "keep_prob": 0.1,
  "label_smoothing": 0.1,
  "learning_rate": 0.0001,
  "local_rank": 1,
  "log_dir": null,
  "logging_steps": 500,
  "max_grad_norm": 1.0,
  "max_position_embeddings": 512,
  "max_source_seq_length": 464,
  "max_target_seq_length": 48,
  "model_name_or_path": "minilm-l12-h384-uncased",
  "model_type": "minilm",
  "new_pos_ids": false,
  "no_cuda": false,
  "no_segment_embedding": false,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_qkv": 0,
  "num_training_epochs": 10,
  "num_training_steps": 108000,
  "num_warmup_steps": 500,
  "output_dir": "first",
  "per_gpu_train_batch_size": 16,
  "random_prob": 0.1,
  "relax_projection": 0,
  "save_steps": 1500,
  "seed": 42,
  "seg_emb": false,
  "server_ip": "",
  "server_port": "",
  "source_type_id": 0,
  "target_type_id": 1,
  "task_idx": null,
  "tokenizer_name": null,
  "train_file": "./xsum.train.json",
  "type_vocab_size": 2,
  "vocab_size": -1,
  "weight_decay": 0.01
}

Traceback (most recent call last):
  File "s2s-ft/decode_seq2seq.py", line 296, in <module>
    main()
  File "s2s-ft/decode_seq2seq.py", line 195, in main
    max_position_embeddings=args.max_seq_length, pos_shift=args.pos_shift,
  File "/home/ubuntu/unilm/s2s-ft/s2s_ft/modeling_decoding.py", line 784, in from_pretrained
    model = cls(config, *inputs, **kwargs)
  File "/home/ubuntu/unilm/s2s-ft/s2s_ft/modeling_decoding.py", line 1359, in __init__
    self.bert = BertModelIncr(config)
  File "/home/ubuntu/unilm/s2s-ft/s2s_ft/modeling_decoding.py", line 924, in __init__
    super(BertModelIncr, self).__init__(config)
  File "/home/ubuntu/unilm/s2s-ft/s2s_ft/modeling_decoding.py", line 865, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/ubuntu/unilm/s2s-ft/s2s_ft/modeling_decoding.py", line 261, in __init__
    config.vocab_size, config.hidden_size)
  File "/home/ubuntu/anaconda3/envs/unilm/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 97, in __init__
    self.weight = Parameter(torch.Tensor(num_embeddings, embedding_dim))
RuntimeError: Trying to create tensor with negative dimension -1: [-1, 768]
[1]    14852 exit 1     python s2s-ft/decode_seq2seq.py --fp16 --model_type minilm --tokenizer_name
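The negative dimension traces back to the "vocab_size": -1 field in the config dump above, which gets passed straight into the embedding constructor. A minimal reproduction of the failing call:

python -c "import torch; torch.nn.Embedding(-1, 768)"
# RuntimeError: Trying to create tensor with negative dimension -1: [-1, 768]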

$ ls first
cached_features_for_training.pt ckpt-15000 ckpt-30000 ckpt-45000 ckpt-60000 ckpt-75000 ckpt-90000
ckpt-100500 ckpt-16500 ckpt-31500 ckpt-46500 ckpt-61500 ckpt-76500 ckpt-91500
ckpt-102000 ckpt-18000 ckpt-33000 ckpt-48000 ckpt-63000 ckpt-78000 ckpt-93000
ckpt-103500 ckpt-19500 ckpt-34500 ckpt-49500 ckpt-64500 ckpt-79500 ckpt-94500
ckpt-10500 ckpt-21000 ckpt-36000 ckpt-51000 ckpt-66000 ckpt-81000 ckpt-96000
ckpt-105000 ckpt-22500 ckpt-37500 ckpt-52500 ckpt-67500 ckpt-82500 ckpt-97500
ckpt-106500 ckpt-24000 ckpt-39000 ckpt-54000 ckpt-69000 ckpt-84000 ckpt-99000
ckpt-108000 ckpt-25500 ckpt-40500 ckpt-55500 ckpt-70500 ckpt-85500 config.json
ckpt-12000 ckpt-27000 ckpt-42000 ckpt-57000 ckpt-72000 ckpt-87000 train_opt.json
ckpt-13500 ckpt-28500 ckpt-43500 ckpt-58500 ckpt-73500 ckpt-88500
ckpt-1500 ckpt-3000 ckpt-4500 ckpt-6000 ckpt-7500 ckpt-9000

wenhui0924 commented 4 years ago

Hi @johnyoonh,

It seems that MODEL_PATH is not set correctly: it points at the training output directory, whose config.json holds the training options rather than a model config. Please change MODEL_PATH to first/ckpt-108000.
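For example (a sketch, assuming you want to decode with the final checkpoint; each ckpt-* directory contains its own config.json with the real model settings):

MODEL_PATH=first/ckpt-108000
python s2s-ft/decode_seq2seq.py \
  --fp16 --model_type minilm --tokenizer_name minilm-l12-h384-uncased \
  --input_file ${INPUT_JSON} --split $SPLIT --do_lower_case \
  --model_path ${MODEL_PATH} --max_seq_length 512 --max_tgt_length 48 --batch_size 32 --beam_size 5 \
  --length_penalty 0 --forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "."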

Thanks