tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Summarization bad inference #1537

Open TMA95 opened 5 years ago

TMA95 commented 5 years ago

Description

I am trying to create a model that is able to summarize, using the CNN/Daily Mail dataset. Training looks okay; loss gradually goes down. Internal validation metrics are (too) high (see graphs), but during inference the decodings and evaluation metrics are bad.

I've tried using test data during internal validation, which also gave really high ROUGE scores. The problem is therefore not in the data itself, and the model is not overfit either. So why are the results so bad during inference?

Looking at the loss curve I would say the model is learning something, which the internal validation metrics also suggest. But during inference this is simply not the case. Could it be a bug in the t2t-decoder?

Model: transformer; hparams: transformer_prepend; batch_size: 4096; prepend_mode: none. I trained on two 1080 Tis for 1M steps.

Environment information

OS: Ubuntu 18.04

tensor2tensor==1.12.0, tensorflow==1.12.0, Python 3.6.8

For bugs: reproduction and error logs

[Graphs attached: loss_cnndm, eval_metrics_cnndm]

ROUGE during inference: Rouge-1: 20.11 / Rouge-2: 4.29 / Rouge-L: 18.23
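For context, ROUGE-N is essentially n-gram overlap between the decoded summary and the reference. The scores above come from the evaluation pipeline, but the F1 variant can be sketched in a few lines of pure Python (illustrative only, not the official scorer):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """F1 of n-gram overlap between candidate and reference token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # multiset intersection of n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 2 of 3 unigrams match in both directions -> F1 = 2/3
print(rouge_n("the cat sat".split(), "the cat ran".split()))
```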

agemagician commented 5 years ago

Hi @TMA95 ,

I am actually having the same problem with another dataset.

I think there is a bug in the t2t-decoder. Check: https://github.com/tensorflow/tensor2tensor/issues/1408

@lukaszkaiser This is exactly the same issue that I have. Hopefully, you will fix this issue soon.

agemagician commented 5 years ago

Thanks for the update, I will use OpenNMT as you recommended.

TMA95 commented 5 years ago

For me it does work, but the performance is just really bad, even after a long training time. I switched to OpenNMT, which works fine for me.

Good luck!

On Wed, 17 Apr 2019 at 10:06, Ahmed Elnaggar <notifications@github.com> wrote:

Apparently, the problem is teacher forcing: during validation the model uses the ground truth to predict the next step, which gives misleadingly high performance metrics.

Unfortunately, "eval_run_autoregressive", which should give correct values, doesn't work with the transformer model:

#613: https://github.com/tensorflow/tensor2tensor/issues/613


christian-git-md commented 5 years ago

Not sure this was the issue here, but keep in mind that if you give the wrong path to your output directory, the weights will be initialized randomly; this caused some confusion for me.

domyounglee commented 4 years ago

I don't think there is a problem in the decoder. It worked fine for me with these parameters:

```shell
PROBLEM=summarize_cnn_dailymail32k
MODEL=transformer
HPARAMS=transformer_prepend

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --worker_gpu=4 \
  --hparams='batch_size=4096,prepend_mode=none,max_input_seq_length=512,max_target_seq_length=100' \
  --train_steps=200000 \
  --keep_checkpoint_max=10 \
  --local_eval_frequency=5000 \
  --eval_steps=30 \
  --eval_run_autoregressive=False
```

jmalvinez commented 4 years ago

@domyounglee I followed your suggestion, but the output I got isn't good. It is no longer repetitive or copying from the input, but the generated summary seems unrelated to both the input and the target. Below is an example.

INPUT: Department invited schools to submit applications for extra security measures on their premises. . 'The Australian Government has committed $18 million over three years for the new Schools Security Programme. It is designed to provide funding to government and non–government schools and preschools assessed as being at risk of attack, harassment or violence stemming from racial or religious intolerance,' the Attorney-General's office said. . 'The programme will provide non-recurrent funding for security infrastructure, such as closed-circuit television (CCTV) systems, lighting and fences, and for the cost of employing security guards,' it added. . Schools across Australia requested extra security assistance. Of those approved, 29 are in NSW and a large number of them are in Western Sydney. Victoria had the next-highest number of schools approved, with 15. Half the schools are Islamic or Jewish. The funding announcement comes after youths allegedly shouted out threats to kill Christians and waved an Islamic State flag as they drove past a Maronite church and school in Sydney's west last year, terrifying members of the community. NSW police investigated the incident at the Our Lady of Lebanon church and the Maronite College of the Holy Family in Harris Park in Sydney's west in September. It is alleged a youth in a passing red hatchback yelled out: 'We're going to kill all you Christians.' Also in September last year, a Jewish school in Sydney’s eastern suburbs constructed a bomb proof wall. Federal Justice Minister Michael Keenan announced 54 schools have been approved for help

OUTPUT: More than 1,000 people have been killed in Afghanistan since 2009. More than 1,000 people have been killed in the past year.

TARGET: Federal government to spend $18m over three years on school security. 54 schools were selected after submitting requests in September. Half the schools are Islamic and Jewish; 29 are in NSW and 15 in Victoria . The schools are 'at risk of attack, harassment or violence stemming from racial or religious intolerance' CCTV systems, lighting, fences and security guards will be funded.

The loss calculated during training also never seems to drop below 5. Any ideas what I could be doing wrong? Thanks!

domyounglee commented 4 years ago

@jmalvinez
I ran it on 4 GPUs (GTX 1080 Ti) with a learning rate of 0.1. Make sure the batch size is the number of tokens per batch per GPU, which differs from the conventional meaning of "batch size". Could you tell me which dataset you trained on, and its size?
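To make the batch-size point concrete: since the batch is specified in tokens per GPU, the number of examples per step depends on sequence length. A small back-of-the-envelope sketch (the 512-token average is an illustrative assumption):

```python
def examples_per_batch(tokens_per_gpu, num_gpus, avg_seq_len):
    """Approximate examples per global step when the batch is
    specified in tokens (the T2T convention), not in examples."""
    return (tokens_per_gpu * num_gpus) // avg_seq_len

# batch_size=4096 tokens per GPU on 4 GPUs, inputs truncated to ~512 tokens:
print(examples_per_batch(4096, 4, 512))  # 32 examples per step
```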