google-research / lasertagger

MSR Summarization Example #11

Open · mnc29 opened this issue 4 years ago

mnc29 commented 4 years ago

Can you provide a detailed example for the MSR dataset? I run into the same problem as described here: https://github.com/google-research/lasertagger/issues/5#issuecomment-582414346

I don't understand how to tokenize this dataset. Can it be tokenized with BERT's FullTokenizer?

Please help if you can.

ekQ commented 4 years ago

If you use BERT FullTokenizer, you should get reasonable predictions, but I'm not sure if there's a detokenizer for FullTokenizer (in particular for the punctuation). This means that when you compute e.g. ROUGE scores, the numbers will not be fully comparable to the numbers we report since your n-grams would consist of smaller units.
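For illustration, here is a minimal sketch of what the FullTokenizer (from the google-research/bert repo, available via the bert-tensorflow package) produces; the vocab path is a placeholder for whichever pretrained checkpoint you downloaded:

```python
from bert import tokenization  # pip install bert-tensorflow

tokenizer = tokenization.FullTokenizer(
    vocab_file="/path/to/uncased_L-12_H-768_A-12/vocab.txt",  # placeholder path
    do_lower_case=True)

tokens = tokenizer.tokenize("Dandelions, surprisingly, are edible.")
print(tokens)
# Example output (exact splits depend on the vocab):
# ['dan', '##del', '##ions', ',', 'surprisingly', ',', 'are', 'ed', '##ible', '.']
# WordPiece splits rare words into sub-tokens and detaches punctuation, so ROUGE
# n-grams over these units are smaller than word-level ones, and there is no
# built-in way to losslessly reassemble the original string.
```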

Instead, you could try e.g. the Moses tokenizer and detokenizer. So first split all training, development, and test sources and targets into tokens, and then use LaserTagger to run vocabulary optimization and the other steps listed here. Finally, detokenize the predicted token sequences and compute the metrics against the original untokenized test targets.
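A rough sketch of that workflow using the sacremoses package (a Python port of the Moses tokenizer and detokenizer); the file names below are just placeholders:

```python
from sacremoses import MosesTokenizer, MosesDetokenizer

mt = MosesTokenizer(lang="en")
md = MosesDetokenizer(lang="en")

# 1) Pre-tokenize every source and target file (train/dev/test) before running
#    phrase_vocabulary_optimization.py, preprocess_main.py, training and prediction.
with open("train.source") as fin, open("train.source.tok", "w") as fout:
    for line in fin:
        # escape=False keeps quotes/ampersands as-is instead of XML-escaping them.
        fout.write(mt.tokenize(line.strip(), return_str=True, escape=False) + "\n")

# 2) After prediction, detokenize the outputs and compute the metrics against
#    the original, untokenized test targets.
with open("pred.tok.txt") as fin, open("pred.detok.txt", "w") as fout:
    for line in fin:
        fout.write(md.detokenize(line.strip().split()) + "\n")
```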

wentinghome commented 4 years ago

I run into the same problem. I'm wondering if there is a BERT-compatible tokenization method (WordPiece) that can both tokenize and detokenize the source files, so that the tokens match the vocab provided by the pretrained checkpoint.

I'm also wondering: if you use another method such as Moses, is it possible that the generated tokens are not in the same format as WordPiece? If they are not in the same format, the pretrained BERT weights cannot be fully utilized.

Any suggestions and comments are highly appreciated. Thank you.

ekQ commented 4 years ago

Hi Wenting. These are valid concerns, but what we've noticed in practice is that small discrepancies in the tokenization are not a problem, especially if you have a decent amount of data to finetune your model. For example, the DiscoFuse data was tokenized using a Google Cloud NLP model, which handles punctuation slightly differently than the BERT FullTokenizer, yet the model still works well on DiscoFuse (even in low-resource settings).

So as long as you use a tokenizer that only separates punctuation (and is able to do detokenization after producing the predictions), you should be fine.
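To make that concrete, here is a small sketch (the vocab path is a placeholder) showing that word-level tokens from a punctuation-separating tokenizer like Moses still break down into WordPieces from the pretrained BERT vocab, so the pretrained weights are still put to use, while the word-level output can be detokenized for evaluation:

```python
from bert import tokenization          # pip install bert-tensorflow
from sacremoses import MosesTokenizer, MosesDetokenizer

bert_tok = tokenization.FullTokenizer(
    vocab_file="/path/to/uncased_L-12_H-768_A-12/vocab.txt",  # placeholder path
    do_lower_case=True)
mt = MosesTokenizer(lang="en")
md = MosesDetokenizer(lang="en")

sentence = "LaserTagger tolerates small tokenization mismatches, it seems."
word_tokens = mt.tokenize(sentence, escape=False)   # word-level tokens, punctuation split off
wordpieces = [wp for tok in word_tokens for wp in bert_tok.tokenize(tok)]

print(word_tokens)   # punctuation appears as separate word-level tokens
print(wordpieces)    # every piece is an entry of the pretrained BERT vocab

# Predictions remain word-level tokens, so they can be put back together:
print(md.detokenize(word_tokens))
```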