mit-han-lab / lite-transformer

[ICLR 2020] Lite Transformer with Long-Short Range Attention
https://arxiv.org/abs/2004.11886

Missing Data Preparation section for the CNN / DailyMail dataset #28

Closed. cronopioelectronico closed this issue 3 years ago

cronopioelectronico commented 3 years ago

Hi, the README has instructions for preparing the other datasets, but none for the CNN / DailyMail dataset. Since you provide a checkpoint for this case, it would be great if you could include the data preparation instructions too. Thanks.

Michaelvll commented 3 years ago

Thank you for asking! For convenience, we downloaded the CNN/DM dataset using tensorflow/tensor2tensor. Then please try the commands below to prepare the binary dataset.

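#!/bin/bash

# Directory holding the raw text exported via tensor2tensor.
TEXT=data/cnn_daily_t2t
# Suffix used in the file names (cnndm.<split>.$TRUNC.<lang>).
TRUNC=1000
# fairseq-preprocess appends ".source" / ".target" to each prefix below.
fairseq-preprocess --source-lang source --target-lang target \
    --trainpref "$TEXT/cnndm.train.$TRUNC" --validpref "$TEXT/cnndm.dev.$TRUNC" --testpref "$TEXT/cnndm.test.$TRUNC" \
    --destdir "data/binary/cnndm_t2t_30k_$TRUNC" \
    --workers 20 --joined-dictionary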
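
Before running the preprocessing step, it may help to confirm that the raw text files are where the command expects them. The sketch below is not from the repository. It assumes the file layout implied by the prefixes above, where fairseq-preprocess appends the language name to each prefix (for example cnndm.train.1000.source). The helper name missing_inputs is hypothetical.

```shell
#!/bin/bash
# Sketch: list the raw text files that the preprocessing command would read.
# Assumption: files are named cnndm.<split>.<trunc>.<lang>, as implied by the
# --trainpref/--validpref/--testpref prefixes and the source/target languages.

# missing_inputs is a hypothetical helper, not part of fairseq.
# It prints one path per missing file and prints nothing if all are present.
missing_inputs() {
    local text_dir=$1 trunc=$2 split lang path
    for split in train dev test; do
        for lang in source target; do
            path="$text_dir/cnndm.$split.$trunc.$lang"
            [ -f "$path" ] || printf '%s\n' "$path"
        done
    done
}

# Defaults mirror the values used in the script above.
TEXT=${TEXT:-data/cnn_daily_t2t}
TRUNC=${TRUNC:-1000}

missing=$(missing_inputs "$TEXT" "$TRUNC")
if [ -n "$missing" ]; then
    printf 'Missing input files:\n%s\n' "$missing" >&2
else
    echo "All six input files found under $TEXT"
fi
```

If any path is reported as missing, the tensor2tensor export or the file naming likely differs from what is assumed here, and the prefixes passed to fairseq-preprocess would need to be adjusted accordingly.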