Closed astariul closed 3 years ago
When downloading the data, it appears the CNN_DM dataset is already tokenized (with the StanfordNLP tokenizer).

Shouldn't the data be available in raw (untokenized) format, since each architecture has its own way of tokenizing data? For example, in the HuggingFace datasets package, the CNN_DM data is not tokenized.

We follow ProphetNet, BART, and UniLM in preprocessing CNN_DM with the StanfordNLP tokenizer. One can further tokenize this dataset with other tokenizers such as BERT-cased, BERT-uncased, or the GPT tokenizer. We appreciate your suggestion and will consider adding a raw-format CNN_DM to the released benchmark.
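To make the concern concrete, here is a minimal, hypothetical sketch (not the actual Stanford tokenizer, just an approximation of a couple of PTB-style rules) showing that word-level pre-tokenization is lossy: joining the tokens back with spaces does not reproduce the raw text, so a model whose subword tokenizer expects raw input sees a slightly different string.

```python
import re

def ptb_style_tokenize(text: str) -> list[str]:
    """Rough approximation of a few PTB/Stanford tokenizer rules
    (hypothetical sketch, not the real tokenizer)."""
    text = re.sub(r"([.,!?;:])", r" \1 ", text)  # split off punctuation
    text = re.sub(r"n't\b", " n't", text)        # don't -> do n't
    text = text.replace('"', ' " ')              # detach quotes
    return text.split()

raw = 'He said, "I don\'t know."'
tokens = ptb_style_tokenize(raw)
rejoined = " ".join(tokens)
# The space-joined tokens differ from the raw string, so the
# pre-tokenized release cannot be treated as raw text.
assert rejoined != raw
```

This is why releasing the raw (untokenized) articles alongside the pre-tokenized ones is useful: each model can then apply its own tokenizer to unmodified text.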