microsoft / glge

Code for ACL2021 paper: "GLGE: A New General Language Generation Evaluation Benchmark"

❓ Why is the CNN_DM data already tokenized? #6

Closed astariul closed 3 years ago

astariul commented 3 years ago

When downloading the data, it appears the CNN_DM dataset is already tokenized (by StanfordNLP tokenizer).

Shouldn't the data be available in raw (untokenized) format, since each architecture has its own way of tokenizing data?

For example, in the HuggingFace datasets package, CNN_DM data is not tokenized.

dayihengliu commented 3 years ago

We follow ProphetNet, BART, and UniLM in preprocessing CNN_DM with the StanfordNLP tokenizer. One can then apply other tokenizers (e.g., BERT-cased, BERT-uncased, or the GPT tokenizer) on top of this dataset. We appreciate your suggestion and will consider adding a raw-format CNN_DM to the released benchmark.
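To illustrate the concern in the question: StanfordNLP-style preprocessing splits punctuation into separate whitespace-delimited tokens, and that transformation is lossy, so recovering raw text for a different subword tokenizer requires a heuristic detokenizer. The sketch below is purely illustrative (the sample sentence and `naive_detokenize` helper are hypothetical, not part of the GLGE pipeline):

```python
import re

# Hypothetical sample in StanfordNLP-style pre-tokenized form:
# punctuation is split off and tokens are space-separated.
pretokenized = "The U.S. economy , however , grew 2.3 % last year ."

def naive_detokenize(text: str) -> str:
    """Rough inverse of whitespace tokenization: reattach common
    punctuation to the preceding token. Lossy in general -- the
    original spacing cannot always be recovered exactly."""
    return re.sub(r"\s+([,.;:!?%])", r"\1", text)

raw_like = naive_detokenize(pretokenized)
print(raw_like)  # The U.S. economy, however, grew 2.3% last year.
```

Because such heuristics are imperfect, distributing the raw (untokenized) text alongside the pre-tokenized version lets each model apply its own tokenizer directly.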