Prepare WMT'17 Datasets

dennybritz commented 7 years ago

We should prepare datasets for all WMT'17 language pairs. This is also a chance to try out google/sentencepiece as a preprocessor.

Each dataset should come in different configurations, i.e. with different vocabulary sizes, and should also have a character-level version.

Together with the raw data files, we also need the script that was used to process them.
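A rough sketch of how the configurations above could be generated with SentencePiece, one model per vocabulary size plus a character-level one. The list of vocabulary sizes, the `model_prefix` naming scheme and the `corpus.tc.<lang>` file names are assumptions for illustration, not something decided in this thread. How `--vocab_size` should be set for `--model_type=char` is also left open here and may need adjusting.

```bash
#!/usr/bin/env bash
# Sketch only: VOCAB_SIZES and the output naming scheme are hypothetical.
CORPUS_DIR="${CORPUS_DIR:-$(pwd)}"
SOURCE_LANG="${SOURCE_LANG:-en}"
TARGET_LANG="${TARGET_LANG:-de}"
VOCAB_SIZES="${VOCAB_SIZES:-8000 16000 32000}"

# Print the model prefix for a model type and an optional vocabulary size,
# e.g. "<corpus dir>/bpe.32000" or "<corpus dir>/char".
model_prefix() {
  local model_type="$1" vocab_size="${2:-}"
  if [ -n "${vocab_size}" ]; then
    echo "${CORPUS_DIR}/${model_type}.${vocab_size}"
  else
    echo "${CORPUS_DIR}/${model_type}"
  fi
}

# Train one joint BPE model per vocabulary size, then a character-level model.
train_all() {
  local input="${CORPUS_DIR}/corpus.tc.${SOURCE_LANG},${CORPUS_DIR}/corpus.tc.${TARGET_LANG}"
  local size
  for size in ${VOCAB_SIZES}; do
    spm_train --input="${input}" \
      --model_prefix="$(model_prefix bpe "${size}")" \
      --vocab_size="${size}" \
      --model_type=bpe
  done
  # Character-level configuration (vocabulary size left at the tool default).
  spm_train --input="${input}" \
    --model_prefix="$(model_prefix char)" \
    --model_type=char
}

# Only train when spm_train is installed and the source corpus is present.
if command -v spm_train >/dev/null 2>&1 \
    && [ -f "${CORPUS_DIR}/corpus.tc.${SOURCE_LANG}" ]; then
  train_all
fi
```

Keeping the vocabulary size in the model prefix means each configuration writes its own `.model` and `.vocab` files, so the script can be re-run for a new size without overwriting earlier results.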

MaksymDel commented 7 years ago

The WMT17 organisers have already published a preprocessed version of the data: link to data. The scripts used for preprocessing are included.

However, they have tried to keep the pre-processing fairly 'light touch' (only Moses standard preprocessing).

It could be a good unified starting point for us here.

dennybritz commented 7 years ago

This is great, yes. It will make preparing the datasets a lot easier.

dennybritz commented 7 years ago

I just looked at the data and you can use the following script to process any pair using https://github.com/google/sentencepiece:

#! /usr/bin/env bash
# Dependencies:
#   - https://github.com/google/sentencepiece

CORPUS_DIR="$(pwd)"
SOURCE_LANG="en"
TARGET_LANG="de"
VOCAB_SIZE=32000

# Learn a joint BPE model across both corpora
spm_train \
  --input="${CORPUS_DIR}/corpus.tc.${SOURCE_LANG},${CORPUS_DIR}/corpus.tc.${TARGET_LANG}" \
  --model_prefix="${CORPUS_DIR}/bpe" \
  --vocab_size="${VOCAB_SIZE}" \
  --model_type=bpe

# Apply BPE to the training corpus
spm_encode --model="${CORPUS_DIR}/bpe.model" --output_format=piece \
    < "${CORPUS_DIR}/corpus.tc.${SOURCE_LANG}" \
    > "${CORPUS_DIR}/corpus.tc.bpe.${SOURCE_LANG}"
spm_encode --model="${CORPUS_DIR}/bpe.model" --output_format=piece \
    < "${CORPUS_DIR}/corpus.tc.${TARGET_LANG}" \
    > "${CORPUS_DIR}/corpus.tc.bpe.${TARGET_LANG}"

# Apply BPE to all dev data
for lang in "${SOURCE_LANG}" "${TARGET_LANG}"; do
  # Match only files ending in .tc.<lang>; reading line by line keeps paths with spaces intact
  find "${CORPUS_DIR}/dev" -type f -name "*.tc.${lang}" | while read -r infile; do
    echo "${infile}"
    outfile="${infile%.*}.bpe.${lang}"
    spm_encode --model="${CORPUS_DIR}/bpe.model" --output_format=piece < "${infile}" > "${outfile}"
    echo "${outfile}"
  done
done
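After running a script like the one above, it may be worth checking that encoding did not drop or add lines, since the source and target files must stay parallel. The following is a minimal sketch of such a check. The function names are hypothetical, and it assumes the `corpus.tc.<lang>` / `corpus.tc.bpe.<lang>` naming used above. The `decode_pieces` helper assumes `spm_decode` with `--input_format=piece` is available to turn pieces back into text, e.g. before scoring model output.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not part of any existing tool.

# Print the number of lines in a file.
line_count() {
  wc -l < "$1" | tr -d ' '
}

# Succeed if both files have the same number of lines.
same_line_count() {
  [ "$(line_count "$1")" = "$(line_count "$2")" ]
}

# Compare each corpus.tc.<lang> with its corpus.tc.bpe.<lang> counterpart.
# Usage: check_bpe_outputs <corpus_dir> <lang> [<lang> ...]
check_bpe_outputs() {
  local corpus_dir="$1"
  shift
  local lang status=0
  for lang in "$@"; do
    if ! same_line_count "${corpus_dir}/corpus.tc.${lang}" \
                         "${corpus_dir}/corpus.tc.bpe.${lang}"; then
      echo "line count mismatch for ${lang}" >&2
      status=1
    fi
  done
  return "${status}"
}

# Convert a file of BPE pieces back to plain text (requires spm_decode).
# Usage: decode_pieces <model_file> <pieces_file> <output_file>
decode_pieces() {
  spm_decode --model="$1" --input_format=piece < "$2" > "$3"
}
```

For the English-German pair this could be called as `check_bpe_outputs "$(pwd)" en de`; a non-zero exit status indicates that at least one encoded file is no longer aligned with its input.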
