dennybritz opened 7 years ago
The WMT17 organisers have already published a preprocessed version of the data: link to data. The scripts used for preprocessing are included.
However, they have tried to keep the preprocessing fairly 'light touch' (only the standard Moses preprocessing).
It could be a good unified starting point for us here.
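For reference, the standard Moses preprocessing mentioned above typically amounts to tokenization, truecasing, and length-based cleaning. A minimal sketch of that pipeline, assuming a local checkout of https://github.com/moses-smt/mosesdecoder and raw files named corpus.en / corpus.de (paths and file names are illustrative, not what the organisers ran):
#! /usr/bin/env bash
# Sketch of the standard Moses preprocessing referenced above.
# Assumes ./mosesdecoder is a checkout of the Moses repository and
# corpus.en / corpus.de are the raw parallel files; adjust as needed.
MOSES=./mosesdecoder/scripts
for lang in en de; do
  # Tokenize the raw corpus
  ${MOSES}/tokenizer/tokenizer.perl -l ${lang} < corpus.${lang} > corpus.tok.${lang}
  # Train a truecasing model on the tokenized text and apply it
  ${MOSES}/recaser/train-truecaser.perl --model truecase-model.${lang} --corpus corpus.tok.${lang}
  ${MOSES}/recaser/truecase.perl --model truecase-model.${lang} < corpus.tok.${lang} > corpus.tc.${lang}
done
# Drop sentence pairs that are empty or longer than 80 tokens
${MOSES}/training/clean-corpus-n.perl corpus.tc en de corpus.tc.clean 1 80
The corpus.tc.* files produced this way are the kind of input the sentencepiece script below expects.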
This is great, yes. It will make preparing the datasets a lot easier.
I just looked at the data, and you can use the following script to process any language pair using https://github.com/google/sentencepiece:
#! /usr/bin/env bash
# Dependencies:
# - https://github.com/google/sentencepiece
CORPUS_DIR=$(pwd)
SOURCE_LANG="en"
TARGET_LANG="de"
VOCAB_SIZE=32000
# Learn BPE across both corpora
spm_train \
  --input=${CORPUS_DIR}/corpus.tc.${SOURCE_LANG},${CORPUS_DIR}/corpus.tc.${TARGET_LANG} \
  --model_prefix=${CORPUS_DIR}/bpe \
  --vocab_size=${VOCAB_SIZE} \
  --model_type=bpe
# Apply BPE to corpus
spm_encode --model=${CORPUS_DIR}/bpe.model --output_format=piece \
  < ${CORPUS_DIR}/corpus.tc.${SOURCE_LANG} \
  > ${CORPUS_DIR}/corpus.tc.bpe.${SOURCE_LANG}
spm_encode --model=${CORPUS_DIR}/bpe.model --output_format=piece \
  < ${CORPUS_DIR}/corpus.tc.${TARGET_LANG} \
  > ${CORPUS_DIR}/corpus.tc.bpe.${TARGET_LANG}
# Apply BPE to all dev data
for lang in ${SOURCE_LANG} ${TARGET_LANG}; do
  for infile in $(find ${CORPUS_DIR}/dev | grep "\.tc\.${lang}$"); do
    echo ${infile}
    outfile="${infile%.*}.bpe.${lang}"
    spm_encode --model=${CORPUS_DIR}/bpe.model --output_format=piece < ${infile} > ${outfile}
    echo ${outfile}
  done
done
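For evaluation you would normally undo the segmentation on model outputs before scoring; sentencepiece's spm_decode handles that. A small usage sketch (the translations.* file names are just placeholders):
spm_decode --model=${CORPUS_DIR}/bpe.model --input_format=piece \
  < translations.bpe.${TARGET_LANG} \
  > translations.${TARGET_LANG}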
We should prepare datasets for all WMT'17 language pairs. This is also a chance to try out google/sentencepiece as a preprocessor.
Each dataset should come in several configurations, i.e. different vocabulary sizes, and also have a character-level version (see the sketch below).
Together with the raw data files we also need the script that was used for preprocessing.
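The alternative vocabulary sizes and the character-level version can be produced with the same spm_train call by varying --vocab_size and --model_type. A rough sketch, where the loop values, model prefixes, and the character vocabulary size are only illustrative assumptions:
# Sketch: train one sentencepiece model per configuration.
# The vocabulary sizes below are examples, not agreed-upon values.
for size in 8000 16000 32000; do
  spm_train \
    --input=${CORPUS_DIR}/corpus.tc.${SOURCE_LANG},${CORPUS_DIR}/corpus.tc.${TARGET_LANG} \
    --model_prefix=${CORPUS_DIR}/bpe.${size} \
    --vocab_size=${size} \
    --model_type=bpe
done
# Character-level version: --model_type=char segments into single characters.
# Note: the vocab_size must not exceed the corpus character inventory,
# so the value here is an assumption that may need adjusting per corpus.
spm_train \
  --input=${CORPUS_DIR}/corpus.tc.${SOURCE_LANG},${CORPUS_DIR}/corpus.tc.${TARGET_LANG} \
  --model_prefix=${CORPUS_DIR}/char \
  --vocab_size=300 \
  --model_type=char
Applying each resulting .model with spm_encode, exactly as in the script above, then gives one encoded copy of the corpus per configuration.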