rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

Expected format of input #99

Closed hlncrg closed 2 years ago

hlncrg commented 3 years ago

I am trying to understand the format for the input file. What I have is a new sentence on each line. Am I supposed to have a </w> at the end of every word in the file?

rsennrich commented 3 years ago

Hello Helen,

any raw text is fine, as is one sentence per line. subword-nmt will internally extract the vocabulary (the list of word types and their frequencies) from the text file, and create an internal representation of each word that includes </w>.
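To illustrate the idea, here is a minimal Python sketch of that vocabulary extraction step. It is not the actual subword-nmt code: the function name build_vocab and the toy corpus are made up for this example, and it assumes words are split on whitespace and that the end-of-word marker </w> is attached to the last character of each word.

```python
from collections import Counter

def build_vocab(lines):
    """Hypothetical helper: count word types, then mark each word's end with </w>."""
    counts = Counter()
    for line in lines:
        # Assumption: raw text, one sentence per line, words separated by whitespace.
        counts.update(line.split())
    # Represent each word as a tuple of symbols; the last character carries </w>.
    return {
        tuple(word[:-1]) + (word[-1] + "</w>",): freq
        for word, freq in counts.items()
    }

# Toy corpus: raw text with no </w> markers in the input.
corpus = ["the cat sat", "the dog sat"]
vocab = build_vocab(corpus)

for symbols, freq in sorted(vocab.items()):
    print(" ".join(symbols), freq)
```

The point of the sketch is that the marker is added during processing, so the input file itself only needs plain text.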

In subword_nmt/tests/data/, you can find an example of what an input file (corpus.en) can look like, along with the output of learn_bpe with 1000 merge operations (bpe.ref) and of apply_bpe (corpus.bpe.ref.en) for this file.