Hello Helen,
Any raw text is fine, as is one sentence per line. subword-nmt will internally extract the vocabulary (the list of word types and their frequencies) from the text file, and create an internal representation of each word that includes the `</w>` end-of-word marker, so you do not need to add it to the input yourself.
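To make that concrete, here is a minimal sketch of the idea in plain Python. It is not subword-nmt's actual code: the helper names `build_vocab` and `to_symbols` and the two-line corpus are made up for illustration, and it assumes words are separated by whitespace.

```python
from collections import Counter


def build_vocab(lines):
    """Count whitespace-separated word types across raw text lines."""
    vocab = Counter()
    for line in lines:
        vocab.update(line.split())
    return vocab


def to_symbols(word):
    """Split a word into characters and attach </w> to the last one.

    Hypothetical helper: it mirrors the idea of an end-of-word marker,
    not necessarily the library's exact internal representation.
    """
    return tuple(word[:-1]) + (word[-1] + "</w>",)


# Raw input: one sentence per line, with no markers added by hand.
corpus = ["the cat sat", "the cat ran"]

vocab = build_vocab(corpus)
internal = {to_symbols(word): freq for word, freq in vocab.items()}

for symbols, freq in internal.items():
    print(" ".join(symbols), freq)
```

The point is that the marker is added when the vocabulary is built, so the input file itself stays plain text.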
In `subword_nmt/tests/data/`, you can find an example of how an input file (`corpus.en`) can look, along with the output of learn_bpe with 1000 merge operations (`bpe.ref`) and apply_bpe (`corpus.bpe.ref.en`) for this file.
I am trying to understand the format for the input file. What I have is a new sentence on each line. Am I supposed to have a `</w>` at the end of every word in the file?