kpu / preprocess

Corpus preprocessing

truecaser not identical to perl script #10

Open kpu opened 4 years ago

kpu commented 4 years ago

On input "->", the Moses truecase script outputs "- >", but the C++ version outputs "->". The additional space seems to appear regardless of what precedes the ">".

kpu commented 4 years ago

But the tokenizer is supposed to change those to &lt; and &gt;, so it probably doesn't matter. (XML support is out of scope for the C++ version.)
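
For illustration, here is a minimal Python sketch of the kind of escaping the tokenizer applies before truecasing. The function name `escape_specials` is hypothetical, and it covers only the ampersand and the angle brackets. The Moses tokenizer escapes further characters as well. Assuming the input has passed through such a step, a raw ">" would no longer reach the truecaser, which is presumably why the difference should not show up in practice.

```python
# Hypothetical helper, not part of Moses or kpu/preprocess.
# It mirrors only a subset of the tokenizer's escaping step.
def escape_specials(text: str) -> str:
    # Replace "&" first so the entities produced below are not escaped again.
    return (
        text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
    )


if __name__ == "__main__":
    print(escape_specials("a -> b"))  # prints: a -&gt; b
```

After escaping, the arrow becomes "-&gt;", so neither truecaser sees a bare ">" in tokenized input.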