bjtommychen closed this issue 5 years ago
Thanks. One more question: a subword system will convert 'hello world' into '_he llo _w or ld'. Are the KenLM tools still suitable for this? And for subwords, will KenLM treat the input as character-based or not? Thanks.
The tokens are anything separated by whitespace. Where you choose to put whitespace is your choice / problem.
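To make that concrete, here is a rough Python sketch of what "tokens are anything separated by whitespace" means for subword input. It is only an illustration, not KenLM's code: the helper names `whitespace_tokens` and `ngram_counts` are made up for this example, and `str.split()` is assumed to be a close enough stand-in for the whitespace splitting.

```python
from collections import Counter


def whitespace_tokens(line):
    # Hypothetical helper: str.split() with no arguments splits on runs
    # of whitespace, so each subword piece becomes one token.
    return line.split()


def ngram_counts(lines, order=2):
    # Count n-grams over whitespace tokens, padding each line with
    # sentence-boundary markers. This is a toy counter for illustration.
    counts = Counter()
    for line in lines:
        tokens = ["<s>"] + whitespace_tokens(line) + ["</s>"]
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts


# Subword segmentation: the model sees five tokens.
subword_line = "_he llo _w or ld"
# Character segmentation: the model sees one token per character.
char_line = "h e l l o _ w o r l d"

print(whitespace_tokens(subword_line))
print(whitespace_tokens(char_line))
print(ngram_counts([subword_line]))
```

So the model is character-based only if you put whitespace between every character. With the subword segmentation above, each piece such as `_he` is a single vocabulary item.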
--discount_fallback should be all you need to deal with a finite vocabulary. It will still generate <unk> though.
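For reference, a minimal sketch of passing that flag to `lmplz` from Python. It assumes the `lmplz` binary from a KenLM build is on your PATH. The file names, the default order of 3, and the helper names `lmplz_command` and `train` are placeholders for this example, not anything from the thread.

```python
import subprocess


def lmplz_command(text_path, arpa_path, order=3):
    # Build the argument list for lmplz. The paths and the order are
    # placeholders; --discount_fallback is the flag mentioned above.
    return [
        "lmplz",
        "-o", str(order),
        "--discount_fallback",
        "--text", text_path,
        "--arpa", arpa_path,
    ]


def train(text_path, arpa_path, order=3):
    # Runs lmplz; raises CalledProcessError if it exits with an error.
    # Assumes lmplz is installed and on PATH.
    subprocess.run(lmplz_command(text_path, arpa_path, order), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this works without KenLM.
    print(" ".join(lmplz_command("subwords.txt", "subwords.arpa")))
```

The input file would be the already-segmented text, one sentence per line, with pieces separated by spaces.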