Open maggieezzat opened 5 years ago
Hi, sorry for the late reply. It's been 4 months.
First, check whether each subword, such as 'ein' and 'tausend', is in your vocab.
If they are, the word 'eintausendneunhundertneunzig' may have appeared many times in the corpus, more often than the threshold set by the min_count flag.
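To make the two checks above concrete, here is a minimal sketch of greedy longest-match-first tokenization, the scheme BERT-style WordPiece tokenizers use. It is not the vocab builder's own code: the function name `wordpiece_tokenize`, the toy vocabularies, and the `##` prefix for continuation pieces are assumptions for illustration. It shows that if the full word is in the vocab, it always wins over its parts.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word (illustrative sketch)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Assumed convention: non-initial pieces carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


word = "eintausendneunhundertneunzig"

# Toy vocab containing only the parts (hypothetical entries).
parts_only = {"ein", "##tausend", "##neun", "##hundert", "##zig"}
print(wordpiece_tokenize(word, parts_only))

# Same vocab, but the full word also made it in (e.g. it was frequent enough).
with_full_word = parts_only | {word}
print(wordpiece_tokenize(word, with_full_word))
```

With the first vocab the word is split into its parts; with the second it comes back as one token, because the longest match is tried first. Under these assumptions, checking whether the full word itself is in your vocab file tells you which case you are in.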
I tried using the vocab builder on the German Wikipedia, but some words aren't accurately split into their subwords. For example, "eintausendneunhundertneunzig" is treated as a single subword, although I expected "ein", "tausend", "neun", "hundert", "neun", "zig". Are there any tweaks to make the model more specific to German, which has many compound words? Thank you