kwonmha / bert-vocab-builder

Builds a wordpiece (subword) vocabulary compatible with Google Research's BERT

Not accurate sub-words for German #5

Open maggieezzat opened 5 years ago

maggieezzat commented 5 years ago

I tried using the vocab builder on the German Wikipedia, but some words aren't split into sub-words accurately. For example, "eintausendneunhundertneunzig" is treated as a single sub-word, although I expected "ein", "tausend", "neun", "hundert", "neun", "zig". Are there any tweaks to make the model more specific to German, which has many compound words? Thank you.

kwonmha commented 5 years ago

Hi, sorry for the late reply. It's been 4 months.

First, check whether each of the subwords, like 'ein' and 'tausend', is in your vocab. If they are, the word 'eintausendneunhundertneunzig' may have appeared many times in the corpus, more often than the threshold set by the min_count flag.
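
To illustrate the point above, here is a minimal sketch of greedy longest-match-first WordPiece tokenization, the matching scheme BERT's tokenizer applies to a finished vocab. It is not code from this repository: the function name `wordpiece_tokenize` and both toy vocabularies are made up for the example, and continuation pieces are assumed to carry the usual `##` prefix. It shows that once the full word is itself a vocab entry, the longest match wins and the word is never split, even if every smaller piece is also present.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word (hypothetical helper)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces are stored with a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocab entry matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


word = "eintausendneunhundertneunzig"

# Toy vocab (assumption): only the pieces, not the full word.
pieces_only = {"ein", "##tausend", "##neun", "##hundert", "##zig"}

# Toy vocab (assumption): same pieces, plus the full word as its own entry.
with_whole_word = pieces_only | {word}

print(wordpiece_tokenize(word, pieces_only))
# ['ein', '##tausend', '##neun', '##hundert', '##neun', '##zig']
print(wordpiece_tokenize(word, with_whole_word))
# ['eintausendneunhundertneunzig']
```

To run the same check against a real vocab, load the vocab file into a set (one token per line) and pass it in place of the toy sets.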