Demfier closed this issue 5 years ago.
Hi,
No, this is the SentencePiece segmentation algorithm, which produces ~5K subwords for open-vocabulary generation.
@ozancaglayan - Thanks for the response. I think I understand now: SentencePiece doesn't build a word-level vocab. Instead, it uses a subword algorithm to build a vocab that arguably supports open-vocabulary generation.
Thanks again :smile:
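
For anyone else landing here, a toy sketch of the difference discussed above. This is not the SentencePiece algorithm itself, just a greedy longest-match segmenter over a hand-written piece list, and every name in it (`word_level_encode`, `subword_encode`, `word_vocab`, `pieces`) is made up for illustration. It only shows why a word-level vocab needs `<unk>` for unseen words while a subword vocab can usually spell them out of smaller pieces.

```python
# Toy illustration only: greedy longest-match segmentation over a tiny,
# hand-written piece list. The real SentencePiece model is learned from data
# and segments differently; all names here are hypothetical.
UNK = "<unk>"
SPACE = "▁"  # word-boundary marker, as used by SentencePiece-style pieces


def word_level_encode(sentence, vocab):
    """Map each whitespace-separated word to itself, or to <unk> if unseen."""
    return [word if word in vocab else UNK for word in sentence.split()]


def subword_encode(sentence, pieces):
    """Split each word into the longest known pieces, left to right."""
    tokens = []
    for word in sentence.split():
        text = SPACE + word
        i = 0
        while i < len(text):
            # Try the longest candidate first, then shrink.
            for j in range(len(text), i, -1):
                if text[i:j] in pieces:
                    tokens.append(text[i:j])
                    i = j
                    break
            else:
                # No piece covers this character: emit <unk> and move on.
                tokens.append(UNK)
                i += 1
    return tokens


word_vocab = {"the", "cat", "walks"}
pieces = {"▁the", "▁cat", "▁walk", "s", "ing", "▁"}

print(word_level_encode("the cat walking", word_vocab))
print(subword_encode("the cat walking", pieces))
```

Here "walking" is out of vocabulary at the word level, but the subword version covers it with `▁walk` + `ing`, so `<unk>` only shows up for characters that no piece covers.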
Hi,
It is mentioned in the paper that a SentencePiece vocab of size 5K was created for both English and Portuguese. Was something like `max_length` set for the sentences, or did you use all the sentences and replace the OOV words with the `<unk>` token?

Thanks in advance!
Gaurav.