srvk / how2-dataset

This repository contains code and metadata for the How2 dataset
https://srvk.github.io/how2-dataset/

Query related to pre-processing #3

Closed — Demfier closed this issue 5 years ago

Demfier commented 5 years ago

Hi,

It is mentioned in the paper that a SentencePiece vocab of size 5K was created for both English and Portuguese. Was something like a max_length set for the sentences, or did you use all the sentences and replace the OOV words with an <unk> token?

Thanks in advance! Gaurav.

ozancaglayan commented 5 years ago

Hi,

No, this is the SentencePiece segmentation algorithm, which produces a vocabulary of ~5K subwords for open-vocabulary generation.
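
As an illustration, here is a minimal sketch of how such a ~5K subword vocabulary could be built and applied with the SentencePiece Python package. The file names (`train.en.txt`, `how2_en_sp5k`) and the `model_type` choice are assumptions for the example, not details confirmed in this thread.

```python
import sentencepiece as spm

# Train a subword model with a ~5K vocabulary.
# "train.en.txt" (one sentence per line) is a hypothetical input file.
spm.SentencePieceTrainer.train(
    input="train.en.txt",
    model_prefix="how2_en_sp5k",
    vocab_size=5000,
    model_type="unigram",  # assumption; BPE is another common choice
)

# Load the trained model and segment a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="how2_en_sp5k.model")
pieces = sp.encode("an unseen word like retrosynthesis", out_type=str)
print(pieces)
# Rare or unseen words are split into known subword pieces,
# so no <unk> replacement or max_length truncation is needed for vocabulary reasons.
```
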

Demfier commented 5 years ago

@ozancaglayan - Thanks for the response. I think I understand now: SentencePiece doesn't build a word-level vocab. Instead, it uses subword algorithms to build a vocab that supports open-vocabulary generation.

Thanks again :smile: