vietai / ViT5

MIT License

Which tokenizer is used to tokenize the dataset? #5

Closed batman-do closed 2 years ago

batman-do commented 2 years ago

I'd like to know which tokenizer was used to preprocess the training data for ViT5 (VnCoreNLP, underthesea, ...). Thanks!

justinphan3110 commented 2 years ago

Hi @Dean98AI, we built our own SentencePiece tokenizer, which can be loaded with:

```python
from transformers import AutoTokenizer

model_size = "base"  # or "large"
tokenizer = AutoTokenizer.from_pretrained(f"VietAI/vit5-{model_size}")
```
batman-do commented 2 years ago

Hi @justinphan3110, so you didn't use word segmentation for training? My question is about how the training data was preprocessed: did you apply word segmentation (VnCoreNLP, underthesea, ...), or did you use raw text without it? Thanks!

heraclex12 commented 2 years ago

Hi @Dean98AI,

We didn't use any word segmentation tools in preprocessing the data. Raw texts were tokenized directly by our SentencePiece tokenizer.
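
For anyone landing here later, this is a minimal sketch of what that means in practice: raw Vietnamese text goes straight into the tokenizer, with no VnCoreNLP/underthesea segmentation step in between. The sample sentence is just an illustration, not from the ViT5 training data.

```python
from transformers import AutoTokenizer

# Load the ViT5 SentencePiece tokenizer from the Hugging Face Hub.
model_size = "base"  # or "large"
tokenizer = AutoTokenizer.from_pretrained(f"VietAI/vit5-{model_size}")

# Raw text is tokenized directly -- no word-segmentation preprocessing.
text = "Xin chào các bạn"  # hypothetical example sentence
ids = tokenizer(text).input_ids
tokens = tokenizer.convert_ids_to_tokens(ids)

# Round-trip: decoding (minus special tokens) recovers the raw text.
decoded = tokenizer.decode(ids, skip_special_tokens=True)
print(tokens)
print(decoded)
```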

batman-do commented 2 years ago

Thanks, guys!