I want to know which tokenizer was used to preprocess the training data for ViT5 (vncorenlp, underthesea, ...). Thanks, guys!
Hi @Dean98AI, we built our own SentencePiece tokenizer, which can be loaded with:
from transformers import AutoTokenizer

model_size = "base"  # or "large"
tokenizer = AutoTokenizer.from_pretrained(f"VietAI/vit5-{model_size}")
Hi @justinphan3110, so you didn't use word segmentation for training? My question is about how the training data was preprocessed: was it word-segmented (with vncorenlp, underthesea, ...), or was raw text used directly? Thank you, guys!
Hi @Dean98AI,
We didn't use any word segmentation tools when preprocessing the data. Raw texts were tokenized directly by our SentencePiece tokenizer.
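For illustration, here is a minimal sketch of that workflow (the sample sentence and variable names are just placeholders, not from the original post):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")

# Raw, unsegmented Vietnamese text: no vncorenlp/underthesea step beforehand
raw_text = "Xin chào, tôi muốn huấn luyện mô hình ViT5."
encoded = tokenizer(raw_text, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))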
Thank you, guys!