homebrewltd / llama3-s

Llama3.1 learns to Listen

feat: Re-generate dataset with updated tokenizer #42

Closed (tikikun closed this 1 week ago)

tikikun commented 2 weeks ago

We will change the tokenizer for the next version of the model.

tikikun commented 2 weeks ago

Hi @bachvudinh, per your concern, please download this data and use it to further expand the training set:

https://huggingface.co/datasets/facebook/multilingual_librispeech

bachvudinh commented 2 weeks ago

The dataset is uploaded to HF here: https://huggingface.co/datasets/homebrewltd/raw-speech-whispervq-v2 and to the internal S3 at myminio/data/FB_multilingual_librispeech/ (cc @tikikun)

tikikun commented 1 week ago

it's done