k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Multi Lingual model #1612

Open AlexandderGorodetski opened 2 weeks ago

AlexandderGorodetski commented 2 weeks ago

Hello guys,

I have to train a multilingual model using my in-house data. I have 10K hours for Lang1 and 5K hours for Lang2.

I wanted to ask you about the BPE algorithm. Since Lang1 has twice as much data, I guess I have to duplicate the Lang2 text (use it twice) so that the number of tokens from Lang2 is approximately the same as from Lang1.

And of course I will increase the number of BPE tokens (the vocabulary size) from 500 to 1000.

Is all this correct?

Thanks a lot, AlexG.

JinZr commented 2 weeks ago

hi alex,

I'm not sure about the duplication part, but I feel like it won't be necessary to duplicate the Lang2 text to match the number of lines of Lang1.

best jin

(Sent by email in reply to AlexandderGorodetski's post of May 2, 2024, 16:55.)
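For reference, the balancing arithmetic asked about in this thread can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`word_count`, `repeat_factor`, `balanced_corpus`) are hypothetical and not part of icefall, and whitespace-separated word counts stand in for the amount of text per language. Whether balancing is needed at all is left open, as the reply above suggests it may not be.

```python
def word_count(lines):
    """Count whitespace-separated words across all transcript lines."""
    return sum(len(line.split()) for line in lines)


def repeat_factor(majority_lines, minority_lines):
    """Whole number of copies of the smaller corpus that roughly
    matches the word count of the larger one (hypothetical helper)."""
    major = word_count(majority_lines)
    minor = word_count(minority_lines)
    if minor == 0:
        raise ValueError("minority corpus is empty")
    return max(1, round(major / minor))


def balanced_corpus(majority_lines, minority_lines):
    """Concatenate the larger corpus with repeated copies of the smaller one."""
    k = repeat_factor(majority_lines, minority_lines)
    return list(majority_lines) + list(minority_lines) * k


# Toy transcripts standing in for Lang1 (more text) and Lang2 (less text).
lang1 = ["a b c d", "e f g h"]  # 8 words
lang2 = ["x y", "z w"]          # 4 words

factor = repeat_factor(lang1, lang2)
combined = balanced_corpus(lang1, lang2)
print(factor, len(combined), word_count(combined))
```

The combined list could then be written to a text file and passed to whichever BPE trainer is used. With a 2:1 ratio like the 10K versus 5K hours in the question, the factor comes out as 2, assuming transcript length scales with audio duration.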