Open AlexandderGorodetski opened 2 weeks ago
Hi Alex,
I'm not sure about the duplication part, but I don't think it will be necessary to duplicate the Lang2 text to match the number of lines of Lang1.
Best, Jin
On May 2, 2024, at 16:55, AlexandderGorodetski wrote:
Hello guys,
I have to train a multilingual model using my in-house data. I have 10K hours for Lang1 and 5K hours for Lang2.
I wanted to ask you about the BPE algorithm. Because Lang1 has twice as much data, I guess I should duplicate the Lang2 text so that the number of tokens from Lang1 is approximately the same as from Lang2.
I will also increase the number of BPE tokens from 500 to 1000.
Is all this correct?
Thanks a lot, AlexG.
View it on GitHub: https://github.com/k2-fsa/icefall/issues/1612
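For reference, the duplication idea in the question can be sketched in a few lines of Python. This is only an illustration of the proposal, not a recommendation and not anything taken from icefall: the function name `repeat_to_balance` and the toy transcripts are hypothetical, and word counts are used as a rough proxy for text volume, since 10K versus 5K hours of audio does not guarantee an exact 2:1 ratio in the transcripts.

```python
from typing import List


def repeat_to_balance(large: List[str], small: List[str]) -> List[str]:
    """Repeat the smaller corpus so its word count roughly matches the larger one."""
    # Whitespace-separated words serve as a rough proxy for text volume.
    large_words = sum(len(line.split()) for line in large)
    small_words = sum(len(line.split()) for line in small)
    if small_words == 0:
        return list(small)
    # Round to the nearest whole number of copies, keeping at least one.
    copies = max(1, round(large_words / small_words))
    return small * copies


# Toy stand-ins for the two transcripts (hypothetical data).
lang1_lines = ["a b c d", "e f g h"]  # 8 words
lang2_lines = ["x y z w"]             # 4 words

balanced_lang2 = repeat_to_balance(lang1_lines, lang2_lines)
# Combined text that would then be passed to BPE training.
combined = lang1_lines + balanced_lang2
print(len(balanced_lang2), len(combined))
```

Whether this balancing step is needed at all is the open question in this thread; the reply above suggests it may not be.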