utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0
1.85k stars 342 forks

Vocab size while fine tuning language model #223

Open Sagar1094 opened 4 years ago

Sagar1094 commented 4 years ago

Hi, I used around 8,000,000 text sentences while fine-tuning the language model, but the newly added vocabulary size is only 50,000. My data has at least 1,000,000-2,000,000 tokens to be added. Can I explicitly change the vocab size while fine-tuning? Thanks

krannnn commented 4 years ago

@Sagar1094 can you please share the code that you are using for LM fine-tuning? Thanks

Sagar1094 commented 4 years ago

Hi, I have followed the tutorial for the same. Regards, Sagar


Sagar1094 commented 4 years ago

Hi, my data is a little different. I have Indian addresses, for example "i 32 mangol puri delhi", "b-8/205 rohini delhi", "kormangalam bengaluru". I want to create an address classifier. These addresses also have labels associated with them, such as "26-0", "23-2".

Using pre-trained BERT, I think it is impossible to train on this kind of data, as most of the words would be out of vocabulary. Can you please help me and suggest an alternative approach? I have tried training BERT, ELECTRA, and RoBERTa models from scratch with a huge vocabulary of 2,800,000 words, but it is failing. So I tried fine-tuning with fast-bert, which also doesn't work.

Please help 🙏😊 Regards, Sagar Gupta
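
On the out-of-vocabulary concern: BERT-style tokenizers use WordPiece, which splits a word that is not in the vocabulary into smaller known pieces (continuation pieces are marked with `##`), so the vocabulary does not need one entry per distinct word. The sketch below is a simplified, standard-library-only illustration of the greedy longest-match-first split, not the fast-bert or Hugging Face implementation. `TOY_VOCAB` is invented for the example, and real tokenizers also lowercase and split on punctuation first. The second helper estimates the size of the token embedding table, assuming a hidden size of 768 (the BERT-base value; the models in this thread may differ), which may help explain why a 2,800,000-word vocabulary is hard to train.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


def embedding_parameters(vocab_size: int, hidden_size: int = 768) -> int:
    """Number of weights in a token embedding table (hidden_size is an assumption)."""
    return vocab_size * hidden_size


# Toy vocabulary, invented for illustration only.
TOY_VOCAB = {"kor", "##man", "##gal", "##am", "beng", "##alu", "##ru",
             "delhi", "roh", "##ini"}

if __name__ == "__main__":
    for word in "kormangalam bengaluru rohini delhi".split():
        print(word, "->", wordpiece_tokenize(word, TOY_VOCAB))
    print("embedding weights, 30,522-token vocab:", embedding_parameters(30_522))
    print("embedding weights, 2,800,000-word vocab:", embedding_parameters(2_800_000))
```

With a sub-word vocabulary of tens of thousands of pieces, rare locality names are still representable as sequences of pieces, whereas a word-level vocabulary of 2,800,000 entries would need over two billion embedding weights at hidden size 768 before counting any transformer layers.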
