matthiaslmz / BERT-Punctuation-Restoration


Re-train the model for other languages #1

Open thaitrinh opened 3 years ago

thaitrinh commented 3 years ago

Hi Matthias,

is it possible to have a minimal example of how to re-train the model for other languages? What does the training data look like? What are the "labels"? Could you please explain the main idea?

Thank you very much!

matthiaslmz commented 3 years ago

Hello @thaitrinh,

I suppose you are looking to pre-train BERT's language model for another language? If so, that means pre-training BERT from scratch, and this repo is built for fine-tuning on downstream tasks (specifically, classification tasks like Punctuation Restoration), not for pre-training BERT from scratch. If you'd like to re-train BERT for other languages, I suggest you look at this solid example by HuggingFace here.

thaitrinh commented 3 years ago

Hi Matthias,

Thanks for your reply! I want to use a pre-trained German BERT model and fine-tune it on a downstream task (punctuation restoration in German). A pre-trained German BERT language model can be downloaded from HuggingFace.

My question is: once I have downloaded the pre-trained language model (German BERT), how can I use your repo to fine-tune the model on punctuation restoration? Could you please show a small example of how to run the fine-tuning code, and maybe also a small example of what the preprocessed data looks like (before it is fed into the fine-tuning process)?

Many thanks and best wishes!
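For illustration, punctuation restoration is commonly framed as token classification: the punctuation is stripped from the text, and each word is labeled with the punctuation mark that originally followed it. A minimal preprocessing sketch along those lines (the label names `O`, `COMMA`, `PERIOD`, `QUESTION` are assumptions for illustration, not necessarily this repo's actual schema):

```python
# Hypothetical sketch: convert punctuated text into (token, label) pairs
# for a token-classification task. The label names are illustrative
# assumptions, not the actual schema used by this repo.

PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def make_examples(text):
    """Strip trailing punctuation from each word and label the word
    with the punctuation that followed it ("O" if none)."""
    pairs = []
    for raw in text.split():
        label = "O"
        # Peel off trailing punctuation characters, remembering the last one.
        while raw and raw[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[raw[-1]]
            raw = raw[:-1]
        if raw:
            pairs.append((raw, label))
    return pairs

print(make_examples("Hallo Matthias, wie geht es dir?"))
# → [('Hallo', 'O'), ('Matthias', 'COMMA'), ('wie', 'O'),
#    ('geht', 'O'), ('es', 'O'), ('dir', 'QUESTION')]
```

At fine-tuning time, the stripped words would be fed to the BERT tokenizer and the labels aligned to its subword tokens, with the model predicting one punctuation label per token.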