octanove / shiba

PyTorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.

Pretraining and Finetuning on Arabic Corpus #10

Closed shamweelm closed 1 year ago

shamweelm commented 1 year ago

We have pre-trained the model on an Arabic dataset and then fine-tuned it on sentiment analysis data. The recall stays constant at 0.5, as seen in the screenshot below.

We have done the following steps for pretraining:

  1. Converted txt files to jsonl using to_examples.py
  2. Ran train.py for 1000 steps with masking_type as rand_char since we want a character level tokenizer.
  3. Ran convert_lm_checkpoint.py to convert checkpoints to pt model.
  4. Ran the finetuning script for sentiment analysis, using finetune_livedoor_classification.py as a reference with our dataset.

Are we missing any steps in the pretraining that need to be done for other languages? Or do we need any changes to finetune_livedoor_classification other than the dataset and the pretrained model?

[Screenshot: evaluation metrics showing recall fixed at 0.5]
Mindful commented 1 year ago

Hi,

First of all, a sidenote:

> Ran train.py for 1000 steps with masking_type as rand_char since we want a character level tokenizer.

The masking type and the tokenization aren't related like this; the tokenizer is always character-level. The different masking strategies mask in different ways, but the "unit" processed by the model is always a single character. Span masking is just a harder task: a single masked character is usually very easy to guess from context if the nearby characters aren't also masked.
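
To make the difference concrete, here's a rough, self-contained illustration (not the repo's actual masking code; the mask id and span length below are made-up placeholder values):

```python
import random

MASK_ID = 0xE000  # placeholder: a private-use codepoint standing in for the mask character

def mask_random_chars(codepoints, mask_prob=0.15):
    """Mask individual characters independently (rand_char-style)."""
    return [MASK_ID if random.random() < mask_prob else c for c in codepoints]

def mask_random_span(codepoints, span_len=4):
    """Mask one contiguous run of characters (span-style)."""
    start = random.randrange(0, max(1, len(codepoints) - span_len))
    return [MASK_ID if start <= i < start + span_len else c
            for i, c in enumerate(codepoints)]

text = "مرحبا بالعالم"                    # Arabic for "hello, world"
codepoints = [ord(ch) for ch in text]     # the model's "tokens" are just Unicode codepoints

print(mask_random_chars(codepoints))      # isolated gaps, easy to fill in from the neighbours
print(mask_random_span(codepoints))       # a whole missing chunk, much harder to guess
```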

> Are we missing any steps in the pretraining that need to be done for other languages? Or do we need any changes to finetune_livedoor_classification other than the dataset and the pretrained model?

I don't see any obvious missing steps in what you describe; however, the devil is in the details with this kind of thing. I would start by making sure your pretraining is working properly, specifically that:

  1. Your loss during pretraining is steadily going down, indicating the model is actually learning from the masked-character objective
  2. Your pretrained LM makes reasonable guesses for masked characters - before trying to fine-tune, you should make sure the pretrained model has actually learned something useful (see the sketch after this list)
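
For the second check, something along these lines works as a quick probe. This is only a sketch: `lm_model` stands in for whatever you get after loading your converted checkpoint with its LM head attached, and is assumed to return per-position logits over codepoints, which may not match the repo's exact interface; `MASK_ID` is likewise a placeholder.

```python
import torch

MASK_ID = 0xE000  # placeholder mask codepoint, not SHIBA's real constant

def top_guesses(lm_model, text: str, position: int, k: int = 5):
    """Mask one character of `text` and return the model's top-k guesses for it."""
    ids = [ord(ch) for ch in text]
    ids[position] = MASK_ID
    input_ids = torch.tensor([ids])          # shape: (1, seq_len)
    with torch.no_grad():
        logits = lm_model(input_ids)         # assumed shape: (1, seq_len, num_codepoints)
    top = torch.topk(logits[0, position], k).indices.tolist()
    return [chr(i) for i in top]

# Example: if the model has learned anything useful about Arabic, the original character
# should usually show up among the top guesses for high-frequency, easy contexts.
# print(top_guesses(lm_model, "مرحبا بالعالم", position=3))
```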

We only trained on Japanese, so it's possible that something weird is happening with Arabic pretraining, but checking the above two things should tell you whether that's the case. Fine-tuning for sentiment analysis should be pretty straightforward as long as the pretrained model is working properly, although I would also double-check that nothing is going wrong in how the input/output labels are processed, since the model's output doesn't appear to be changing at all.
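
As a concrete diagnostic (a generic sketch, not part of the repo's fine-tuning code): a macro-averaged recall pinned at exactly 0.5 on a binary task is the classic signature of a model predicting only one class, so it's worth printing the distribution of predicted labels next to per-class recall. `predictions` and `gold_labels` below stand for whatever your evaluation loop produces.

```python
from collections import Counter

def inspect_predictions(predictions, gold_labels):
    """Print the predicted-label distribution and per-class recall."""
    print("predicted label counts:", Counter(predictions))
    print("gold label counts:     ", Counter(gold_labels))
    for label in sorted(set(gold_labels)):
        preds_for_label = [p for p, g in zip(predictions, gold_labels) if g == label]
        recall = sum(p == label for p in preds_for_label) / len(preds_for_label)
        print(f"recall for class {label}: {recall:.3f}")

# If one class never appears in the predicted-label counts, the problem is upstream of the
# metric: check how label strings are mapped to ids and what the classification head outputs.
```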

If you can pinpoint a specific problem I can try and offer more help, but I don't have enough information to point to anything specific right now.

shamweelm commented 1 year ago

Got it. Thank you for the detailed explanation. We will continue the pretraining for more steps and will reopen the issue if we run into any trouble.