shamweelm closed this issue 1 year ago
Hi,
First of all, a sidenote:
Ran train.py for 1000 steps with masking_type as rand_char since we want a character level tokenizer.
The masking type and the tokenization aren't related in that way; the tokenizer is always character-level. The different masking strategies mask in different ways, but the "unit" being processed by the model is always a single character. Span masking is just a harder task: a single masked character is usually very easy to guess from context if the nearby characters aren't also masked.
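To make the distinction concrete, here is a minimal sketch of the two kinds of masking applied to the same character sequence. It is not the repository's implementation: the function names, the `[MASK]` token, and the `mask_prob` and `max_span` parameters are all made up for illustration. The point is only that both strategies operate on single characters and differ in which positions they hide.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration only


def mask_random_chars(text, mask_prob=0.15, rng=None):
    """Mask each character independently with probability mask_prob."""
    rng = rng or random.Random()
    return [MASK if rng.random() < mask_prob else ch for ch in text]


def mask_spans(text, mask_prob=0.15, max_span=5, rng=None):
    """Mask contiguous runs of characters until about mask_prob of them are hidden."""
    rng = rng or random.Random()
    tokens = list(text)
    if not tokens:
        return tokens
    # Number of characters to hide, clamped to the sequence length.
    budget = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    masked = set()
    while len(masked) < budget:
        # Never pick a span longer than the remaining budget.
        span_len = min(rng.randint(1, max_span), budget - len(masked))
        start = rng.randrange(len(tokens))
        masked.update(range(start, min(start + span_len, len(tokens))))
    for i in masked:
        tokens[i] = MASK
    return tokens


if __name__ == "__main__":
    sample = "character level input"
    # In both cases the unit is one character; only the masked positions differ.
    print(mask_random_chars(sample, rng=random.Random(0)))
    print(mask_spans(sample, rng=random.Random(0)))
```

With isolated masks, the neighbours on both sides are usually visible, so the missing character is easy to fill in. With spans, the model has to reconstruct several adjacent characters at once.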
Are we missing any steps in the pretraining that need to be done for other languages? Or any changes to the finetune_livedoor_classification except the dataset and the pretrained model?
I don't see any obvious missing steps from what you describe, but the devil is in the details with stuff like this. I would start by making sure your pretraining is working properly, specifically that:
We only ran on Japanese, so it's possible that something weird is happening with Arabic pretraining, but checking the above two things should tell you that. Fine-tuning to do sentiment analysis should be pretty straightforward as long as the pretrained model is working properly, although again I would double check there that nothing is going wrong with processing input/output labels since the model's output doesn't appear to be changing at all.
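One quick way to check whether the model's output is changing is to look at the distribution of predicted labels on the evaluation set. The sketch below assumes a two-class sentiment task and that the reported recall is macro-averaged; neither is stated in this thread, and the labels and helper names are made up for illustration. Under those assumptions, a model that always predicts the same class gets a recall of exactly 0.5.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for each label that appears in y_true."""
    recalls = {}
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls[label] = hits / total
    return recalls


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical labels: 0 = negative, 1 = positive.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
# A collapsed model that outputs the same class for every input.
y_pred = [1] * len(y_true)

print(Counter(y_pred))                   # only one predicted class appears
print(per_class_recall(y_true, y_pred))  # one class at 1.0, the other at 0.0
print(macro_recall(y_true, y_pred))      # 0.5
```

If the predicted-label counts on your data look like this, the problem is more likely in the label processing or in the fine-tuning itself than in the metric.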
If you can pinpoint a specific problem I can try and offer more help, but I don't have enough information to point to anything specific right now.
Got it. Thank you for the detailed explanation. We will continue the pretraining for more steps. Will reopen the issue if I face any troubles.
We have pre-trained the model on an Arabic dataset and then fine-tuned it on sentiment analysis data. The recall appears to be constant at 0.5, as seen in the screenshot below.
We have done the following steps for pretraining:
Are we missing any steps in the pretraining that need to be done for other languages? Or any changes to the finetune_livedoor_classification except the dataset and the pretrained model?