huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

LM fine-tuning for non-English dataset (Hindi) #1068

nikhilno1 closed this issue 4 years ago

nikhilno1 commented 5 years ago

❓ Questions & Help

Previously, I made a movie review sentiment classifier app using this wonderful library. (Links: https://deployment-247905.appspot.com/ and https://towardsdatascience.com/battle-of-the-heavyweights-bert-vs-ulmfit-faceoff-91a582a7c42b)

Now I am looking to build a language model that will be fine-tuned on Hindi movie songs. Of the pretrained models, "bert-base-multilingual-cased" and "xlm-mlm-xnli15-1024" appear to be the ones I can use (they support Hindi). From what I understand, GPT/GPT-2/Transformer-XL/XLNet are auto-regressive models that can be used for text generation, whereas BERT and XLM are trained with masked language modeling (MLM), so they won't do a good job at text generation. Is that a fair statement?

Anyway, just to play around, I modified the run_generation.py script to also include XLM. This gave the error below:

File "run_generation_xlm.py", line 128, in sample_sequence
    next_token_logits = outputs[0][0, -1, :] / temperature
IndexError: too many indices for tensor of dimension 2

So I simply removed the first index, after which it could at least run: `next_token_logits = outputs[0][-1, :] / temperature`
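
For context, a shape-tolerant version of that indexing might look like the sketch below (a minimal sketch only; `outputs` and `temperature` are as in run_generation.py, and the 2-D branch matches what this XLM call returned for me):

```python
# Sketch: branch on the logits rank instead of hard-coding one shape.
logits = outputs[0]
if logits.dim() == 3:
    # (batch, seq_len, vocab) -- what GPT-2-style models return
    next_token_logits = logits[0, -1, :] / temperature
else:
    # (seq_len, vocab) -- what this XLM call returned
    next_token_logits = logits[-1, :] / temperature
```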

However, the results are lousy:

Model prompt >>> i had lunch
just i-only day cousin from me the the the the me, the the,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, " ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Model prompt >>> i had lunch
) could me freaking these prone right so mostly so his f**king i word february our so as made gig february more " tina <special4>and dy f**k r man roll ride ride ride ride ride ride ride ride ride ride ride ride ride ride ride riding riding riding riding riding riding riding riding riding riding riding riding riding riding riding it it how how how i the all all know know and and and and and and and and and and and and and and and and and and and and and and and and and and and and

Questions:

1) Can I use BERT or XLM for automatic text generation? The reason to pick these is the availability of pretrained models.
2) Are there instructions available for fine-tuning any of these models on non-English datasets?

Thanks.

PS: I'm looking for a buddy to work with on solving such problems. If you are interested, please get in touch with me.

LysandreJik commented 5 years ago

Hello! Thanks for showcasing the library in your article!

You are totally correct about the auto-regressive models (XLNet, Transformer-XL, GPT-2, etc.). Those models can efficiently predict the next word in a sequence because they attend only to the left side of the sequence; they are usually trained with causal language modeling (CLM).

Using BERT or RoBERTa for text generation won't work, as they were trained on bi-directional context with masked language modeling (MLM). However, XLM has several checkpoints with different training schemes; you can see them here.
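
To illustrate, here is a minimal sketch of what an MLM checkpoint is actually good at: filling a masked slot using context on both sides, rather than continuing text left to right. (This is just a sketch, not an official example; the exact tokenizer/model API may differ slightly across versions.)

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")
model.eval()

# The model predicts the [MASK] slot from BOTH sides of the context,
# which is why it has no natural left-to-right generation mode.
input_ids = torch.tensor([tokenizer.encode("i had [MASK] for lunch", add_special_tokens=True)])
mask_id = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
mask_pos = (input_ids[0] == mask_id).nonzero()[0].item()

with torch.no_grad():
    logits = model(input_ids)[0]  # (batch, seq_len, vocab)
print(tokenizer.decode([logits[0, mask_pos].argmax().item()]))
```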

Some of the XLM checkpoints were trained using CLM (see xlm-clm-enfr-1024 and xlm-clm-ende-1024), so they should be able to generate coherent sequences of text.
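
A minimal greedy-decoding sketch with one of those CLM checkpoints (again, just a sketch under assumptions: class names follow this library's XLM API, and the CLM checkpoints need language embeddings via the `langs` argument; details may vary by version):

```python
import torch
from transformers import XLMTokenizer, XLMWithLMHeadModel

tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
model.eval()

input_ids = torch.tensor([tokenizer.encode("i had lunch")])
english = tokenizer.lang2id["en"]  # select the English language embedding

for _ in range(20):  # greedily decode 20 tokens
    langs = torch.full_like(input_ids, english)
    with torch.no_grad():
        logits = model(input_ids, langs=langs)[0]  # (batch, seq_len, vocab)
    next_token = logits[0, -1, :].argmax().view(1, 1)
    input_ids = torch.cat([input_ids, next_token], dim=1)

print(tokenizer.decode(input_ids[0].tolist()))
```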

Unfortunately, if you're aiming for Hindi, you probably won't be able to fine-tune any of these models for it. To the best of my knowledge, fine-tuning a model that was trained on one specific language to a different language does not yield good results.

Some efforts have been made to train models from scratch in other languages: see deepset's German BERT or [Morizeyao's Chinese GPT-2](https://github.com/Morizeyao/GPT2-Chinese); maybe these could guide you.

Hope that helps.

nikhilno1 commented 5 years ago

Thank you Lysandre for the links. I'll check them out.

So if I understand correctly, I'd need an xlm-clm-enhi-1024 model to use for the Hindi language. Is that right? I suppose these checkpoints were created by the Hugging Face team. Are there any plans to include other languages (in my case, Hindi), or to share the steps so that we can do it ourselves? That would be a big help. Thanks.

thomwolf commented 5 years ago

Hi @nikhilno1, the checkpoints for XLM were created by the authors of XLM, Guillaume Lample and Alexis Conneau from FAIR.

You should ask on the official XLM repository.

nikhilno1 commented 5 years ago

Oh. When I searched for "xlm-clm-enfr-1024", I only got hits within pytorch-transformers, so I assumed it was created by HF. Thanks, I'll check with the XLM authors.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.