Training on Arabic language

lecidhugo commented 3 years ago

Hello, is there any document or guide on how to train on Arabic? Is this possible? If yes, what are the requirements?

Thanks in advance,

kermitt2 commented 3 years ago

Hello @lecidhugo !

You can create the resources for a new language with https://github.com/kermitt2/grisp (the readme describes the process). It's a Hadoop process that is going to take a few hours.

Once done, you can start an environment for Arabic with entity-fishing; the knowledge base will be built automatically. Then you need to train a ranker and a selector model, as described here: https://nerd.readthedocs.io/en/latest/train.html#training-with-wikipedia

Loading markupFull is the time-consuming step, since this DB stores all the article text content.

You don't need to create embeddings, if I remember correctly; it should work without them. However, they improve the disambiguation a bit. Creating them is also quite time-consuming (it should take about half a day for Arabic, given the number of articles).

There are 1,080,907 articles in Arabic, which is a fairly large number, so it should be doable and give decent results.

lecidhugo commented 3 years ago

Thank you very much for your kind reply! I will try to do it.

kermitt2 commented 2 years ago

Note that Arabic is now supported by default, with already trained models and KB resources available, see the documentation.
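
For illustration only, here is a minimal sketch of how a query for Arabic text might be prepared once the Arabic models are installed. The field names (`text`, `language`, `lang`) and the idea of sending the result as a `query` form field to a disambiguation endpoint are assumptions based on the general shape of the entity-fishing REST API, and the helper names are hypothetical; please check the documentation for the exact format.

```python
import json


def build_disambiguation_query(text, lang="ar"):
    """Build a query dict for a disambiguation request.

    The keys used here ("text", "language", "lang") are assumed,
    not taken from this thread; verify them against the docs.
    """
    return {"text": text, "language": {"lang": lang}}


def encode_query(query):
    """Serialize the query to JSON, keeping Arabic characters readable."""
    return json.dumps(query, ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical example sentence: "Cairo is the capital of Egypt"
    query = build_disambiguation_query("القاهرة هي عاصمة مصر")
    print(encode_query(query))
```

The resulting JSON string would then be posted to the running service. The sketch deliberately stops before the HTTP call, since the host, port, and endpoint path depend on your installation.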

kermitt2 / entity-fishing

Training on Arabic language #115