jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License
2.89k stars · 239 forks

Add hero.infer_lang(s) #3

Open jbesomi opened 4 years ago

jbesomi commented 4 years ago

Add a function hero.infer_lang(s) (a suggestion for a better function name is more than welcome!) that, given a Pandas Series, infers the language of each row.

Implementation

  1. We will probably need to define a helper function _infer_lang inside the outer function that takes a text as input and returns its language. infer_lang will then just apply it to s (see the sketch after this list).
  2. Searching Google for "infer language python" might be a good starting point.
  3. There are probably two approaches: rule-based and model-based. If a rule-based solution achieves high accuracy, we can stick with it, as it's probably faster.
  4. Add it under nlp.py.
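
A minimal sketch of this structure, assuming langdetect as the backend (the actual library is still to be decided; any detector could be swapped into the helper):

```python
import pandas as pd
from langdetect import detect  # assumed backend; not yet decided

def infer_lang(s: pd.Series) -> pd.Series:
    """Return the inferred language code (e.g. 'en') for each row of s."""

    def _infer_lang(text: str) -> str:
        # Detect the language of a single document.
        return detect(text)

    return s.apply(_infer_lang)
```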

Improvement

A more complex solution would return not only the language but also its probability, or better yet a dictionary like this one (or similar):

{
   'en': 0.8, 
   'fr': 0.1, 
   'es': 0.0
}
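
As a hedged sketch: if langdetect were the backend, its detect_langs function already returns candidate languages with probabilities, which could be reshaped into such a dictionary:

```python
from langdetect import detect_langs  # assumed backend, as above

def _infer_lang_scores(text: str) -> dict:
    # Map each candidate language to its estimated probability,
    # e.g. {'en': 0.86, 'fr': 0.14}.
    return {candidate.lang: candidate.prob for candidate in detect_langs(text)}
```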

Expected PR

  1. Motivate the choice of algorithm or external library
  2. Explain how complex it would be to add the "improvement"
  3. Show a concrete example on a large multilingual dataset, or at least give evidence that it works well (for example, citing that under the hood the function uses package X, which achieved Y accuracy on ...)
  4. Meet all other requirements stated in CONTRIBUTING.md

tmankita commented 4 years ago

Hi, I would like to pick up this issue if no one is working on it yet. Just to be sure, the tasks are

jbesomi commented 4 years ago

Hi @tmankita, nice to hear from you! Thank you for your comment; I realized that the issue wasn't well explained. I added some comments; please have a look and let us know if you think you can work on this :)

Opinion: do you think infer_lang or infer_langs is better? Honestly, I don't know which one is best ...

tmankita commented 4 years ago

@jbesomi Sounds great. My game plan is:

jbesomi commented 4 years ago

Dear Tomer, your plan sounds perfect to me! Looking forward to seeing what you will come up with. All right then ... let's go for infer_lang :)

tmankita commented 4 years ago

Dear Jonathan @jbesomi, after doing some research, I present my findings:

I found a StackOverflow answer (https://stackoverflow.com/a/47106810) that summarizes the open-source libraries for language inference, so I ran a performance test on all of them.

Dataset: 21,000 sentences (from http://www.statmt.org/europarl/)
21 languages: English, Bulgarian, Czech, Danish, German, Greek, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, Swedish

| library        | average time | accuracy score |
|----------------|--------------|----------------|
| spaCy          | 0.01438 sec  | 86.181%        |
| langid         | 0.00107 sec  | 86.100%        |
| langdetect     | 0.00606 sec  | 86.308%        |
| fasttext       | 0.02765 sec  | 84.433%        |
| cld3           | 0.00026 sec  | 84.973%        |
| guess_language | 0.00079 sec  | 75.481%        |

Failed:
- TextBlob: server-request based; can't handle multiple requests in parallel.
- polyglot: didn't succeed in installing this library on my local machine (Mac).
- Chardet: failed to detect the language in most of the sentences.

Succeeded but with an issue:
- langdetect: on multiple examples it raised the error "No features in the text."
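
For reference, a minimal sketch of how such a comparison could be run (this is not the original benchmark script; detect_fn stands for any of the libraries above):

```python
import time

def benchmark(detect_fn, sentences, labels):
    # Time a detector over a labelled corpus and report
    # (average seconds per sentence, accuracy).
    start = time.perf_counter()
    predictions = [detect_fn(sentence) for sentence in sentences]
    elapsed = time.perf_counter() - start
    accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
    return elapsed / len(sentences), accuracy
```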

Finally, I suggest using langid (requires Python <= 3.6) or spaCy, which have the best performance for our use case. Regarding the improvement, I would suggest the fastText or cld3 library, because they can compute the K most likely languages of an example (with K as a parameter) together with their probabilities. What do you think about my suggestions?

Dataset: sentences.all.zip
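
A hedged sketch of the cld3 idea, assuming the pycld3 binding and its get_frequent_languages function:

```python
import cld3  # pycld3 binding (assumed)

# Ask for the 3 most likely languages of a text, with probabilities.
for prediction in cld3.get_frequent_languages(
    "Ceci est un exemple en français.", num_langs=3
):
    print(prediction.language, prediction.probability)
```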

tmankita commented 4 years ago

Another option for the improvement: use spaCy together with spacy-langdetect, which I already tried on the dataset. spacy-langdetect wraps the langdetect library and exposes a function that returns a list of probabilities for the matching languages (see the sketch below). @jbesomi, I will start implementing this solution with the improvement; what do you think?
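
A minimal sketch of that approach, assuming the spaCy 2.x pipeline API that spacy-langdetect documents:

```python
import spacy
from spacy_langdetect import LanguageDetector

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)

doc = nlp("This is an English text.")
print(doc._.language)  # e.g. {'language': 'en', 'score': 0.99...}
```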

jbesomi commented 4 years ago

Hey Tomer, wow, amazing job!

Texthero makes extensive use of spaCy under the hood (see for instance the functions under nlp). Looking at the results and your suggestions, I agree with you: spaCy is probably the best choice for our use case.

jbesomi commented 4 years ago

Hey Tomer,

Sorry for my late reply. Another simple alternative might be to use a Dask DataFrame and just s.apply(langdetect). Would you like to test this option too and see how fast it is?

Also, can you please benchmark how sentence length affects speed: if, for instance, we limit the number of words to 100/200/300 (s.str.split().str[:100]), how much faster is it? (A sketch follows below.)
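
A hedged sketch of both ideas combined, assuming langdetect's detect as the per-document function, s as the input Series, and an arbitrary partition count:

```python
import dask.dataframe as dd
from langdetect import detect  # assumed backend, as above

# Cap each document at its first 100 words, then detect in parallel.
truncated = s.str.split().str[:100].str.join(" ")
ds = dd.from_pandas(truncated, npartitions=4)
langs = ds.apply(detect, meta=("lang", "object")).compute()
```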

tmankita commented 4 years ago

Hi Jonathan,

Good to hear from you. Regarding dask.dataframe: in our current implementation we use a pandas.Series as input, not a pandas.DataFrame. Do you think a DataFrame is better for our needs? How would it speed up the process?

jbesomi commented 4 years ago

Hi Tomer, you are right: we probably want to keep accepting a Series as input. If Dask does not allow working with a Series, we can, under the hood, just transform the Series into a DataFrame (with a single column), call Dask's apply on the DataFrame, and at the end transform it back to a Series. Going back and forth between Series and DataFrame should not be too expensive, I guess (but this needs to be tested).
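
A hedged sketch of that round trip (infer_lang_dask is a hypothetical name; langdetect's detect stands in for the detector):

```python
import dask.dataframe as dd
import pandas as pd
from langdetect import detect  # assumed backend, as above

def infer_lang_dask(s: pd.Series, npartitions: int = 4) -> pd.Series:
    # Hypothetical helper: wrap the Series in a one-column DataFrame,
    # run the detection through Dask, and convert back to a Series.
    ddf = dd.from_pandas(s.to_frame(name="text"), npartitions=npartitions)
    langs = ddf["text"].apply(detect, meta=("text", "object")).compute()
    return langs.rename(None)  # plain Series with the original index
```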