jbesomi opened this issue 4 years ago
Hi, I would like to pick up this issue if no one is working on it yet. Just to be sure, the tasks are:
Hi @tmankita, nice to hear from you! Thank you for your comment; I realized that the issue wasn't well explained. I added some comments; please have a look and tell us if you think you can work on this :)
`texthero.nlp.infer_lang`

Opinion: do you think it's better `infer_lang` or `infer_langs`? Honestly, I do not know which one is best ...
@jbesomi Sounds great. My game plan is:
We will choose the best algorithm based on my results and create a PR for it. What do you think about the plan?
Regarding the function name, `infer_lang` sounds more intuitive to me.
Dear Tomer, your plan sounds perfect to me! Looking forward to seeing what you will come up with.
All right then ... let's go for `infer_lang` :)
(Edited) Dear Jonathan (@jbesomi), after doing some research, I present my findings:
I found a StackOverflow answer (https://stackoverflow.com/a/47106810) that summarizes the open-source libraries for language inference, so I ran a performance test on all of them.
Dataset (from http://www.statmt.org/europarl/):
- 21,000 sentences
- 21 languages: English, Bulgarian, Czech, Danish, German, Greek, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, Swedish
| library | average time | accuracy score |
|----------------|--------------|----------------|
| spaCy | 0.01438 sec | 86.181% |
| langid | 0.00107 sec | 86.100% |
| langdetect | 0.00606 sec | 86.308% |
| fasttext | 0.02765 sec | 84.433% |
| cld3 | 0.00026 sec | 84.973% |
| guess_language | 0.00079 sec | 75.481% |
Failed:
- TextBlob: server-request based, cannot handle multiple requests in parallel.
- polyglot: I did not manage to install this library on my local machine (mac).
- chardet: failed to detect the language in most of the sentences.

Succeeded but with an issue:
- langdetect: in multiple examples it raised the error "No features in the text."
Finally, I suggest using the langid library (requires Python <= 3.6) or spaCy, which have the best performance for our uses. Regarding the improvement, I would suggest using the fastText or cld3 library, because they can compute the K most probable languages of an example (with K as a parameter) together with their probabilities. What do you think about my suggestions?
Dataset: sentences.all.zip
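For anyone who wants to reproduce a comparison like the table above, a rough harness could look like the sketch below. This is only my guess at the setup, not the actual benchmark script; langid is shown as one example library, and the sample sentences are placeholders for the Europarl data.

```python
import time

import langid

# Tiny stand-in for the Europarl sentences/labels used in the real benchmark.
sentences = ["This is English.", "Questo è italiano.", "C'est du français."]
labels = ["en", "it", "fr"]


def benchmark(sentences, labels, classify):
    """Return (average seconds per sentence, accuracy) for one detector."""
    correct, elapsed = 0, 0.0
    for text, expected in zip(sentences, labels):
        start = time.perf_counter()
        predicted = classify(text)
        elapsed += time.perf_counter() - start
        correct += predicted == expected
    return elapsed / len(sentences), correct / len(sentences)


# langid.classify returns a (language, score) tuple; keep only the language code.
avg_time, accuracy = benchmark(sentences, labels, lambda t: langid.classify(t)[0])
```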
Another solution for the improvement: use spaCy together with spacy-langdetect, which I also tried on the dataset. spacy-langdetect extends the langdetect library and has a function that returns a list of probabilities for the matching languages. @jbesomi I will start implementing this solution with the improvement, what do you think?
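For context, the probability list mentioned above ultimately comes from langdetect's `detect_langs`, which spacy-langdetect exposes through a spaCy pipeline component; a minimal standalone call looks roughly like this:

```python
from langdetect import DetectorFactory, detect_langs

# langdetect is non-deterministic unless the seed is fixed.
DetectorFactory.seed = 0

# Returns the candidate languages sorted by probability,
# e.g. [en:0.71, fr:0.28] (the exact numbers vary with the text).
print(detect_langs("This sentence is mostly English avec un peu de français."))
```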
Hey Tomer, wow, amazing job!
Texthero makes extensive use of spaCy under the hood (see for instance the functions under `nlp`). Looking at the results and your suggestions, I agree with you that spaCy is probably the best choice for our use case.
Hey Tomer,
Sorry for my late reply. Another simple alternative might be to use a Dask DataFrame and just `s.apply(langdetect)`. Would you like to test this option too and see how fast it is?
Also, can you please benchmark the impact of sentence length? If, for instance, we limit the number of words to 100/200/300 (`s.str.split().str[:100]`), how much faster is it?
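One possible way to run that length benchmark, assuming langdetect as the detector (the helper name and the word limits below are just placeholders):

```python
import time

import pandas as pd
from langdetect import detect


def time_detection(s: pd.Series, max_words: int) -> float:
    """Time detection after keeping only the first `max_words` words of each row."""
    truncated = s.str.split().str[:max_words].str.join(" ")
    start = time.perf_counter()
    truncated.apply(detect)
    return time.perf_counter() - start


# for max_words in (100, 200, 300):
#     print(max_words, time_detection(s, max_words))
```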
Hi Jonathan,
Good to hear from you. Regarding the Dask DataFrame: in our current implementation we use a `pandas.Series` as input, not a `pandas.DataFrame`. Do you think a DataFrame is better for our needs? How does it speed up the process?
Hi Tomer, you are right: we probably want to keep accepting a `Series` as input. If Dask does not allow working with a Series, we can, under the hood, just transform the Series into a DataFrame (with a single column), apply the Dask `apply` function on the DataFrame, and at the end transform it back into a Series. Going back and forth between Series and DataFrame should not be too expensive, I guess (but it needs to be tested).
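A rough sketch of that Series-to-one-column-DataFrame-and-back round trip with Dask, assuming langdetect as the row-level detector; as noted above, the actual speed-up still needs to be measured:

```python
import dask.dataframe as dd
import pandas as pd
from langdetect import detect


def infer_lang_dask(s: pd.Series, npartitions: int = 4) -> pd.Series:
    # Wrap the Series in a single-column DataFrame so Dask can partition it.
    ddf = dd.from_pandas(s.to_frame(name="text"), npartitions=npartitions)
    # `meta` tells Dask the name and dtype of the resulting column.
    detected = ddf["text"].apply(detect, meta=("text", "object"))
    # Compute the partitions (in parallel) and return a plain pandas Series.
    return detected.compute()
```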
(Edit)

Add a function `hero.infer_lang(s)` (a suggestion for a better function name is more than welcome!) that, given a Pandas Series, finds the respective language of each row.

Implementation

Define an `_infer_lang` function inside the mother function that takes a text as input and returns the lang of the text. Then `infer_lang` will just `apply` it to `s`. The function belongs in `nlp.py`.
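A minimal sketch of that structure, using langdetect purely for illustration (the final library choice is discussed above):

```python
import pandas as pd
from langdetect import detect


def infer_lang(s: pd.Series) -> pd.Series:
    """Return the inferred language code of each row of the given Series."""

    def _infer_lang(text):
        # langdetect raises an exception on empty or non-linguistic text,
        # so fall back to a missing value instead of failing the whole Series.
        try:
            return detect(text)
        except Exception:
            return pd.NA

    return s.apply(_infer_lang)
```

For example, `infer_lang(pd.Series(["Ciao, come stai?", "Hello world"]))` would return something like `["it", "en"]`.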
Improvement

A more complex solution would return not only the `lang` but also the probability, or even better a dictionary like this one (or similar):
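Purely as an illustration of the shape such a return value could take; the keys and the `return_probabilities` parameter below are hypothetical, not an agreed API:

```python
>>> hero.infer_lang(s, return_probabilities=True)   # hypothetical signature
0    {"lang": "en", "probabilities": {"en": 0.97, "de": 0.02, "fr": 0.01}}
1    {"lang": "it", "probabilities": {"it": 0.99, "es": 0.01}}
dtype: object
```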
Expected PR

Label: good first issue