jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License
2.88k stars · 240 forks

Add correct_mistakes(s) #14

Open jbesomi opened 4 years ago

jbesomi commented 4 years ago

Or at least count how many mistakes are in a sentence.

See: https://pypi.org/project/pyenchant/

selimelawwa commented 4 years ago

@jbesomi I have checked the library. We can create a function count_mistakes that returns the number of mistakes per sentence.

For correcting mistakes, the library has a method suggest(word) that returns a list of suggestions for the given word. We could have a method correct_mistakes that by default picks the first word in the suggestions and replaces the incorrect word with it. Do you have another suggestion for this?

jbesomi commented 4 years ago

Good idea. *returns the number of mistakes per pandas Series row.

selimelawwa commented 4 years ago

OK, but what about correct_mistakes?

jbesomi commented 4 years ago

What you proposed is fine. One thing, though: before committing to pyenchant, it would be great to select two or three similar packages, test and rank them, and only then implement count_mistakes and correct_mistakes.

selimelawwa commented 4 years ago

Hi, I checked and these are the alternative options:

These sources claim SymSpell should be the best in terms of performance (time):

With SymSpell we can implement automatic_correct_mistakes, but it will be a bit more complicated than with PyEnchant.

Please check and let me know your opinion.

jbesomi commented 4 years ago

Great. Neither source cites or benchmarks pyenchant, though. We should probably test both pyenchant and symspellpy ourselves, both for quality of results and for execution time, and pick the best. In the end, we might decide to include both and let the user decide. In that case, we would still need a benchmark to understand which one works best in which situation. What's your opinion, Selim?

selimelawwa commented 4 years ago

Sorry for the late reply; we had holidays here in Egypt after Ramadan. Yes, I think we should test both too, so we can determine ourselves which is better and for which use case. However, how do you suggest testing the quality of results on large data? I will start on this tomorrow and keep you updated.

jbesomi commented 4 years ago

No problem; thank you for your help! For the performance comparison, just pick a large NLP dataset and compare the execution time. For quality, I guess you need to look at the results yourself and decide.
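The timing comparison could follow a simple pattern like this, running each candidate's correction function over the same dataset; the two corrector functions here are hypothetical stand-ins for the real pyenchant and symspellpy implementations:

```python
import time

def benchmark(correct_fn, sentences):
    """Time one candidate correction function over a list of sentences."""
    start = time.perf_counter()
    results = [correct_fn(s) for s in sentences]
    elapsed = time.perf_counter() - start
    return elapsed, results

# Hypothetical stand-ins; swap in the actual implementations when testing.
def pyenchant_correct(text):
    return text  # placeholder

def symspell_correct(text):
    return text  # placeholder

sentences = ["a row from some large NLP dataset"] * 1000
for name, fn in [("pyenchant", pyenchant_correct),
                 ("symspellpy", symspell_correct)]:
    elapsed, _ = benchmark(fn, sentences)
    print(f"{name}: {elapsed:.3f}s for {len(sentences)} rows")
```

Quality, as noted, is harder to automate; sampling and manually inspecting a few hundred corrected rows per library is a reasonable start.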