hellohaptik / spello

Fast and accurate spell correction library
MIT License

Ngrams #13

Open murtuzamdahod opened 3 years ago

murtuzamdahod commented 3 years ago

Does this model take care of n-grams like "hot dog" = "hotdog", "ice cream" = "icecream"?

I have these ngrams in my training data

Also, what if I want to remove words which are not corrected and are not even in my vocabulary? For example:

IN : "Cheese hot dog abcd" OUT: "Cheese hotdog"

chiragjn commented 3 years ago

Unfortunately, no. In its current state, spello operates on unigrams and can't perform any n-gram normalisation. We would gladly accept such an enhancement.

The latter part should be fairly easy: with some modifications you can get the internal vocabulary of the model and then exclude out-of-vocabulary words with a simple filter. spello does not do this by default because it might end up deleting important context (domain-specific jargon, etc.).
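A minimal sketch of the kind of filter described above. It does not call spello: `vocab` is assumed to be a plain Python set that you have already extracted from the trained model (how to get it depends on spello's internals), and `drop_out_of_vocab` is a hypothetical helper name, not part of the library.

```python
def drop_out_of_vocab(text, vocab):
    """Keep only the whitespace-separated tokens whose lowercase form is in `vocab`.

    `vocab` is an assumed set of known words, e.g. extracted from the trained model.
    """
    tokens = text.split()
    kept = [tok for tok in tokens if tok.lower() in vocab]
    return " ".join(kept)


# Hypothetical vocabulary for illustration only.
vocab = {"cheese", "hotdog"}

# Run this on the already-corrected text, so only unknown leftovers are dropped.
filtered = drop_out_of_vocab("Cheese hotdog abcd", vocab)
print(filtered)
```

Since this deletes anything the vocabulary does not know, it will also drop legitimate domain terms that were missing from the training data, which is the risk mentioned above.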

murtuzamdahod commented 3 years ago

Thank you for your response. Then maybe I can build an ngram model on top of it.

The major issue is that words like "panii puri" should be corrected to "pani puri" / "panipuri" based on the context, but spello gives me "panini puri". I have trained spello on my dataset of around 20 lakh (2 million) rows, of which 3 lakh (300,000) are unique.
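To illustrate what "based on the context" could mean here, below is a rough sketch that picks between correction candidates using bigram counts from the training text. This is not how spello works today. The candidate list, the toy training sentences, and the helper names `bigram_counts` and `pick_by_context` are all assumptions made for the example.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent lowercase word pairs across the training sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def pick_by_context(candidates, next_word, counts):
    """Return the candidate seen most often directly before `next_word`.

    Ties (including all-zero counts) resolve to the earliest candidate.
    """
    return max(candidates, key=lambda cand: counts[(cand, next_word)])


# Toy training data, for illustration only.
training = [
    "pani puri with extra chutney",
    "one plate pani puri",
    "grilled panini sandwich",
]
counts = bigram_counts(training)

# Assumed candidates for the misspelling "panii", in the order a unigram model might rank them.
best = pick_by_context(["panini", "pani"], "puri", counts)
print(best)
```

With this toy data, ("pani", "puri") occurs twice and ("panini", "puri") never, so the context prefers "pani" even though "panini" is listed first.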

chiragjn commented 3 years ago

Interesting. Two questions:

because at the moment spello does not attempt to correct a word if it is in the vocabulary of the trained model. That is also one of the enhancements which would be welcome (see https://github.com/hellohaptik/spello#future-scope--limitations). Fixing grammatical mistakes and replacing legit words with contextually sensible words would definitely require more intelligence.

If panii does not occur in your training set, then that is surely a bug and we would like to fix it. If possible, could you share just the sentences that contain panii, pani, or puri so that we can try to reproduce it?

murtuzamdahod commented 3 years ago

"panii" should not be in my vocabulary, because it only gets corrected to "panini" when it is out of vocabulary. But I am sure I don't have "panini puri" in my dataset :P So, going by the context, "panii puri" should become "pani puri".

Earlier, I was just using regexp to map the ngrams that I require manually. Now, I will need to look into building an n-gram model and see if it works well.
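As a possible starting point between the manual regexp mapping and a full n-gram model, here is a small sketch that joins two adjacent tokens whenever their concatenation is a known compound. The `compounds` set and the `merge_compounds` name are hypothetical. In practice the set might be built from words in the training vocabulary that also appear split in two.

```python
def merge_compounds(text, compounds):
    """Join adjacent token pairs whose lowercase concatenation is in `compounds`."""
    tokens = text.split()
    merged = []
    i = 0
    while i < len(tokens):
        # Look one token ahead and check whether the pair forms a known compound.
        if i + 1 < len(tokens) and (tokens[i] + tokens[i + 1]).lower() in compounds:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # both tokens were consumed by the merge
        else:
            merged.append(tokens[i])
            i += 1
    return " ".join(merged)


# Hypothetical compound list for illustration only.
compounds = {"hotdog", "icecream", "panipuri"}

normalised = merge_compounds("Cheese hot dog abcd", compounds)
print(normalised)
```

This only handles bigrams and scans left to right, so longer compounds or overlapping matches would need extra handling.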