hellohaptik / spello

Fast and accurate spell correction library

MIT License

74 stars, 20 forks

Ngrams #13

Open murtuzamdahod opened 3 years ago

murtuzamdahod commented 3 years ago

Does this model take care of n-grams like "hot dog" = "hotdog", "ice cream" = "icecream"?

I have these n-grams in my training data.

Also, what if I want to remove words that are neither corrected nor in my vocabulary? For example:

IN : "Cheese hot dog abcd" OUT: "Cheese hotdog"

chiragjn commented 3 years ago

Unfortunately, no. In its current state, spello operates on unigrams and can't perform any n-gram normalisation. We would gladly accept such an enhancement.

The latter part should be somewhat easy: with some modifications you can get the internal vocabulary of the model and then exclude out-of-vocabulary words with a simple filter. spello does not do this by default because it might end up deleting important context (domain-specific jargon, etc.).
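A minimal sketch of such a filter, assuming the vocabulary has already been extracted into a plain Python set. How to get it out of a trained spello model is not shown here; the `vocabulary` set and the `drop_unknown_words` helper are hypothetical names used only for illustration.

```python
def drop_unknown_words(text, vocabulary):
    """Keep only the whitespace-separated tokens found in `vocabulary`.

    `vocabulary` is assumed to be a set of lowercase words; the lookup is
    case-insensitive, but the original casing of kept words is preserved.
    """
    kept = [word for word in text.split() if word.lower() in vocabulary]
    return " ".join(kept)


# Hypothetical vocabulary, standing in for the one extracted from the model.
vocabulary = {"cheese", "hotdog"}

# Run this on the already-corrected text so that fixable typos are not dropped.
print(drop_unknown_words("Cheese hotdog abcd", vocabulary))  # Cheese hotdog
```

Applying the filter after correction rather than before matters: a misspelled word is also out of vocabulary, so filtering first would delete words that could have been fixed.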

murtuzamdahod commented 3 years ago

Thank you for your response. Then maybe I can build an ngram model on top of it.

The major issue is that words like "panii puri" should be corrected to "pani puri" / "panipuri" as per the context. But it gives me "panini puri". I have trained spello on my dataset of around 20 lakh rows (3 lakh unique).

chiragjn commented 3 years ago

Interesting. Two questions:

1. Does panii occur in your train set? If yes, what is the count?
2. Does pani puri occur at least once in your train set?

I ask because, at the moment, spello does not attempt to correct a word if it is in the vocabulary of the trained model. Handling that case is also one of the enhancements that would be welcome (see https://github.com/hellohaptik/spello#future-scope--limitations). Fixing grammatical mistakes and replacing legitimate words with contextually sensible ones would definitely require more intelligence.

If panii does not occur in your training set, then that is surely a bug and we would like to fix it. If possible, could you share just the sentences that contain panii, pani, or puri so we can try to reproduce it?
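The in-vocabulary rule can be illustrated with a toy corrector. This is not spello's implementation: it uses `difflib.get_close_matches` from the Python standard library as a stand-in for the real candidate search, and `correct_unknown_only` and the sample vocabularies are hypothetical.

```python
import difflib


def correct_unknown_only(text, vocabulary):
    """Toy unigram corrector: words already in `vocabulary` are left alone,
    every other word is replaced by its closest vocabulary entry, if any."""
    candidates = sorted(vocabulary)  # stable order for difflib
    corrected = []
    for word in text.split():
        if word in vocabulary:
            # In-vocabulary words are never touched, even if they are typos
            # that made it into the training data.
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, candidates, n=1)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


# If the typo itself was seen in training, it is treated as a valid word.
print(correct_unknown_only("panii puri", {"panii", "pani", "panini", "puri"}))

# If it was not, each word is corrected in isolation, with no look at "puri".
print(correct_unknown_only("panii puri", {"pani", "panini", "puri"}))
```

Because each word is scored on its own, nothing in this scheme can prefer "pani" over "panini" on the grounds that "puri" follows; that would need bigram or other context information.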

murtuzamdahod commented 3 years ago

"panii" should not be in my vocabulary, because only then would it get corrected to "panini". But I am sure I don't have "panini puri" in my dataset :P So as per the context, "panii puri" should be "pani puri".

Earlier, I was just using regular expressions to manually map the n-grams I need. Now I will need to look into building an n-gram model and see if it works well.
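For reference, a minimal sketch of that kind of manual regex mapping, which could run as a post-processing step on the corrected text. The `NGRAM_MAP` table and the `normalise_ngrams` helper are hypothetical; the entries are just the examples from this thread.

```python
import re

# Hand-maintained mapping from multi-word forms to their joined forms.
NGRAM_MAP = {
    "hot dog": "hotdog",
    "ice cream": "icecream",
    "pani puri": "panipuri",
}


def _phrase_pattern(phrase):
    # Allow any run of whitespace between the words of a phrase.
    return r"\s+".join(re.escape(word) for word in phrase.split())


# Longest phrases first so that a longer n-gram wins over a shorter prefix.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(_phrase_pattern(p) for p in sorted(NGRAM_MAP, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def normalise_ngrams(text):
    """Replace every known n-gram in `text` with its mapped form."""

    def _replace(match):
        # Collapse case and spacing so the match can be looked up in the map.
        key = " ".join(match.group(0).lower().split())
        return NGRAM_MAP[key]

    return _PATTERN.sub(_replace, text)


print(normalise_ngrams("Cheese hot dog and ice  cream"))  # Cheese hotdog and icecream
```

This only covers n-grams listed by hand; picking between "pani puri" and "panini puri" from context would still need counts of word pairs from the training data.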