Open murtuzamdahod opened 3 years ago
Unfortunately, no. In its current state, spello operates on unigrams and can't perform any n-gram normalisation. We would gladly accept such an enhancement.
The latter part should be fairly easy: with some modifications you can get the internal vocabulary of the model and then exclude words with a simple filter. spello does not do this by default because it might end up deleting important context (domain-specific jargon, etc.).
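A minimal sketch of such a filter, assuming the vocabulary has already been extracted into a plain Python set. The `vocabulary` set and the `drop_unknown_words` helper are hypothetical names for illustration and are not part of spello's API; how to pull the vocabulary out of a trained model is not shown here.

```python
def drop_unknown_words(text, vocabulary):
    """Keep only the tokens whose lowercase form is in `vocabulary`."""
    kept = [tok for tok in text.split() if tok.lower() in vocabulary]
    return " ".join(kept)


# Hypothetical vocabulary; in practice it would come from the trained model.
vocabulary = {"cheese", "hotdog", "hot", "dog"}

print(drop_unknown_words("Cheese hotdog abcd", vocabulary))  # Cheese hotdog
```

The filter would run after spell correction, so that misspellings get a chance to be fixed before anything is dropped.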
Thank you for your response. Then maybe I can build an n-gram model on top of it.
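One way such a layer might look is a bigram re-ranker that chooses between candidate corrections using the following word. This is only a sketch under stated assumptions: the candidate list is assumed to come from some unigram speller, the corpus is a toy example, and `train_bigrams` and `pick_by_context` are hypothetical helpers, not spello functions.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count adjacent lowercase word pairs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def pick_by_context(candidates, next_word, bigram_counts):
    """Return the candidate seen most often before `next_word`.

    On a tie, the earlier candidate in the list wins.
    """
    return max(candidates, key=lambda c: bigram_counts[(c, next_word)])


# Toy corpus standing in for the real training data.
corpus = ["pani puri", "spicy pani puri", "veg panini", "cheese panini sandwich"]
bigram_counts = train_bigrams(corpus)

# Assumed candidates for the misspelling "panii".
print(pick_by_context(["panini", "pani"], "puri", bigram_counts))  # pani
```

Because ties fall back to the first candidate, passing the speller's own ranking as the list order keeps its choice whenever the bigram counts give no signal.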
The major issue is that words like "panii puri" should be corrected to "pani puri" / "panipuri" as per the context. But it gives me "panini puri". I have trained spello on my dataset of around 20 lakh rows (3 lakh unique).
Interesting. Two questions:
1. Does "panii" occur in your train set? If yes, what is the count?
2. Does "pani puri" occur at least once in your train set?
I ask because at the moment spello does not attempt to correct a word if it is in the vocabulary of the trained model. That is also one of the enhancements which would be welcome: https://github.com/hellohaptik/spello#future-scope--limitations Fixing grammatical mistakes and replacing legitimate words with contextually sensible words would definitely require more intelligence.
If "panii" does not occur in your training set, then that is surely a bug and we would like to fix it.
If possible, maybe you can provide us with only the sentences that contain "panii", "pani", or "puri" so we can try reproducing it.
"panii " should not be there in my vocabulary because then only it gets corrected to "panini". But I am sure I don't have "panini puri" in my dataset :P So as per the context, "panii puri" should be "pani puri".
Earlier, I was just using regular expressions to manually map the n-grams I need. Now I will need to look into building an n-gram model and see if it works well.
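For reference, the manual regex mapping described above could be sketched roughly as follows. The `PHRASE_MAP` entries and the `normalise_phrases` helper are illustrative assumptions, not the actual mapping used on this dataset.

```python
import re

# Hypothetical phrase-level corrections; extend with the n-grams you need.
PHRASE_MAP = {
    "hot dog": "hotdog",
    "ice cream": "icecream",
    "panii puri": "pani puri",
}

# Longest phrases first, so a longer entry wins over a shorter one
# that matches at the same position.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(p) for p in sorted(PHRASE_MAP, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def normalise_phrases(text):
    """Replace every mapped phrase with its normalised form."""
    return _PATTERN.sub(lambda m: PHRASE_MAP[m.group(1).lower()], text)


print(normalise_phrases("Cheese hot dog"))  # Cheese hotdog
```

The replacement text comes straight from the map, so the original casing of a matched phrase is not preserved, and phrases separated by more than one space are not matched.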
Does this model take care of n-grams like "hot dog" = "hotdog", "ice cream" = "icecream"?
I have these n-grams in my training data.
Also, what if I want to remove words that are not corrected and are not in my vocabulary either? For example:
IN : "Cheese hot dog abcd" OUT: "Cheese hotdog"