check with wordlist to ignore certain suggestions

Chirag-v09 commented 2 years ago

Is there a way to provide the language tool with a list of words that should NOT be marked as mistakes? I have a lot of technical terms in my data that are wrongly corrected when automatically applying the suggestions of the language tool.

jxmorris12 commented 2 years ago

@Chirag-v09 yes! This is totally doable. You should be able to follow my guidance from the readme under "apply a custom list of matches":

>>> import language_tool_python
>>> s = "Department of medicine Colombia University closed on August 1 Milinda Samuelli"
>>> # Matches for which is_bad_rule returns True are dropped below
>>> is_bad_rule = lambda rule: rule.message == 'Possible spelling mistake found.' and len(rule.replacements) and rule.replacements[0][0].isupper()
>>> tool = language_tool_python.LanguageTool('en-US')
>>> matches = tool.check(s)
>>> # Filtering the matches is what solves your problem. With your own predicate
>>> # is_good_rule, the next line could instead read:
>>> #   matches = [m for m in matches if is_good_rule(m)]
>>> matches = [rule for rule in matches if not is_bad_rule(rule)]
>>> language_tool_python.utils.correct(s, matches)
'Department of medicine Colombia University closed on August 1 Melinda Sam'

The code above filters the matches with a predicate before applying them. You would write your own function, is_good_rule, that returns True only for the suggestions you want applied to the text. So you could implement is_good_rule to return False for the matches that wrongly correct your technical terms. Does that make sense?
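To make the wordlist idea concrete, here is a minimal sketch of one way such a predicate could be built. It assumes each match exposes `offset` and `errorLength` attributes pointing into the checked text (attribute names may differ between library versions). The helpers `make_is_good_rule` and `stand_in_match` are hypothetical names, and the stand-in objects are used only so the snippet runs without a LanguageTool server.

```python
from types import SimpleNamespace


def make_is_good_rule(text, ignored_words):
    """Return a predicate that rejects matches whose flagged text is in ignored_words."""
    ignored = {word.lower() for word in ignored_words}

    def is_good_rule(match):
        # Assumption: each match exposes `offset` and `errorLength` into `text`.
        flagged = text[match.offset:match.offset + match.errorLength]
        return flagged.lower() not in ignored

    return is_good_rule


def stand_in_match(text, word):
    # Hypothetical stand-in for a match that tool.check(text) would return.
    return SimpleNamespace(offset=text.index(word), errorLength=len(word))


text = "We tuned the kubelet and the etcd clustr"
matches = [stand_in_match(text, w) for w in ("kubelet", "etcd", "clustr")]

is_good_rule = make_is_good_rule(text, ignored_words=["kubelet", "etcd"])
kept = [m for m in matches if is_good_rule(m)]
kept_words = [text[m.offset:m.offset + m.errorLength] for m in kept]
print(kept_words)  # ['clustr']
```

With a real tool, `matches` would come from `tool.check(text)`, and `kept` would be passed to `language_tool_python.utils.correct(text, kept)`.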

Chirag-v09 commented 2 years ago

Can you define the is_good_rule function? So that I can get more understanding of it.

jxmorris12 commented 2 years ago

It's a function you would write that takes in a rule and returns True if you want to apply it to the text and False otherwise. Here's an example that only accepts spelling mistakes whose first suggested replacement starts with an uppercase letter:

>>> s = "Department of medicine Colombia University closed on August 1 Milinda Samuelli"
>>> is_good_rule = lambda rule: rule.message == 'Possible spelling mistake found.' and len(rule.replacements) and rule.replacements[0][0].isupper()
>>> import language_tool_python
>>> tool = language_tool_python.LanguageTool('en-US')
>>> matches = tool.check(s)
>>> matches = [rule for rule in matches if is_good_rule(rule)]
>>> language_tool_python.utils.correct(s, matches)
'Department of medicine Colombia University closed on August 1 Melinda Sam'

Chirag-v09 commented 2 years ago

Hey, thanks for the update, but I need more clarification. For example:

s = "Hello! Department of medicine Colombiya Universitii"

Here I know "Colombiya" and "Universitii" are both misspelled. However, I don't want "Colombiya" to be flagged as a spelling mistake (it should either be added to the dictionary, or spelling mistakes should be ignored for this word). I still want "Universitii" to be flagged.

jxmorris12 commented 2 years ago

This is just an example, @Chirag-v09. is_good_rule is any function that takes a rule and returns True or False. So you just need to write a function that expresses the filtering rule you want: which rules should be dropped, and which should be applied. If you want more help, you'll have to give me more detail on your problem setup.
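For the "Colombiya" / "Universitii" case, a sketch of one possible predicate is shown below. It suppresses only spelling matches on allowlisted words and lets everything else through. The `ALLOWLIST` set and the `stand_in_match` helper are hypothetical, the `offset` and `errorLength` attribute names are an assumption about the match objects, and the suggested replacements are made up for illustration rather than taken from real LanguageTool output.

```python
from types import SimpleNamespace

# Message string used by the spelling rule in the examples earlier in this thread.
SPELLING_MESSAGE = "Possible spelling mistake found."
ALLOWLIST = {"Colombiya"}  # hypothetical: words that should never be "corrected"


def is_good_rule(match, text):
    # Assumption: `offset` and `errorLength` locate the flagged span in `text`.
    flagged = text[match.offset:match.offset + match.errorLength]
    is_spelling = match.message == SPELLING_MESSAGE
    # Drop only spelling matches on allowlisted words; keep all other matches.
    return not (is_spelling and flagged in ALLOWLIST)


s = "Hello! Department of medicine Colombiya Universitii"


def stand_in_match(word, suggestion):
    # Hypothetical stand-in for a match that tool.check(s) would return.
    return SimpleNamespace(
        offset=s.index(word),
        errorLength=len(word),
        message=SPELLING_MESSAGE,
        replacements=[suggestion],
    )


matches = [
    stand_in_match("Colombiya", "Colombia"),
    stand_in_match("Universitii", "University"),
]
kept = [m for m in matches if is_good_rule(m, s)]
kept_suggestions = [m.replacements[0] for m in kept]
print(kept_suggestions)  # ['University']
```

Against a real tool, the same filter would sit between `tool.check(s)` and `language_tool_python.utils.correct(s, kept)`, so "Universitii" is still corrected while "Colombiya" is left alone.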

jxmorris12 / language_tool_python

check with wordlist to ignore certain suggestions #64