Closed: fahadshery closed this issue 5 years ago
You initialized `SymSpell` with `max_edit_distance_dictionary=0`, which means it only looks for exact matches.

In your `lookup("streama", 2)`, the `2` is read as the `verbosity` argument instead of `max_edit_distance`, which I assume is what you were trying to set.
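To make the argument mix-up concrete, here is a minimal sketch with a stand-in `lookup` function. It is not symspellpy itself: the stand-in and its `Verbosity` enum are hypothetical and only mirror the parameter order discussed above (`phrase`, `verbosity`, then `max_edit_distance`), and the enum values are my assumption about the library.

```python
from enum import Enum


class Verbosity(Enum):
    # Assumed to mirror symspellpy's Verbosity members; stand-in only.
    TOP = 0
    CLOSEST = 1
    ALL = 2


def lookup(phrase, verbosity, max_edit_distance=None):
    """Stand-in that only reports how its arguments were bound."""
    return {
        "phrase": phrase,
        "verbosity": verbosity,
        "max_edit_distance": max_edit_distance,
    }


# The second positional argument lands in `verbosity`, not `max_edit_distance`.
positional = lookup("streama", 2)

# Passing the distance by keyword removes the ambiguity.
explicit = lookup("streama", Verbosity.ALL, max_edit_distance=2)

print(positional)
print(explicit)
```

Passing `max_edit_distance` by keyword keeps the call readable and avoids depending on parameter order.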
For the snippet you provided, the following options work:
```python
import os.path
import sys

from symspellpy import SymSpell, Verbosity

initial_capacity = 83000
# maximum edit distance per dictionary precalculation
max_edit_distance_dictionary = 2
prefix_length = 7
sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary, prefix_length)

# replace with the actual location of the dictionary file
dictionary_path = "path/to/frequency_dictionary_en_82_765.txt"
term_index = 0   # column of the term in the dictionary file
count_index = 1  # column of the term frequency in the dictionary file
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

# add custom terms with their counts
sym_spell.create_dictionary_entry("steama", 4)
sym_spell.create_dictionary_entry("steamb", 6)
sym_spell.create_dictionary_entry("steamc", 2)

result = sym_spell.lookup("streama", Verbosity.ALL)
for r in result:
    print(r)
```
Expected output:

```
stream, 1, 38592422
streams, 1, 8882706
steama, 1, 4
steam, 2, 11141309
scream, 2, 4310000
streak, 2, 3268695
strata, 2, 1590600
screams, 2, 1286631
streaks, 2, 731843
steamy, 2, 602955
streamed, 2, 505032
streamer, 2, 432893
streaky, 2, 110522
steams, 2, 87057
strega, 2, 55454
steamb, 2, 6
steamc, 2, 2
```
Or, choose a smaller `max_edit_distance` than what is defined during object creation:
```python
import os.path
import sys

from symspellpy import SymSpell, Verbosity

initial_capacity = 83000
# maximum edit distance per dictionary precalculation
max_edit_distance_dictionary = 2
prefix_length = 7
sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary, prefix_length)

# replace with the actual location of the dictionary file
dictionary_path = "path/to/frequency_dictionary_en_82_765.txt"
term_index = 0   # column of the term in the dictionary file
count_index = 1  # column of the term frequency in the dictionary file
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

# add custom terms with their counts
sym_spell.create_dictionary_entry("steama", 4)
sym_spell.create_dictionary_entry("steamb", 6)
sym_spell.create_dictionary_entry("steamc", 2)

# limit this lookup to edit distance 1, below the dictionary maximum of 2
result = sym_spell.lookup("streama", Verbosity.ALL, max_edit_distance=1)
for r in result:
    print(r)
```
Expected output:

```
stream, 1, 38592422
streams, 1, 8882706
steama, 1, 4
```
Code for correcting words in a text file:
```python
import os.path
import sys

from symspellpy import SymSpell, Verbosity

initial_capacity = 83000
# maximum edit distance per dictionary precalculation
max_edit_distance_dictionary = 2
prefix_length = 7
sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary, prefix_length)

# replace with the actual location of the dictionary file
dictionary_path = "path/to/frequency_dictionary_en_82_765.txt"
term_index = 0   # column of the term in the dictionary file
count_index = 1  # column of the term frequency in the dictionary file
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

# add custom terms with their counts
sym_spell.create_dictionary_entry("steama", 4)
sym_spell.create_dictionary_entry("steamb", 6)
sym_spell.create_dictionary_entry("steamc", 2)

corrected_words = []
# input and output files live next to this script
cwd = os.path.realpath(os.path.dirname(sys.argv[0]))
with open(os.path.join(cwd, "input_words.txt"), "r") as infile:
    for word in infile:
        word = word.rstrip()
        results = sym_spell.lookup(word, Verbosity.TOP)
        if not results:
            # no suggestion found: keep the original word
            corrected_words.append((word, word))
        else:
            corrected_words.append((word, results[0].term))

with open(os.path.join(cwd, "output_words.txt"), "w") as outfile:
    for (original_word, corrected_word) in corrected_words:
        outfile.write("{} {}\n".format(original_word, corrected_word))

# a word with no candidate within the edit distance returns no results
result = sym_spell.lookup("nopossiblereplacement", Verbosity.ALL)
for r in result:
    print(r)
```
Input text file `input_words.txt`, which contains a misspelled word that is correctable, a properly spelled word, and a misspelled word that is uncorrectable:

```
streama
steama
nopossiblereplacement
```
Expected output `output_words.txt`:

```
streama stream
steama steama
nopossiblereplacement nopossiblereplacement
```
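Since the question also asked about creating a new column of corrected text, here is a rough sketch of the same idea for CSV data using only the standard library. The function `add_corrected_column`, the `text` column name, and the `correct_word` stand-in are all hypothetical; in practice `correct_word` would wrap something like `sym_spell.lookup(word, Verbosity.TOP)` as in the script above.

```python
import csv
import io


def add_corrected_column(infile, outfile, column, correct_word):
    """Copy a CSV and append `<column>_corrected`, built word by word."""
    reader = csv.DictReader(infile)
    new_column = column + "_corrected"
    fieldnames = list(reader.fieldnames) + [new_column]
    writer = csv.DictWriter(outfile, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        words = row[column].split()
        row[new_column] = " ".join(correct_word(w) for w in words)
        writer.writerow(row)


# Stand-in corrector for illustration; unknown words are returned unchanged.
known_fixes = {"streama": "stream"}


def correct_word(word):
    return known_fixes.get(word, word)


# In-memory files keep the sketch self-contained; real files work the same way.
source = io.StringIO("id,text\n1,streama steama\n2,nopossiblereplacement\n")
target = io.StringIO()
add_corrected_column(source, target, "text", correct_word)
print(target.getvalue())
```

Keeping the corrector as a plain callable means the CSV handling can be tested without loading a dictionary.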
Hope that helps.
This is super cool. I will give it a try and report back. Thank you so much for taking the time and providing the code. This will make my life easier.
One last thing: how do I prioritise a certain replacement? For example, if the misspelled word is `intnet`, I want it to be replaced by "internet" and not "intent".
Cheers
Since the results are sorted by `edit_distance` first and then by `count` (frequency), you could try building your own dictionary with `count` values that differ from the ones in the dictionary provided with this package.
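As an illustration of that ordering, here is a small pure-Python sketch. It does not call symspellpy: `osa_distance` and `rank` are hypothetical helpers, the counts are made up, and I am assuming suggestions compare by edit distance (with adjacent transpositions counted as one edit) and then by count, descending.

```python
from typing import NamedTuple


class Suggestion(NamedTuple):
    term: str
    distance: int
    count: int


def osa_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute, adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def rank(word, frequencies, max_edit_distance=2):
    """Order candidates by distance (ascending), then count (descending)."""
    candidates = (
        Suggestion(term, osa_distance(word, term), count)
        for term, count in frequencies.items()
    )
    kept = [s for s in candidates if s.distance <= max_edit_distance]
    return sorted(kept, key=lambda s: (s.distance, -s.count))


# Hypothetical counts: "internet" is far more frequent than "intent".
ranked = rank("intnet", {"internet": 1_000_000, "intent": 10})
for suggestion in ranked:
    print(suggestion)

# With equal distances, the higher count comes first.
tied = rank("steamx", {"steamb": 6, "steamc": 2})
```

In this sketch a higher count for `internet` does not move it above `intent`, because distance is compared first and count only breaks ties. It is worth checking the real ordering for your word with `Verbosity.ALL` before relying on counts alone.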
perfect, thanks.
`initial_capacity = 83000`: what is this, and why the magic number 83000?
It was supposed to give a speed-up by initializing the `Dictionary` with approximately the number of words in the provided dictionary, `frequency_dictionary_en_82_765.txt`.

It has no effect in the current Python package yet.
Thanks. I learnt quite a lot in this thread. In fact, I just implemented my spell checker today. Can't thank you enough. Please go ahead and close this.
If possible, we could add the following sections to the home `README.md`:

- Adding new terms: you can add them either in the `frequency_dict.txt` file or with `sym_spell.create_dictionary_entry()`
- Spellchecking in a txt file

This will help people who are just starting with this.
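For the first option, adding terms to the frequency file, a minimal sketch could look like the following. The helper `append_terms` is hypothetical, and it assumes the file holds one space-separated `term count` pair per line, matching the `term_index = 0` and `count_index = 1` settings used earlier in this thread.

```python
import os
import tempfile


def append_terms(dictionary_path, terms):
    """Append `term count` lines to a frequency dictionary file.

    Assumes the existing file already ends with a newline.
    """
    with open(dictionary_path, "a", encoding="utf-8") as handle:
        for term, count in terms.items():
            handle.write(f"{term} {count}\n")


# Demonstrate on a throwaway file rather than the packaged dictionary.
with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "frequency_dict.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("stream 38592422\n")

    append_terms(demo_path, {"steama": 4, "steamb": 6, "steamc": 2})

    with open(demo_path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()

print(lines)
```

Editing a copy of the dictionary keeps the custom terms across runs, whereas `create_dictionary_entry()` only affects the in-memory instance.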
Hi,
I am trying to add new terms by:

```python
initial_capacity = 83000
# maximum edit distance per dictionary precalculation
sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary, prefix_length)
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")
    return
sym_spell.create_dictionary_entry("steama", 4)
sym_spell.create_dictionary_entry("steamb", 6)
sym_spell.create_dictionary_entry("steamc", 2)
result = sym_spell.lookup("streama", 2)
print(result)
```

I am getting an empty `[]`. What am I missing?

Additionally, could you provide skeleton code showing how to feed it a text file so that it creates a new column of corrected text? This would help massively with my text analysis.
Much appreciated