mammothb / symspellpy

Python port of SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
MIT License

adding new terms and typical workflow #24

Closed fahadshery closed 5 years ago

fahadshery commented 5 years ago

Hi,

I am trying to add new terms by:

initial_capacity = 83000
# maximum edit distance per dictionary precalculation
max_edit_distance_dictionary = 0
prefix_length = 7

sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary, prefix_length)
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")
    return

sym_spell.create_dictionary_entry("steama", 4)
sym_spell.create_dictionary_entry("steamb", 6)
sym_spell.create_dictionary_entry("steamc", 2)

result = sym_spell.lookup("streama", 2)
print(result)

I am getting an empty []. What am I missing?

Additionally, could you provide skeleton code showing how to feed it a text file and have it create a new column of corrected text, please? This would help massively with my text analysis.

Much appreciated

mammothb commented 5 years ago

You initialized SymSpell with max_edit_distance_dictionary=0, which means it only looks for exact matches.

In your lookup("streama", 2), the 2 is read as the verbosity argument rather than max_edit_distance, which I assume is what you intended.
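To make the argument mix-up concrete, here is a small sketch with a stand-in function. The `lookup` below is hypothetical and only mirrors the parameter order used in this thread (phrase, verbosity, then max_edit_distance); it does no spell checking, and the enum values are illustrative.

```python
from enum import Enum


class Verbosity(Enum):
    # Stand-in for the Verbosity enum used in this thread; values are illustrative.
    TOP = 0
    CLOSEST = 1
    ALL = 2


def lookup(phrase, verbosity, max_edit_distance=None):
    """Hypothetical stand-in: report how the arguments were bound."""
    return {
        "phrase": phrase,
        "verbosity": verbosity,
        "max_edit_distance": max_edit_distance,
    }


# The second positional argument lands in `verbosity`, not `max_edit_distance`.
positional = lookup("streama", 2)

# Passing the distance by keyword removes the ambiguity.
keyword = lookup("streama", Verbosity.ALL, max_edit_distance=2)

print(positional)
print(keyword)
```

Naming max_edit_distance explicitly in the call avoids this kind of silent misbinding.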

For the snippet you provided, the following options work:

import os.path
import sys
from symspellpy import SymSpell, Verbosity

initial_capacity = 83000
# maximum edit distance per dictionary precalculation
max_edit_distance_dictionary = 2
prefix_length = 7

sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary, prefix_length)

dictionary_path = "path/to/frequency_dictionary_en_82_765.txt"
term_index = 0
count_index = 1
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

sym_spell.create_dictionary_entry("steama", 4)
sym_spell.create_dictionary_entry("steamb", 6)
sym_spell.create_dictionary_entry("steamc", 2)

result = sym_spell.lookup("streama", Verbosity.ALL)
for r in result:
    print(r)

Expected output:

stream, 1, 38592422
streams, 1, 8882706
steama, 1, 4
steam, 2, 11141309
scream, 2, 4310000
streak, 2, 3268695
strata, 2, 1590600
screams, 2, 1286631
streaks, 2, 731843
steamy, 2, 602955
streamed, 2, 505032
streamer, 2, 432893
streaky, 2, 110522
steams, 2, 87057
strega, 2, 55454
steamb, 2, 6
steamc, 2, 2

Or, choose a smaller max_edit_distance than what is defined during object creation:

import os.path
import sys
from symspellpy import SymSpell, Verbosity

initial_capacity = 83000
# maximum edit distance per dictionary precalculation
max_edit_distance_dictionary = 2
prefix_length = 7

sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary, prefix_length)

dictionary_path = "path/to/frequency_dictionary_en_82_765.txt"
term_index = 0
count_index = 1
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

sym_spell.create_dictionary_entry("steama", 4)
sym_spell.create_dictionary_entry("steamb", 6)
sym_spell.create_dictionary_entry("steamc", 2)

result = sym_spell.lookup("streama", Verbosity.ALL, max_edit_distance=1)
for r in result:
    print(r)

Expected output:

stream, 1, 38592422
streams, 1, 8882706
steama, 1, 4

Code for correcting words in a text file:

import os.path
import sys
from symspellpy import SymSpell, Verbosity

initial_capacity = 83000
# maximum edit distance per dictionary precalculation
max_edit_distance_dictionary = 2
prefix_length = 7

sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary, prefix_length)

dictionary_path = "path/to/frequency_dictionary_en_82_765.txt"
term_index = 0
count_index = 1
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

sym_spell.create_dictionary_entry("steama", 4)
sym_spell.create_dictionary_entry("steamb", 6)
sym_spell.create_dictionary_entry("steamc", 2)

corrected_words = []
cwd = os.path.realpath(os.path.dirname(sys.argv[0]))
with open(os.path.join(cwd, "input_words.txt"), "r") as infile:
    for word in infile:
        word = word.rstrip()
        results = sym_spell.lookup(word, Verbosity.TOP)
        if not results:
            corrected_words.append((word, word))
        else:
            corrected_words.append((word, results[0].term))

with open(os.path.join(cwd, "output_words.txt"), "w") as outfile:
    for (original_word, corrected_word) in corrected_words:
        outfile.write("{} {}\n".format(original_word, corrected_word))

result = sym_spell.lookup("nopossiblereplacement", Verbosity.ALL)
for r in result:
    print(r)

Input text file input_words.txt, which contains a misspelled word that is correctable, a word that is already in the dictionary, and a misspelled word that is uncorrectable:

streama
steama
nopossiblereplacement

Expected output in output_words.txt:

streama stream
steama steama
nopossiblereplacement nopossiblereplacement
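If a proper "corrected" column is wanted rather than space-separated pairs, something along these lines might work. This is only a sketch: `make_corrector` and `add_corrected_column` are hypothetical helper names, the CSV header names are arbitrary, and the corrector relies only on `lookup(...)` returning items with a `.term` attribute, as in the snippet above.

```python
import csv


def make_corrector(sym_spell, verbosity):
    """Wrap a SymSpell-like object in a word -> corrected word function."""
    def correct(word):
        results = sym_spell.lookup(word, verbosity)
        # Fall back to the original word when there is no suggestion.
        return results[0].term if results else word
    return correct


def add_corrected_column(in_path, out_path, correct):
    """Write each input line next to its corrected form as a two-column CSV."""
    with open(in_path, "r", encoding="utf-8") as infile, \
            open(out_path, "w", encoding="utf-8", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["original", "corrected"])
        for line in infile:
            word = line.rstrip("\n")
            writer.writerow([word, correct(word)])
```

With the objects from the snippet above, usage would be roughly `add_corrected_column("input_words.txt", "output_words.csv", make_corrector(sym_spell, Verbosity.TOP))`.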

Hope that helps.

fahadshery commented 5 years ago

This is super cool. I will give it a try and report back. Thank you so much for taking the time and providing the code. This will make my life easier.

One last thing: how do I prioritise a certain replacement? For example, if the misspelled word is "intnet", I want it to be replaced by "internet" and not "intent".

Cheers

mammothb commented 5 years ago

Since the results are sorted by edit_distance first and then by count (frequency), you could try building your own dictionary with counts that differ from the ones provided with this package.
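A rough illustration of that ordering, using plain tuples rather than the library: the candidates and counts below are made up, and `rank` is a hypothetical helper that sorts by edit distance ascending, then by count descending.

```python
def rank(candidates):
    """Order (term, edit_distance, count) tuples: nearest first, then most frequent."""
    return sorted(candidates, key=lambda c: (c[1], -c[2]))


# Illustrative candidates for "intnet"; the counts are invented for this sketch.
candidates = [
    ("internet", 2, 1000000),
    ("intent", 1, 50000),
    ("inlet", 2, 10000),
]
ranked = rank(candidates)

# Raising a count only reorders terms that share the same edit distance.
boosted = rank([
    ("internet", 2, 10**9),
    ("intent", 1, 1),
    ("inlet", 2, 10000),
])

print([term for term, _, _ in ranked])
print([term for term, _, _ in boosted])
```

Under this ordering, count acts as a tie-breaker within one edit distance, so it is worth checking the distances with Verbosity.ALL first: a preferred term that is farther from the input than a competitor will not move ahead through a count change alone.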

fahadshery commented 5 years ago

Perfect, thanks. What is initial_capacity = 83000, and why the magic number 83000?

mammothb commented 5 years ago

It is meant to provide some speed-up by initializing the Dictionary with approximately the number of words in the provided dictionary, frequency_dictionary_en_82_765.txt.

It has no effect in the current Python package yet.

fahadshery commented 5 years ago

Thanks. I learnt quite a lot in this thread. In fact, I just implemented my spell checker today. Can't thank you enough. Please go ahead and close this.

If possible, we could add the following sections to the home "README.md":

Adding new terms: you can add them either in the frequency_dict.txt file or with sym_spell.create_dictionary_entry()

Spellchecking in a txt file

This will help people who are just starting with this.
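For the first route (editing the frequency file), a minimal sketch could look like the following. It assumes the file holds one space-separated "term count" pair per line, matching term_index = 0 and count_index = 1 used earlier in the thread; `append_terms` is a hypothetical helper, not part of the package.

```python
def append_terms(dictionary_path, new_terms):
    """Append 'term count' lines to a space-separated frequency dictionary file."""
    with open(dictionary_path, "r", encoding="utf-8") as f:
        content = f.read()
    # Avoid gluing the first new entry onto an unterminated last line.
    needs_newline = bool(content) and not content.endswith("\n")
    with open(dictionary_path, "a", encoding="utf-8") as f:
        if needs_newline:
            f.write("\n")
        for term, count in new_terms.items():
            f.write("{} {}\n".format(term, count))
```

It is probably safer to run this on a copy of the shipped dictionary and then pass that copy's path to load_dictionary.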