Closed peeter-t2 closed 4 years ago
Hi,
I updated the readme file because it was misleading. You need to pass use_freq=False to extract_parallel, then it should work.
If you keep use_freq=True (this is the default), you need to pass a dictionary with corpus frequencies instead of a set, and you can then specify the minimum word frequency to be considered by setting the min_frequency parameter. The seedwords parameter would in this case be {"the": 10000, "and": 2000}; in other words, it is a dictionary where the keys are words and the values are corpus frequencies.
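To illustrate why a set fails where a frequency dictionary works, here is a minimal sketch. It is not natas's actual code: filter_by_frequency is a hypothetical helper that presumably resembles what happens when use_freq=True, and the frequency for "logic" is made up for the example.

```python
# Illustrative sketch only: this is NOT natas's implementation.
# filter_by_frequency is a hypothetical helper that mimics the behaviour
# described above for use_freq=True.

def filter_by_frequency(seed_words, min_frequency):
    """Keep the words whose corpus frequency is at least min_frequency."""
    # seed_words[word] is a lookup by key, so seed_words must be a dict
    # (or another mapping); a set does not support this operation.
    return [word for word in seed_words if seed_words[word] >= min_frequency]


# A dict mapping each word to its corpus frequency (values are invented).
seed_freqs = {"the": 10000, "and": 2000, "logic": 40}
kept = filter_by_frequency(seed_freqs, min_frequency=100)

# A plain set carries no frequencies, so the same lookup raises TypeError.
seed_set = {"logic", "logical"}
try:
    filter_by_frequency(seed_set, min_frequency=100)
    set_error = None
except TypeError as err:
    set_error = str(err)

print(kept)
print(set_error)
```

So, going by the explanation above, the two options are to keep the set and pass use_freq=False, or to switch to a word-to-frequency dictionary and optionally set min_frequency.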
If you like the repo, please give it a star on GitHub 🙂
Hi, that was quick! Problem no more, thanks! Will be experimenting with it more. :)
Running the test examples, it seems to work very well, except there seems to be a problem using a set here. I'm probably just using it wrong, so any advice would be helpful.
seed_words = set(["logic", "logical"])  # list of correctly spelled words you want to find matching OCR errors for
dictionary = wiktionary  # Lemmas of the English Wiktionary, you will need to change this if working with any other language
lemmatize = True  # Uses Spacy with English model, use natas.set_spacy(nlp) for other models and languages
results = ocr_builder.extract_parallel(seed_words, model, dictionary=dictionary, lemmatize=lemmatize)
I get the error: TypeError: 'set' object is not subscriptable
Any idea what might be going on? Thanks!