mikahama / natas

Python 3 library for processing historical English
Apache License 2.0

TypeError: 'set' object is not subscriptable in ocr_builder.extract_parallel() #3

Closed peeter-t2 closed 4 years ago

peeter-t2 commented 4 years ago

Running test examples, it seems to work very well, except there seems to be a problem using a set here. I'm probably just using it wrong, so advice is helpful.

seed_words = set(["logic", "logical"])  # list of correctly spelled words you want to find matching OCR errors for
dictionary = wiktionary  # Lemmas of the English Wiktionary, you will need to change this if working with any other language
lemmatize = True  # Uses Spacy with English model, use natas.set_spacy(nlp) for other models and languages

results = ocr_builder.extract_parallel(seed_words, model, dictionary=dictionary, lemmatize=lemmatize)

I get the error, TypeError: 'set' object is not subscriptable

Any idea what might be going on? Thanks!

mikahama commented 4 years ago

Hi,

I updated the readme file because it was misleading. You need to pass use_freq=False to extract_parallel, then it should work.

If you have use_freq=True (this is the default), you pass a dictionary with corpus frequencies instead of a set, and you can specify the minimum word frequency to be considered by setting the min_frequency parameter. The seed_words parameter would in this case be {"the": 10000, "and": 2000}; in other words, it is a dictionary where the keys are words and the values are corpus frequencies.
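To illustrate why a set triggers this error while a frequency dictionary does not, here is a minimal sketch in plain Python. The helper `filter_by_frequency` is hypothetical and is not natas's actual code. It only assumes that a frequency-based filter looks up each seed word's count by key, which is one plausible way to end up with this TypeError.

```python
# Hypothetical stand-in for a frequency filter; NOT the real natas implementation.
def filter_by_frequency(seed_words, min_frequency=1):
    """Keep the words whose corpus frequency is at least min_frequency."""
    return [word for word in seed_words if seed_words[word] >= min_frequency]

# A set can be iterated but not indexed, so the lookup fails
# with the same kind of error as in the report.
try:
    filter_by_frequency({"logic", "logical"})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# A dict mapping words to corpus frequencies supports the lookup.
frequencies = {"the": 10000, "and": 2000, "logic": 12}
kept = filter_by_frequency(frequencies, min_frequency=100)

print(error_message)
print(kept)
```

With a plain set of seed words, the fix described above is to add use_freq=False to the original extract_parallel call. The dictionary form is only needed when the frequency filter is left on.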

If you like the repo, please give it a star on GitHub 🙂

peeter-t2 commented 4 years ago

Hi, that was quick! Problem no more, thanks! Will be experimenting with it more. :)