james-bowman / nlp

Selected Machine Learning algorithms for natural language processing and semantic analysis in Golang
MIT License

OCR #1

Closed joeblew99 closed 7 years ago

joeblew99 commented 7 years ago

This looks really nice. Thank you for open sourcing it.

I am attempting to do OCR. I can identify all the letters, but then I need to check them against a word list so I can pick up where the OCR has maybe made a mistake.

That way the corrections can then propagate back to the OCR system so it gets better.

There is also no reason why it can't use the semantic meaning of a sentence to correct the OCR. It's kind of one step up from just using single words.
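One way to do the word-list check is plain Levenshtein edit distance: accept the closest dictionary word if it is within a small number of edits, otherwise keep the OCR output. A minimal sketch (the word list and threshold below are made up for illustration):

```go
package main

import "fmt"

// levenshtein returns the edit distance between two words.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min(curr[j-1]+1, min(prev[j]+1, prev[j-1]+cost))
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// correct returns the word-list entry closest to the OCR output, provided it
// is within maxDist edits; otherwise the OCR output is kept unchanged.
func correct(ocrWord string, wordList []string, maxDist int) string {
	best, bestDist := ocrWord, maxDist+1
	for _, w := range wordList {
		if d := levenshtein(ocrWord, w); d < bestDist {
			best, bestDist = w, d
		}
	}
	return best
}

func main() {
	words := []string{"biology", "plant", "light", "material"}
	fmt.Println(correct("bi0logy", words, 2)) // prints "biology"
}
```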

I don't have it up on a git repo yet, but figured it would be interesting to you. If you feel like commenting on this idea, that would be great.

I am also really curious where you get your data sources. For semantic analysis you need training data, right?

james-bowman commented 7 years ago

Sounds like an interesting project! Are you typically using OCR on documents within a specific domain, or is it more generalised? If the former, you might be able to compare the processed documents to those previously processed and look for unusual topics and/or outliers.

Regarding training data, there are three primary models used in LSA that each need fitting: the vocabulary used for vectorisation, the inverse document frequencies used to reduce the weighting of common terms, and the truncated term eigenvectors used as part of SVD.

The model used for SVD is best fitted to the corpus itself (or at least a sample of it) rather than to separate training data, because the model relates directly to the documents within the corpus. The inverse document frequencies and vocabulary could be fitted either to separate training data or to the corpus (or a sample of the corpus if it is extremely large). There are benefits and drawbacks to both. Fitting to the corpus will yield the best fit for the corpus at that point in time; however, it may be overfitted in the sense of not catering well for new documents added after fitting. Conversely, using separate training data may create a more generalised fit that caters fairly well for new documents added after fitting but does not fit the original corpus quite as well, e.g. the vocabulary fitted to training data may be missing words present in the corpus.

Currently, for my use case of looking for similar and related articles on the web, I fit all models directly to the corpus. This does mean that new documents added after fitting may not be perfectly catered for, e.g. terms missing from the corresponding feature vector, weightings not correctly representative of frequency within the corpus, etc. Over time, the models will require refitting at intervals to cater for new documents added to the corpus since the last fitting.
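As a rough sketch, fitting all three models to the corpus with this library looks something like the following, assuming the pipeline API shape shown in the README (NewCountVectoriser, NewTfidfTransformer, NewTruncatedSVD, NewPipeline); constructor arguments may differ between versions:

```go
package main

import (
	"fmt"

	"github.com/james-bowman/nlp"
)

func main() {
	// A tiny stand-in corpus; in practice this would be the document set
	// itself (or a sample of it), as discussed above.
	corpus := []string{
		"the quick brown fox jumped over the lazy dog",
		"the cow jumped over the moon",
		"the little dog laughed to see such fun",
	}

	// The three models that need fitting:
	vectoriser := nlp.NewCountVectoriser()   // vocabulary used for vectorisation
	transformer := nlp.NewTfidfTransformer() // inverse document frequencies
	reducer := nlp.NewTruncatedSVD(2)        // truncated SVD over the term matrix

	pipeline := nlp.NewPipeline(vectoriser, transformer, reducer)

	// Fit all three models to the corpus and project it into the reduced
	// semantic space; each column of lsi represents one document.
	lsi, err := pipeline.FitTransform(corpus...)
	if err != nil {
		fmt.Printf("failed to fit models: %v\n", err)
		return
	}

	rows, cols := lsi.Dims()
	fmt.Printf("semantic space: %d dimensions x %d documents\n", rows, cols)
}
```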

I would love to hear other people's experiences and thoughts in this area.

joeblew99 commented 7 years ago

Hey James (@james-bowman).

In the end, for the OCR I needed a quick solution and so ended up using the Google Cloud Vision API.

Thanks for the considered response. I missed the notification.

To answer your questions:

  1. It's a domain specific to biology. I scooped up papers about biology from the internet.

  2. What am I looking for? I am building a generic matrix of functional techniques that biological things use to solve problems, for example how certain plants generate light, or how they handle lateral loading with the smallest amount of material. For each functional aspect I have modelled in the system I have tradeoffs against other aspects. For example, if a functional trait is lateral strength, a tradeoff might be that it uses lots of energy to make that material structure. I have these in an RDF document, with Protégé as my GUI for now, so it's easy to cognitively visualise. I am building a golang parser for that RDF format at the moment.

So my main NLP challenge is identifying papers that map to my traits and tradeoffs. It's of course an imprecise thing, so I am planning to build a GUI with some scoring aspects so biology experts can help with classification.

Programming architecture in general:

One issue with all of this type of work is that you have a huge amount of data to keep around, and you need a good way to index and search it. So I am building a bleve system based on Minio and NATS, using the golang bleve engine as a general search indexer to help. It will make it much easier to keep track of the huge amount of data. Minio is a golang-based S3 equivalent. NATS is a golang message bus. It's very easy to string it all together. But I need to build a decent GUI for the bleve aspect so researchers / biologists can use it.
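The bleve side is roughly like the sketch below (the Paper struct and index path are made up for illustration; newer bleve releases use the github.com/blevesearch/bleve/v2 import path):

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve"
)

// Paper is a hypothetical document shape for the scraped biology papers.
type Paper struct {
	Title string `json:"title"`
	Body  string `json:"body"`
}

func main() {
	// Create an on-disk index with the default mapping.
	mapping := bleve.NewIndexMapping()
	index, err := bleve.New("papers.bleve", mapping)
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()

	// Index a document; in practice these would arrive via NATS and the raw
	// files would live in Minio.
	err = index.Index("paper-001", Paper{
		Title: "Lateral loading in plant stems",
		Body:  "How stems resist lateral loads with minimal material use.",
	})
	if err != nil {
		log.Fatal(err)
	}

	// Search the index.
	query := bleve.NewMatchQuery("lateral strength")
	request := bleve.NewSearchRequest(query)
	results, err := index.Search(request)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(results)
}
```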

james-bowman commented 7 years ago

Sounds really interesting. NATS is very cool and bleve is also really nice. The challenge with conventional search engines is how they solve the synonymy and polysemy problems, i.e. different words that mean the same thing, or the same word meaning different things. Simply searching for word occurrences can sometimes return false positives (where the search query words appear in the matched document but are used to mean something different) and missed matches (where a relevant document was not matched because it did not contain the search query words). This is where latent semantic indexing can help. For your particular use case, where scientific terms tend to be more precise and specific, this might not be so much of a problem.
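To show how a latent semantic query looks with this library, here is a rough sketch following the pattern in the README (Pipeline.Transform plus pairwise.CosineSimilarity over column vectors); the corpus and query are made up, and exact types and helpers may differ between versions:

```go
package main

import (
	"fmt"

	"github.com/james-bowman/nlp"
	"github.com/james-bowman/nlp/measures/pairwise"
	"gonum.org/v1/gonum/mat"
)

func main() {
	corpus := []string{
		"the small dog barked at the postman",
		"photosynthesis converts light into chemical energy",
		"stems resist lateral loads with minimal material",
	}

	pipeline := nlp.NewPipeline(
		nlp.NewCountVectoriser(),
		nlp.NewTfidfTransformer(),
		nlp.NewTruncatedSVD(2),
	)

	// Fit the models to the corpus and project it into the semantic space.
	lsi, err := pipeline.FitTransform(corpus...)
	if err != nil {
		fmt.Println(err)
		return
	}

	// Project the query into the same space using the already fitted models.
	query, err := pipeline.Transform("light energy in plants")
	if err != nil {
		fmt.Println(err)
		return
	}

	// Find the document most similar to the query by cosine similarity.
	best, bestSim := -1, -1.0
	_, docs := lsi.Dims()
	for i := 0; i < docs; i++ {
		sim := pairwise.CosineSimilarity(
			query.(mat.ColViewer).ColView(0),
			lsi.(mat.ColViewer).ColView(i),
		)
		if sim > bestSim {
			best, bestSim = i, sim
		}
	}
	fmt.Printf("best match: %q (similarity %.2f)\n", corpus[best], bestSim)
}
```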

To check the accuracy of the OCR you could consider using word embeddings. This could help you predict a specific word within a sentence or predict the sentence within which a word might appear. This is an interesting area and one I might try to build out in the NLP library when I get a chance.
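As a rough illustration of the embedding idea (this is not in the library; the tiny hand-made vectors below are purely hypothetical stand-ins for pre-trained embeddings such as word2vec or GloVe), you could pick the OCR candidate whose vector is closest to the combined vector of the surrounding words:

```go
package main

import (
	"fmt"
	"math"
)

// embeddings is a stand-in for pre-trained word vectors; real embeddings
// would have hundreds of dimensions and be loaded from a trained model.
var embeddings = map[string][]float64{
	"plants":   {0.9, 0.1, 0.0},
	"generate": {0.4, 0.5, 0.1},
	"light":    {0.8, 0.2, 0.1},
	"fight":    {0.1, 0.9, 0.2},
}

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}

// bestCandidate picks the candidate whose embedding is closest to the sum of
// the context word embeddings (a crude CBOW-style prediction).
func bestCandidate(context, candidates []string) string {
	ctx := make([]float64, 3) // toy 3-dimensional vectors
	for _, w := range context {
		if v, ok := embeddings[w]; ok {
			for i := range ctx {
				ctx[i] += v[i]
			}
		}
	}
	best, bestSim := "", -2.0
	for _, c := range candidates {
		if v, ok := embeddings[c]; ok {
			if sim := cosine(ctx, v); sim > bestSim {
				best, bestSim = c, sim
			}
		}
	}
	return best
}

func main() {
	// The OCR read "lignt" in "plants generate lignt"; plausible corrections
	// from the word list are "light" and "fight".
	fmt.Println(bestCandidate([]string{"plants", "generate"}, []string{"light", "fight"}))
}
```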

Thanks for sharing your thoughts and plans; it sounds really interesting!

joeblew99 commented 7 years ago

Thanks for the tips. I have it working with bleve, NATS & Minio for now. The NLP aspect from this repo I will try to add now.