propublica / transcribable

Drop in crowdsourcing for your Rails app. Extracted from Free the Files.
MIT License

Computer vision thoughts? #2

Open fgregg opened 11 years ago

fgregg commented 11 years ago

Hi @ashaw,

In your post on casino-driven design, you mentioned that you had been thinking about computer vision for Transcribable. Have you made any of your thoughts public anywhere? I've been thinking about working on this issue.

ashaw commented 11 years ago

Hi @fgregg, I haven't published anything concrete about it outside of that blog post, but we have a few things we'd love Transcribable to eventually handle:

One thing that would be super useful, to start, would be an automatic way of divvying up documents by visual similarity, so users would always be assigned the same kind of document.

For example, in our Free the Files project, TV stations would use various contract templates. One looks like: https://projects.propublica.org/free-the-files/filings/25219. Another may look like: https://projects.propublica.org/free-the-files/filings/33965. If we could give user A only documents of the first type, they could transcribe them faster.
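One way the template grouping described here could be sketched is with a perceptual "average hash". This is only an illustration, not anything Transcribable implements: it assumes each page has already been rasterized and shrunk to a tiny grayscale grid, and the names `average_hash`, `hamming`, and `group_by_template` as well as the `max_distance` threshold are hypothetical.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[int]]  # grayscale thumbnail, values 0-255 (assumed input)


def average_hash(grid: Grid) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the page mean."""
    pixels = [p for row in grid for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def group_by_template(hashes: List[int], max_distance: int) -> List[int]:
    """Greedy grouping: a page joins the first group whose representative hash
    is within max_distance bits; otherwise it starts a new group."""
    representatives: List[int] = []
    labels: List[int] = []
    for h in hashes:
        for label, rep in enumerate(representatives):
            if hamming(h, rep) <= max_distance:
                labels.append(label)
                break
        else:
            representatives.append(h)
            labels.append(len(representatives) - 1)
    return labels


# Toy 4x4 thumbnails: two pages with a dark header band, one with a dark left column.
header_a = [[0, 0, 0, 0], [255] * 4, [255] * 4, [255] * 4]
header_b = [[0, 0, 0, 10], [255, 250, 255, 255], [255, 255, 240, 255], [255] * 4]
column = [[0, 255, 255, 255] for _ in range(4)]

hashes = [average_hash(g) for g in (header_a, header_b, column)]
labels = group_by_template(hashes, max_distance=2)
print(labels)  # the two header-band pages share a label; the column page gets its own
```

Real scans would need deskewing and a larger grid before a hash like this becomes reliable, so treat the threshold as something to tune on actual filings.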

Another thing we've wanted to do is include a classifier that would learn over time where certain data points are in each form type, and selectively either assign the data within those boxes to be transcribed or OCR'd. So, for example, once the machine was trained, a certain user would just start getting assigned buyer names from stations that used a specific form type. This would vastly speed up the number of tasks a user could do, cutting down on time it takes to hunt around for a specific data point.
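As a rough sketch of "learning where a data point is" on each form type, one simple stand-in for a classifier is to collect the boxes users have marked and take a coordinate-wise median per form type and field. Everything here is an assumption for illustration: the `FieldLocator` class, the box format (fractions of the page), and the `min_examples` cutoff are hypothetical and not part of Transcribable.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height) as page fractions


class FieldLocator:
    """Accumulates user-marked boxes and estimates where a field usually sits."""

    def __init__(self, min_examples: int = 3) -> None:
        self.min_examples = min_examples
        self._boxes: Dict[Tuple[str, str], List[Box]] = defaultdict(list)

    def record(self, form_type: str, field: str, box: Box) -> None:
        """Store one box a user drew around a field on a given form type."""
        self._boxes[(form_type, field)].append(box)

    def predict(self, form_type: str, field: str) -> Optional[Box]:
        """Return the coordinate-wise median box, or None if too few examples."""
        boxes = self._boxes.get((form_type, field), [])
        if len(boxes) < self.min_examples:
            return None
        return tuple(median(b[i] for b in boxes) for i in range(4))


locator = FieldLocator(min_examples=3)
marked = [
    (0.10, 0.20, 0.30, 0.05),
    (0.12, 0.22, 0.30, 0.05),
    (0.70, 0.80, 0.10, 0.10),  # a stray mis-click; the median ignores it
]
for box in marked:
    locator.record("template_a", "buyer_name", box)

predicted = locator.predict("template_a", "buyer_name")
unknown = locator.predict("template_b", "buyer_name")
```

Once `predict` returns a box for a form type, that crop could be sent either to a user or to OCR, which is the selective assignment described above.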

Let me know if you'd like to work on either of these ideas, or if you have other ideas.

Thanks, Al

fgregg commented 11 years ago

How were you thinking about the architecture for this stuff? We are probably going to want to use OpenCV for many of these tasks. There's a nascent Ruby wrapper for OpenCV (https://github.com/ruby-opencv), but the Python interface is much more developed. Personally, I'm a Python guy.

ashaw commented 11 years ago

I've never worked with OpenCV, but I'd prefer to keep as much in Ruby as possible. For table detection, we may want to use (or take inspiration from) a library one of my colleagues, @jeremybmerrill, has worked on: tabula-extractor.

This seems particularly useful: https://github.com/jazzido/tabula-extractor/blob/master/lib/tabula/table_guesser.rb

jazzido commented 11 years ago

Depending on OpenCV is a deployment nightmare (at least that was our experience when developing tabula and tabula-extractor); it's just difficult to install for normal users.

Manuel Aristarán http://jazzido.com

fgregg commented 11 years ago

Yeah, but it's hard for me to imagine doing most of these things without OpenCV (or substantially reimplementing it). Maybe a siloed approach would work: keep the vision code in its own service and have it communicate with DocumentCloud and Transcribable through a web API.
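A minimal sketch of what such a separate service might look like, using only Python's standard library. The `/classify` route, the JSON fields, and the `classify` placeholder are all hypothetical; nothing in this thread specifies an actual interface.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify(payload: dict) -> dict:
    """Placeholder for the vision step; a real service would call OpenCV here."""
    return {"document_id": payload["document_id"], "template": "unknown"}


class VisionHandler(BaseHTTPRequestHandler):
    """Accepts a JSON POST and answers with the JSON result of classify()."""

    def do_POST(self) -> None:
        if self.path != "/classify":  # hypothetical route
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(classify(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the service locally; call this explicitly to start listening."""
    HTTPServer(("127.0.0.1", port), VisionHandler).serve_forever()
```

With this split, the Rails side would only need an HTTP client, and the OpenCV install burden stays on one machine.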

jeremybmerrill commented 11 years ago

As Manuel (@jazzido) says, ruby-opencv is a nightmare. It's an hours-long slog to install on any given platform. tabula-extractor uses JRuby and the JavaCV port of OpenCV. This is far easier to deal with, though it forces you onto JRuby and off of the MRI.

tabula-extractor's table_guesser.rb assumes that tables have perfectly straight and perfectly oriented lines, since it assumes its input hasn't been through the noisy step of being printed and scanned. Adjusting it away from this assumption into detecting "regions" would likely be possible and within the scope of an ancillary project, but a decent amount of work.
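To illustrate what relaxing the "perfectly straight, perfectly oriented" assumption could involve, here is a small sketch that accepts line segments within a few degrees of horizontal or vertical. It is not how table_guesser.rb works; the `orientation` function, the segment format, and the 2-degree default tolerance are assumptions chosen for the example.

```python
import math
from typing import Tuple

Segment = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def orientation(segment: Segment, tolerance_deg: float = 2.0) -> str:
    """Label a segment 'horizontal', 'vertical', or 'other', allowing slight skew."""
    x1, y1, x2, y2 = segment
    # Fold the direction into [0, 180) so the result ignores endpoint order.
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    if angle <= tolerance_deg or angle >= 180.0 - tolerance_deg:
        return "horizontal"
    if abs(angle - 90.0) <= tolerance_deg:
        return "vertical"
    return "other"


slightly_skewed_rule = (0.0, 0.0, 100.0, 1.0)   # about 0.57 degrees off horizontal
slightly_skewed_edge = (0.0, 0.0, 1.0, 100.0)   # about 0.57 degrees off vertical
diagonal_scratch = (0.0, 0.0, 100.0, 100.0)     # 45 degrees

print(orientation(slightly_skewed_rule))
print(orientation(slightly_skewed_edge))
print(orientation(diagonal_scratch))
```

Scanned pages often have a global skew as well, so estimating and removing that rotation first would likely let the tolerance stay small.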

Manuel'd be a better expert on this than I, but there's significant academic work into this problem of locating tables on OCRed pages. He may be able to recommend a few introductory papers.

fgregg commented 11 years ago

Is there a DocumentCloud API search that will get me examples of Free the Files PDFs?

jazzido commented 11 years ago

> tabula-extractor uses JRuby and the JavaCV port of OpenCV. This is far easier to deal with, though it forces you onto JRuby and off of the MRI.

Not anymore. Depending on such a huge lib just for detecting lines wasn't worth the hassle. We're now using a small C library called LSD that we link to using Ruby's foreign function interface library.

W.r.t. the problem of locating tables in scanned documents: that's a whole field of research. Keywords: document segmentation, document analysis, table detection, etc.

jeremybmerrill commented 11 years ago

Oops, lol, forgot about that switch. Nevertheless, JavaCV is an option.

ashaw commented 11 years ago

@fgregg Re: the documentcloud search, try this: https://www.documentcloud.org/api/search/contributedto:%20%22freethefiles%22
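For anyone adapting that URL, the percent-escapes simply encode a space and double quotes around the project name. A small sketch that decodes the posted URL and rebuilds it from a query string, with no network request made; `SEARCH_BASE` is just the prefix copied from the URL above, and `search_url` is a hypothetical helper.

```python
from urllib.parse import quote, unquote

# Prefix copied from the URL posted above; whether it still resolves is not checked here.
SEARCH_BASE = "https://www.documentcloud.org/api/search/"


def search_url(query: str) -> str:
    """Percent-encode a search query, leaving ':' readable as in the posted URL."""
    return SEARCH_BASE + quote(query, safe=":")


posted = "https://www.documentcloud.org/api/search/contributedto:%20%22freethefiles%22"

# Decode the part after the prefix to see the human-readable query.
query = unquote(posted[len(SEARCH_BASE):])
print(query)

# Re-encoding the decoded query reproduces the posted URL.
rebuilt = search_url(query)
print(rebuilt == posted)
```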

knowtheory commented 11 years ago

Just a thought: would it be worth looking at the Leptonica library, which Tesseract depends upon?

I haven't tried installing it independently or using it yet tho.