WhatWordWhere: Progress & Next Steps

boblannon commented 10 years ago

Other interested parties: @jsfenfen @lukerosiak

Summary: Using the x,y coordinates of a scanned form's fields, treat the search for interesting regions of a document as a geo-search.

With an idea of where a field is expected to be, we can run fuzzy queries for that field across multiple documents, extracting isolated areas to be digitized individually using off-the-shelf OCR (tesseract, etc).

Method: One output of tesseract gives us bounding boxes of the text that it found, expressed as x, y coordinates. Selecting the bounding boxes that represent a field of interest, we can record its location and look for boxes that are roughly similar in other documents for which we also have bounding-box information.
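A minimal sketch of the idea, not the repo's actual code: it assumes the bounding boxes come from tesseract's hOCR output, where each word carries a `title` attribute such as `bbox 36 92 580 122; x_wconf 95`. The helper names (`parse_bbox`, `roughly_similar`) and the 15-pixel tolerance are made up for illustration.

```python
import re

# hOCR stores each box in the element's title attribute,
# e.g. title="bbox 36 92 580 122; x_wconf 95"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(title)
    return tuple(int(v) for v in match.groups()) if match else None


def roughly_similar(box, target, tolerance=15):
    """True if every edge of `box` lies within `tolerance` pixels of `target`.

    The tolerance is an arbitrary illustrative value, not a tuned one.
    """
    return all(abs(a - b) <= tolerance for a, b in zip(box, target))


# The field of interest, as recorded from one reference document.
target = parse_bbox("bbox 100 200 300 230; x_wconf 95")

# Boxes found on the same page of another document.
candidates = [
    parse_bbox("bbox 104 198 297 233; x_wconf 91"),
    parse_bbox("bbox 400 600 520 630; x_wconf 88"),
]

matches = [box for box in candidates if roughly_similar(box, target)]
```

Here only the first candidate survives, since each of its edges is within a few pixels of the target. A real version would probably also want to tolerate small page shifts or scaling between scans.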

So Far:

Using direct database queries, we've been able to prove this strategy by extracting at least one field across many 990s.

To Do:

normalize coordinate systems with the output of poppler's pdftotext (demo here: https://github.com/pdfliberation/pdf_table_extraction/tree/master/visuals)
write django-style filters for fuzzy bounding box matches
basic graphical UI for easier selection of areas and instant feedback based on selections
data entry! data quality checking!
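On the coordinate normalization item, one possible approach (a sketch under assumptions, not a decided design) is to rescale every box to a unit square using the page dimensions, so that pixel-based OCR boxes and point-based PDF boxes become comparable. This assumes both sources measure from the same page corner and that the page width and height are known for each source. The `normalize` helper is hypothetical.

```python
def normalize(box, page_width, page_height):
    """Rescale (x0, y0, x1, y1) to fractions of the page size."""
    x0, y0, x1, y1 = box
    return (
        x0 / page_width,
        y0 / page_height,
        x1 / page_width,
        y1 / page_height,
    )


# A US Letter page is 612 x 792 points in a PDF and
# 2550 x 3300 pixels when scanned at 300 dpi.
ocr_box = normalize((255, 330, 510, 660), 2550, 3300)
pdf_box = normalize((61.2, 79.2, 122.4, 158.4), 612, 792)
```

Both calls describe the same region of the page, so the two results agree up to floating-point error and can be fed to the same fuzzy match.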

jsfenfen commented 10 years ago

FWIW there are milestones that roughly describe the next two possible activities: adding a gui, either from Bob's d3 library or my raphael one (Bob's is probably the better choice); standardizing the backend/frontend communication to geojson format whenever possible; and doing more testing on the specific 990 data. Data quality works on the 1,000 doc sample linked from the repo, but there are likely lurking issues elsewhere. There's a hunk of django-independent geojson serialization stuff that needs to be committed that hopefully I'll do in the next few days.

lukerosiak commented 10 years ago

This is amazing.

On Mon, Feb 10, 2014 at 9:43 PM, Jacob Fenton notifications@github.com wrote:

FWIW there are milestones (https://github.com/jsfenfen/whatwordwhere/issues/milestones) that roughly describe the next two possible activities [...]

dcloud commented 10 years ago

I'm curious what the Django portion is providing. Is it the support for PostGIS that's most valuable? Is there a sense that we can or should divorce the app from Django? I guess I'm thinking it would be great to examine how this tool could fit in a chain of tools (tesseract->whatwordwhere->???) and how modular we can make it.

@jsfenfen can we have the milestones and tickets on pdfliberation/whatwordwhere rather than your personal copy?

I do like both ideas of working on the GUI and using geojson.

boblannon commented 10 years ago

From what I've read/heard, geodjango is a particularly friendly ORM for PostGIS. I don't think using it poses any threat to modularity, especially if it helps us write more maintainable code. In the end, things are stored in PostGIS, and can be queried directly if the ORM comes up short or if we make new decisions about the stack.

jsfenfen commented 10 years ago

@dcloud, @boblannon : Many of the 'pieces' of the repo are themselves not dependent on django: the part that reads the hocr and turns it into a python data structure, and the part that turns that data structure into geojson are both django independent. I'll commit an example script that turns hocr into geojson without django, and if it's particularly useful we can roll that into its own repo. As of now, the clearest indication is just in the files themselves.
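To give a rough idea of what a django-free hocr-to-geojson step could look like, here is a sketch using only the standard library. It is not the script mentioned above: the function name `bbox_to_feature` and the `text` property are assumptions, and the coordinates are page pixels rather than longitude/latitude.

```python
import json


def bbox_to_feature(text, box):
    """Wrap one word and its (x0, y0, x1, y1) box as a GeoJSON Feature."""
    x0, y0, x1, y1 = box
    # A GeoJSON polygon ring must be closed: the first and last points match.
    ring = [[x0, y0], [x1, y0], [x1, y1], [x0, y1], [x0, y0]]
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": {"text": text},
    }


feature = bbox_to_feature("Form", (36, 92, 580, 122))
print(json.dumps(feature))
```

A page would then be a `FeatureCollection` of such features, which most mapping front ends (including d3) can draw directly.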

FWIW, django's ORM is explicitly ignored during loading operations because it's just faster to use bulk operations rather than instantiating django objects.

Besides tying documents / pages to the django ORM, the utility of django is really the GEOS bindings. So any geographic call made by python should actually be hitting the GEOS code. One could sub out another set of GEOS bindings, though I definitely don't plan to do that.

In general I think of this as running in two modes: one where the data is stored in a db (call it 'exploratory' mode) for creating page classifiers and data extractors and testing them, and another in 'bulk' mode, where the data is never actually loaded into a db but the extraction routines are run against the GEOS objects one at a time. There's a lotta good reasons to prefer loading everything into a database, it's just the reality of having 50 million pages that makes it annoying to do so.

The downside of the two-mode operation is that filters will need to be applied in two ways: as a postgis query and as a query against GEOS objects just in memory. I imagine the filters will be pretty simple, so this won't really be an issue, but...
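One way to keep the two modes in sync, sketched here under assumptions, is to define each filter once and give it both a SQL rendering and an in-memory test. The `BoxFilter` class, the `words` table, and the `bbox` column are hypothetical. The in-memory side uses plain tuples instead of GEOS objects to stay self-contained, and the SQL side assumes the geometry column has no SRID set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoxFilter:
    """A rectangular search region, usable in 'exploratory' or 'bulk' mode."""

    x0: float
    y0: float
    x1: float
    y1: float

    def to_sql(self, table="words", column="bbox"):
        """Return a parameterized PostGIS query and its parameters.

        `&&` is PostGIS's bounding-box overlap operator; the table and
        column names are placeholders, not the repo's real schema.
        """
        sql = (
            f"SELECT * FROM {table} "
            f"WHERE {column} && ST_MakeEnvelope(%s, %s, %s, %s)"
        )
        return sql, (self.x0, self.y0, self.x1, self.y1)

    def matches(self, box):
        """In-memory equivalent: does `box` overlap this region?"""
        bx0, by0, bx1, by1 = box
        return (
            bx0 <= self.x1
            and bx1 >= self.x0
            and by0 <= self.y1
            and by1 >= self.y0
        )


field = BoxFilter(90, 190, 310, 240)
sql, params = field.to_sql()
hits = [b for b in [(104, 198, 297, 233), (400, 600, 520, 630)] if field.matches(b)]
```

With simple rectangular filters like this the duplication stays small; anything fancier would need the two implementations tested against each other.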

jsfenfen commented 10 years ago

@boblannon @dcloud - See this demo.
