Improve HTML text extraction

unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.

https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/

Creative Commons Zero v1.0 Universal

107 stars, 21 forks

Improve HTML text extraction #91

Closed: konklone closed this issue 10 years ago

konklone commented 10 years ago

This gist has an example of what the current HTML text extraction code (which uses BeautifulSoup) produces:

https://gist.github.com/konklone/38eab2f8e5649631b8f9

It's not good, and could probably be improved. Not sure how offhand, but documenting this here.

Referenced report metadata:

{
  "agency": "education",
  "agency_name": "Department of Education",
  "file_type": "html",
  "inspector": "education",
  "inspector_url": "https://www2.ed.gov/about/offices/list/oig/",
  "published_on": "2013-02-11",
  "report_id": "ca022013",
  "title": "Guilty Pleas in Federal Student Financial Aid Fraud Schemes.  Sacramento, CA., February 11, 2013",
  "type": "report",
  "url": "https://www2.ed.gov/about/offices/list/oig/invtreports/ca022013.html",
  "year": 2013
}
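For reference, a minimal sketch of tag-stripping text extraction using only the Python standard library. This is not the project's BeautifulSoup code; the class name `TextExtractor`, the function `html_to_text`, and the choice of block-level tags are assumptions made for illustration. It shows the basic moves such an extractor needs: skip `script`/`style` contents, break lines at block-level tags, and collapse whitespace.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script/style tags."""

    SKIP = {"script", "style"}
    # Hypothetical choice of tags that should force a line break.
    BLOCK = {"p", "div", "br", "li", "tr", "table", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of spaces/tabs within each line and drop empty lines.
    lines = (re.sub(r"[ \t\r\f\v]+", " ", line).strip()
             for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Calling `html_to_text(page_source)` on a downloaded report page returns plain text with one block per line. It will still include navigation and footer text, which is the quality problem described above.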

konklone commented 10 years ago

To compare, this is the "before" for #110 and this is the "after".

It looks better to me! (A mess either way, but a smaller one.) So I'm happy to merge #110.

It seems like the best approach to extracting text from HTML would be what Readability and Instapaper did for smart content extraction. Some resources on that:

konklone commented 10 years ago

Also, here's something I just found: https://github.com/deanmalmgren/textract

It's trending on GitHub this week. The docs are good, and they say they use:

.doc via antiword
.docx via python-docx
.eml via python builtins.
.json via python builtins.
.html via beautifulsoup4
.pptx via python-pptx
.pdf via pdftotext (default) or pdfminer
.txt via python builtins.

Seems like it might be more or less drop-in for our purposes, and gives us .doc support, as @divergentdave suggested in #113.
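To illustrate what "drop-in" could look like, here is a rough sketch of per-extension dispatch in the style of the list above. It is not textract's code or API: the names `EXTRACTORS`, `extract_text`, `extract_txt`, and `extract_json` are hypothetical, and only the two formats that need nothing beyond Python builtins are filled in. Flattening JSON to its string values is also an assumption about what "text" should mean for that format.

```python
import json
from pathlib import Path


def extract_txt(path):
    """Plain text: read the file as-is."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


def _strings(value):
    """Yield every string value found in a decoded JSON structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from _strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from _strings(item)


def extract_json(path):
    """JSON: join all string values, one per line (an assumed convention)."""
    with open(path, encoding="utf-8") as handle:
        return "\n".join(_strings(json.load(handle)))


# Hypothetical registry; .doc, .pdf, .html etc. would be added the same way.
EXTRACTORS = {
    ".txt": extract_txt,
    ".json": extract_json,
}


def extract_text(path):
    """Pick an extractor based on the file extension (case-insensitive)."""
    ext = Path(path).suffix.lower()
    try:
        handler = EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"no extractor registered for {ext!r}") from None
    return handler(path)
```

With this shape, adding .doc support would mean registering one more function, which is why a library that already covers those formats looked attractive here.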

divergentdave commented 10 years ago

The bad news is that textract-0.5.1 only works on Python 2, so we can't use it yet.

Edit: Moreover, textract depends on PIL, which is Python 2 only! There is a PIL fork that supports Python 3, though.

konklone commented 10 years ago

Bummer. I guess its command-line version could be installed system-wide, like pdftotext or pdftk, but that seems dicey given the possible incompatibility issues.

Maybe we could just steal its HTML extraction code!

divergentdave commented 10 years ago

Possibly, though we would need a public domain dedication. (In reply to Eric Mill's comment above, dated Aug 11, 2014.)

konklone commented 10 years ago

Well, their implementation is actually very small, and adapted from this StackOverflow answer. StackOverflow licenses everything as CC-BY-SA, with attribution required. Their attribution guidelines don't discuss reusing actual code snippets, but I think this whole thing is small enough to be de minimis. I've certainly swiped some SO answer code and tossed it in a public domain project before.

konklone commented 10 years ago

If anyone wants to tackle it, go for it, but the textract code that was pointed out wouldn't do the Readability/Instapaper-style smart extraction of meaningful content. So right now, unless we were to perform surgery on one of the 4 resources I listed above about that, we're at an impasse. We improved the quality through @divergentdave's work in #110, so this is close-able for me.
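For anyone picking this up later, here is a rough sketch of the kind of heuristic that "smart" extraction relies on. It is not Readability's or Instapaper's actual algorithm, only a simplified stand-in: split the page into block-level segments, then keep segments that are long enough and not mostly link text. The names `SegmentParser` and `main_content`, and the thresholds `min_chars` and `max_link_density`, are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags that delimit text segments.
BLOCK_TAGS = {"p", "div", "td", "li", "br", "tr", "table", "ul", "ol",
              "body", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class SegmentParser(HTMLParser):
    """Splits a page into text segments and counts link text in each."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []  # list of (text, link_chars)
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current segment, normalizing whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.segments.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def main_content(html, min_chars=40, max_link_density=0.3):
    """Keep segments that look like body text rather than navigation."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.segments:
        density = link_chars / len(text)
        if len(text) >= min_chars and density <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

The thresholds would need tuning against real IG report pages, and short but meaningful lines (dates, headings) get dropped by a filter this blunt, so treat it as a starting point rather than a fix.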