Closed by konklone 10 years ago
To compare, this is the "before" for #110 and this is the "after".
It looks better to me! (A mess either way, but a smaller one.) So I'm happy to merge #110.
It seems like the best approach to extracting text from HTML would be what Readability and Instapaper did for smart content extraction. Some resources on that:
Also, here's something I just found: https://github.com/deanmalmgren/textract
It's trending on GitHub this week. The docs are good, and they say they use:
Seems like it might be more or less drop-in for our purposes, and gives us .doc support, as @divergentdave suggested in #113.
The bad news is textract-0.5.1 only works on Python 2, so we can't use it yet.
Edit: Moreover, textract depends on PIL, which is Python 2 only! There is a PIL fork that supports Python 3, though.
Bummer. I guess its command-line version could be installed system-wide, like pdftotext or pdftk, but that seems dicey given the possible incompatibility issues.
Maybe we could just steal its HTML extraction code!
Possibly, though we would need a public domain dedication.
Well, textract's implementation is actually very small, and adapted from this StackOverflow answer. StackOverflow licenses everything as CC-BY-SA, with attribution required. StackOverflow's attribution guidelines don't discuss reusing actual code snippets, but I think this whole thing is small enough to be de minimis. I've certainly swiped some SO answer code and tossed it in a public domain project before.
If anyone wants to tackle it, go for it, but the textract code that was pointed out wouldn't do the Readability/Instapaper-style smart extraction of meaningful content. So right now, unless we were to perform surgery on one of the 4 resources I listed above about that, we're at an impasse. We improved the quality through @divergentdave's work in #110, so this is close-able for me.
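For anyone who does pick this up, here is a rough sketch of the general idea behind "smart" extraction: score each container element by how much paragraph text it holds and keep the highest-scoring one. This is not Readability's or Instapaper's actual algorithm, just a crude stand-in using only the standard library. The names `ParagraphScorer`, `extract_main_text`, and the `CONTAINERS` set are made up for illustration, and it assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Elements treated as candidate "content containers" (an assumption, not a standard list).
CONTAINERS = {"div", "article", "section", "td", "body"}


class ParagraphScorer(HTMLParser):
    """Group <p> text under the nearest enclosing container element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []    # indexes into self.blocks for currently open containers
        self.blocks = []   # one list of paragraph strings per container seen
        self.in_p = False
        self.current = []  # text fragments of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.blocks.append([])
            self.stack.append(len(self.blocks) - 1)
        elif tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag in CONTAINERS and self.stack:
            self._flush()
            self.stack.pop()  # assumes containers are closed in order

    def handle_data(self, data):
        if self.in_p:
            self.current.append(data)

    def _flush(self):
        # Normalize whitespace and credit the paragraph to the innermost container.
        text = " ".join("".join(self.current).split())
        if text and self.stack:
            self.blocks[self.stack[-1]].append(text)
        self.current = []
        self.in_p = False


def extract_main_text(html):
    """Return the paragraphs of the container with the most paragraph text."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    scorer.close()
    best = max(scorer.blocks, key=lambda paras: sum(len(p) for p in paras), default=[])
    return "\n\n".join(best)
```

On a page with a short navigation block, a long body block, and a short footer, this keeps only the body paragraphs. Real pages would need more signals (link density, class names, nested containers) before this is usable.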
This gist has an example of what the current HTML text extraction code (which uses BeautifulSoup) produces:
https://gist.github.com/konklone/38eab2f8e5649631b8f9
It's not good, and could probably be improved. Not sure how offhand, but documenting this here.
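For reference, a baseline "dump all visible text" extractor looks roughly like the sketch below. This is not the project's actual code (which uses BeautifulSoup); it's a standard-library stand-in, and `TextExtractor` and `html_to_text` are hypothetical names. It shows why the output is messy: navigation, footers, and body text all come out in one undifferentiated stream.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return all visible text in the document as one whitespace-normalized string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so layout markup does not leave large gaps.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
```

Joining fragments with a space keeps adjacent blocks from running together, at the cost of occasionally splitting a word that is broken up by inline tags.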
Referenced report metadata: