Open bamnet opened 13 years ago
I'm going to have to think about this some more. It looks like all of the extractors call String#strip to remove trailing whitespace. Those characters appear to be whitespace of some kind, and if that is the case I'm fine with coming up with a replacement for String#strip that grabs these characters too.
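As a rough sketch of what such a replacement could look like: the `\302\240` byte pair in the failure output is the UTF-8 encoding of U+00A0 (non-breaking space), which String#strip leaves in place. The helper name `unicode_strip` is hypothetical and not part of textractor, and the sketch assumes Ruby 1.9 or later with UTF-8 strings, where the POSIX class `[[:space:]]` also matches non-breaking spaces.

```ruby
# Hypothetical helper (not part of textractor): removes leading and trailing
# whitespace, including U+00A0, which String#strip does not touch.
# Assumes the input is a UTF-8 encoded string.
def unicode_strip(text)
  text.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

# The string from the failing spec, with \302\240 written as \u00A0.
raw = "text\t\r \u00A0 \t\r \u00A0"
puts unicode_strip(raw).inspect
```

Whether this belongs in each extractor or in one shared place is a design choice left open here.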
When I run PDF tests I get output that looks like this:

    Textractor returns the contents of pdf documents
      Failure/Error: Textractor.text_from_path(fixture_path("document.pdf")).should == 'text'
        expected: "text",
             got: "text\t\r \302\240 \t\r \302\240" (using ==)
My pdftotext version must handle formatting characters differently from yours. Do you think this is something textractor should handle?
In my use case I never care about the document formatting. I only want strings separated by spaces, with a limited subset of punctuation (i.e. periods and commas), for use in indexing documents. I don't mind handling this functionality in each application, but I'd be glad to write it into textractor if you think there's value in that.
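For illustration, the kind of cleanup described above might look roughly like this. The method name `normalize_for_index` is hypothetical, and the exact set of characters to keep (letters, digits, periods, commas) is an assumption based on the description, not existing textractor behavior.

```ruby
# Hypothetical helper (not part of textractor): keeps letters, digits,
# periods and commas, and turns every run of anything else (tabs, carriage
# returns, non-breaking spaces, other punctuation) into a single space.
def normalize_for_index(text)
  # After the gsub only plain ASCII spaces remain, so String#strip suffices.
  text.gsub(/[^[:alnum:].,]+/, " ").strip
end

puts normalize_for_index("text\t\r \u00A0 \t\r \u00A0").inspect
```

An application that needs a different punctuation subset would only have to change the character class.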