python-openxml / python-docx

Create and modify Word documents with Python
MIT License

feature: Document.text #72

Open deanmalmgren opened 10 years ago

deanmalmgren commented 10 years ago

@mikemaccana's old project had a simple script for extracting text from a document. Took me a few minutes to figure it out, but this is really simple now:

import docx

# `filename` is the path to a .docx file.
document = docx.Document(filename)
text = '\n\n'.join(
    paragraph.text for paragraph in document.paragraphs
)

Opening this issue with this little snippet might be enough to document the approach, but it would be nice to include it somewhere in the documentation or as a script that is installed with the package. I'm happy to contribute.

Do you have any preferences on a script vs documenting this two-liner? If just documenting is enough, any thoughts on where it should go?

scanny commented 10 years ago

Yeah, I'm thinking it probably makes sense to have a Document.text property, or something like it, that produces a list of strings roughly like this. The question comes up from time to time for indexing purposes and so forth.

This particular snippet misses text that's in tables, so there would need to be a little more to it, but I'm sure it would be modest in size.
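A minimal sketch of what "a little more" could look like, assuming the document object exposes `paragraphs` and `tables` the way a python-docx `Document` does. The name `body_and_table_text` is hypothetical and not part of the library, and nested tables are not handled here.

```python
def body_and_table_text(document):
    """Return body paragraph text followed by table-cell text.

    `document` is assumed to be a python-docx Document, or any object
    with the same `paragraphs` and `tables` attributes. This helper is
    hypothetical; it is not part of python-docx.
    """
    # Body paragraphs first, in the order the library reports them.
    chunks = [paragraph.text for paragraph in document.paragraphs]
    # Then every cell of every top-level table, row by row.
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                chunks.append(cell.text)
    return chunks
```

With python-docx installed, it would be called as `body_and_table_text(docx.Document(filename))`. Note that this appends table text after the body text rather than placing it where the table appears in the document.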

One thing other folks have mentioned is also capturing text that's in headers, footers, footnotes, and endnotes. I'm supposing it's enough that some folks will want that, but wondering a little bit about whether it makes sense to keep those bits separate, perhaps returning a tuple like (document_text, hdr_ftr_text, end_and_foot_note_text) so folks could pick and choose without having to go to several different objects to collect it all.
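A rough sketch of the tuple idea, assuming the document object exposes `paragraphs` and `sections`, and that each section has `header` and `footer` objects with their own `paragraphs`, as python-docx sections do. The name `split_text` is hypothetical. Footnotes and endnotes are left out of this sketch, so it returns a two-item tuple rather than the three-item one described above.

```python
def split_text(document):
    """Return (document_text, hdr_ftr_text) as two separate strings.

    `document` is assumed to be a python-docx Document or a look-alike.
    This helper is hypothetical; it is not part of python-docx.
    """
    body = [paragraph.text for paragraph in document.paragraphs]

    hdr_ftr = []
    for section in document.sections:
        # Each section carries its own header and footer objects.
        hdr_ftr.extend(p.text for p in section.header.paragraphs)
        hdr_ftr.extend(p.text for p in section.footer.paragraphs)

    return "\n\n".join(body), "\n\n".join(hdr_ftr)
```

Callers who only want the body text can unpack the tuple and ignore the second item, for example `document_text, _ = split_text(document)`.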

Based on your needs, do you have a point of view on that?

deanmalmgren commented 10 years ago

Ah, I didn't realize this would miss tables, headers, and footers. If that's feasible, it would be awesome. I've recently started a project to extract text from any document, and I think it would be helpful to be able to omit headers and footers but keep tables, for example. In my particular use case, it would actually be beneficial to have the tables correctly interwoven with the body text, so returning a tuple is less desirable.

Maybe instead of a Document.text property it could be a method that has a signature with optional kwargs that make it easy to select different parts of the text:

class Document(object):
    def get_text(self, omit_tables=False, omit_footers=False, omit_headers=False):
        pass
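One way the proposed signature could be filled in, written as a free function rather than a method. This is only a sketch under assumptions: it relies on the document object providing `iter_inner_content()`, which recent python-docx releases offer for walking paragraphs and tables in document order, and it tells tables apart by checking for a `rows` attribute. Joining cells with tabs and blocks with blank lines is an arbitrary choice made for illustration.

```python
def get_text(document, omit_tables=False, omit_footers=False, omit_headers=False):
    """Hypothetical free-function version of the proposed Document.get_text()."""
    chunks = []

    if not omit_headers:
        for section in document.sections:
            chunks.extend(p.text for p in section.header.paragraphs)

    # iter_inner_content() is assumed to yield paragraphs and tables in
    # document order, which keeps tables interwoven with the body text.
    for block in document.iter_inner_content():
        if hasattr(block, "rows"):  # duck-typed check for a table
            if omit_tables:
                continue
            for row in block.rows:
                chunks.append("\t".join(cell.text for cell in row.cells))
        else:
            chunks.append(block.text)

    if not omit_footers:
        for section in document.sections:
            chunks.extend(p.text for p in section.footer.paragraphs)

    return "\n\n".join(chunks)
```

For the use case described above (keep tables, drop headers and footers), the call would be `get_text(document, omit_headers=True, omit_footers=True)`.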

deanmalmgren commented 9 years ago

This is related to #40 and https://github.com/deanmalmgren/textract/pull/92, too. Just adding this here as a note for myself and anyone else that might take a crack at this.