python-openxml / python-docx

Create and modify Word documents with Python
MIT License

Remove "generated by python-docx" from description tag #1387

Closed shoang22 closed 5 months ago

shoang22 commented 5 months ago

Hello,

I'm building a translation app that converts PDFs to docx files. From each docx I generate XLIFF, which gets parsed so that I can run machine translation before merging the translations into the output doc (expected to be in the same format as the input docx).

When I use the parser to extract text from the generated docx file, I get extra text that I'm assuming comes from here. I tried simply removing the lines, but the parser still cannot merge the source and target language docs.

Is there a way to ensure the tags don't get generated?

icy-comet commented 5 months ago

You can just modify these properties through a document's core_properties attribute.

References:

shoang22 commented 5 months ago

Thanks for the reference. What if I wanted to remove select components from core_properties entirely? Initially, I set them as:

docx_doc.core_properties.comments = ""
docx_doc.core_properties.author = ""

The problem with this is that the parser (tikal) still recognizes them and reads them as two blank lines. When merging with the document containing the target text, I have to add two blank lines to the end of the target file to make it work. I was wondering if there's a more elegant solution.

I tried to delete them but was met with the following:

AttributeError: property 'comments' of 'CoreProperties' object has no deleter
scanny commented 5 months ago

This should do the trick:

# -- corresponds to "comments" --
core_properties._element._remove_description()
# -- corresponds to "author" --
core_properties._element._remove_creator()
shoang22 commented 5 months ago

> This should do the trick:
>
> # -- corresponds to "comments" --
> core_properties._element._remove_description()
> # -- corresponds to "author" --
> core_properties._element._remove_creator()

Both of these still set comments and author to an empty string:

-> docx_doc.core_properties.comments
'generated by python-docx'

-> docx_doc.core_properties._element._remove_description()

-> docx_doc.core_properties.comments
''
scanny commented 5 months ago

@shoang22 okay, well I'm sure there's a reason we did it that way, possibly because Dublin Core (the "core" in core-properties) attributes should always be type str, even if they are not "filled".

If for your use case you prefer the value None you can use the expression: comments = core_properties.comments or None

>>> core_properties = document.core_properties
>>> core_properties.comments
''
>>> print(core_properties.comments or None)
None
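
Since the original concern is what a downstream parser sees in the file, it may help to look at the saved package directly rather than at the property getters. The sketch below uses only the standard library and assumes the core properties live at `docProps/core.xml`, which is where python-docx and Word normally write them. The function name `core_property_tags` and the in-memory sample package are hypothetical, for illustration only.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the usual location of the core-properties part in a .docx package.
CORE_PROPS_PART = "docProps/core.xml"


def core_property_tags(docx_file):
    """Return the local tag names of the elements in docProps/core.xml."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read(CORE_PROPS_PART))
    # ElementTree spells tags as "{namespace-uri}local"; keep the local part.
    return [child.tag.rsplit("}", 1)[-1] for child in root]


# Hypothetical stand-in for a saved .docx: a zip holding only the
# core-properties part, with an empty dc:description element.
sample_xml = (
    "<cp:coreProperties"
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Report</dc:title><dc:description/>"
    "</cp:coreProperties>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr(CORE_PROPS_PART, sample_xml)
buffer.seek(0)

tags = core_property_tags(buffer)
print(tags)  # ['title', 'description']
```

Passing the path of a real saved file, for example `core_property_tags("output.docx")`, lists whatever elements are still present. An empty element such as `<dc:description/>` still shows up here even though it carries no text.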