Closed shoang22 closed 5 months ago
You can modify these properties through a document's core_properties attribute.
References:
Thanks for the reference. What if I wanted to remove select components from core_properties entirely?
Initially, I set them as:
docx_doc.core_properties.comments = ""
docx_doc.core_properties.author = ""
The problem with this is that the parser (tikal) still recognizes them and reads them as two blank lines. When merging with the document that contains the target text, I have to add two blank lines to the end of the target file to make it work. I was wondering if there's a more elegant solution.
I tried to delete them but was met with the following:
AttributeError: property 'comments' of 'CoreProperties' object has no deleter
This should do the trick:
# -- corresponds to "comments" --
core_properties._element._remove_description()
# -- corresponds to "author" --
core_properties._element._remove_creator()
Both of these still leave comments and author reading as an empty string:
-> docx_doc.core_properties.comments
'generated by python-docx'
-> docx_doc.core_properties._element._remove_description()
-> docx_doc.core_properties.comments
''
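Since the property getter returns a str either way, reading core_properties.comments cannot show whether the underlying element is still in the file. One way to check, and to strip the elements if they are still there, is to work on the docProps/core.xml part of the saved .docx package directly. The following is a minimal sketch using only the standard library. It is an alternative that nobody in this thread proposed, strip_core_properties is a hypothetical helper name, and whether tikal then stops emitting the two blank segments has not been verified here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CORE_PART = "docProps/core.xml"

# Namespaces commonly found in the core-properties part of an OPC package.
NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}


def strip_core_properties(docx_bytes, tags=("dc:creator", "dc:description")):
    """Return a copy of a .docx package with `tags` removed from core.xml.

    "dc:creator" backs the author property and "dc:description" backs the
    comments property. All other parts are copied through unchanged.
    """
    # Keep the original prefixes when serializing; attribute values such as
    # xsi:type="dcterms:W3CDTF" rely on the dcterms prefix staying declared.
    for prefix, uri in NAMESPACES.items():
        ET.register_namespace(prefix, uri)

    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CORE_PART:
                root = ET.fromstring(data)
                for tag in tags:
                    # findall with a bare tag matches direct children only.
                    for element in root.findall(tag, NAMESPACES):
                        root.remove(element)
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            dst.writestr(item, data)
    return out.getvalue()
```

A possible way to use it is to save the document with python-docx first, then read the file's bytes, pass them through strip_core_properties, and write the result to the path the parser reads.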
@shoang22 okay, well I'm sure there's a reason we did it that way, possibly because Dublin Core (the "core" in core-properties) attributes should always be type str, even if they are not "filled".
If for your use case you prefer the value None, you can use the expression:
comments = core_properties.comments or None
>>> core_properties = document.core_properties
>>> core_properties.comments
''
>>> print(core_properties.comments or None)
None
Hello,
I'm building a translation app that converts PDFs to docx files. From those I generate XLIFF, which gets parsed so that I can perform machine translation before merging the translations into the output doc (expected to be in the same format as the input docx).
When I use the parser to extract text from the generated docx file, I get extra text that I'm assuming comes from here. I tried simply removing the lines, but the parser still cannot merge the source and target language docs.
Is there a way to ensure the tags don't get generated?