jgm / pandoc

Universal markup converter
https://pandoc.org

Don't add styles in the output document when converting with the option --reference-doc #10088

Open me-kell opened 1 month ago

me-kell commented 1 month ago

When converting with the option --reference-doc (and the extension +styles), don't add any styles to the output document beyond those that already exist in the input document.

Currently, when converting a DOCX document with the +styles extension, using the document itself as --reference-doc, with

pandoc input.docx -f docx+styles -t docx -o output.docx --reference-doc input.docx

the following styles are added to the output document (they were not in the input document):

AlertTok, AnnotationTok, AttributeTok, BaseNTok, BuiltInTok, CharTok, CommentTok, CommentVarTok, ConstantTok, ControlFlowTok, DataTypeTok, DecValTok, DocumentationTok, ErrorTok, ExtensionTok, FloatTok, FunctionTok, ImportTok, InformationTok, KeywordTok, NormalTok, OperatorTok, OtherTok, PreprocessorTok, RegionMarkerTok, SourceCode, SpecialCharTok, SpecialStringTok, StringTok, VariableTok, VerbatimStringTok, WarningTok

The input.docx is an empty document created with a "clean" Normal.dotm.

Is there a way to disable the creation of those styles when the --reference-doc option is given?
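
For anyone who wants to reproduce the comparison, here is a minimal sketch (not part of pandoc) that lists the style IDs present in the output document but not in the input. It assumes both files are ordinary DOCX archives containing word/styles.xml; the function names `style_ids` and `added_styles` are made up for illustration.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/styles.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def style_ids(docx_path):
    """Return the set of w:styleId values declared in word/styles.xml."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/styles.xml"))
    ids = set()
    for style in root.iter(f"{{{W}}}style"):
        style_id = style.get(f"{{{W}}}styleId")
        if style_id is not None:
            ids.add(style_id)
    return ids


def added_styles(input_path, output_path):
    """Style IDs that appear in the output document but not in the input."""
    return sorted(style_ids(output_path) - style_ids(input_path))
```

Calling `added_styles("input.docx", "output.docx")` after running the pandoc command above should return the list of added style IDs, if the behaviour is as described.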

me-kell commented 1 month ago

Pandoc also adds some settings to word/settings.xml that do not exist in the input document:

    <w:displayHorizontalDrawingGridEvery w:val="0"/>
    <w:displayVerticalDrawingGridEvery w:val="0"/>
    <w:doNotTrackMoves/>
    <w:drawingGridHorizontalSpacing w:val="360"/>
    <w:drawingGridVerticalSpacing w:val="360"/>
    <w:embedSystemFonts/>
    <w:footnotePr>
        <w:footnote w:id="0"/>
        <w:footnote w:id="-1"/>
    </w:footnotePr>
    <w:hyphenationZone w:val="425"/>
    <w:listSeparator w:val=";"/>
    <w:proofState w:grammar="clean" w:spelling="clean"/>
    <w:rsids/>
    <w:savePreviewPicture/>
    <w:stylePaneFormatFilter w:val="0004"/>
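
A list like the one above can be produced with a small script. This is only a sketch, assuming both documents contain word/settings.xml; it compares the names of the top-level settings elements and ignores their attributes and children, so a setting whose value changed will not show up. The names `setting_names` and `added_settings` are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile


def setting_names(docx_path):
    """Return the local names of the top-level elements in word/settings.xml."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/settings.xml"))
    # ElementTree writes tags as "{namespace}name"; keep only the name part.
    return {child.tag.rsplit("}", 1)[-1] for child in root}


def added_settings(input_path, output_path):
    """Setting names present in the output document but not in the input."""
    return sorted(setting_names(output_path) - setting_names(input_path))
```

For example, `added_settings("input.docx", "output.docx")` returns names such as `doNotTrackMoves` when that element exists only in the output.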

me-kell commented 1 month ago

AFAICS Pandoc uses the files in data/docx.

When the option --reference-doc my_reference_doc.docx is passed to pandoc, why not use the files in my_reference_doc.docx instead of those in data/docx?

jgm commented 1 month ago

We do carry over some things from the reference.docx. But if we just used everything, we'd get corrupt files (we tried that; see e.g. #9522). So we use a conservative approach to guarantee that the docx we produce is not corrupt. It may be that we can be less conservative about some things. See also #7240.

jgm commented 1 month ago

Here is the code relevant to generating settings.xml:

https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Writers/Docx.hs#L474-L577

jgm commented 1 month ago

The styles AlertTok, AnnotationTok, AttributeTok, BaseNTok, BuiltInTok, CharTok, CommentTok, CommentVarTok, ConstantTok, ControlFlowTok, DataTypeTok, DecValTok, DocumentationTok, ErrorTok, ExtensionTok, FloatTok, FunctionTok, ImportTok, InformationTok, KeywordTok, NormalTok, OperatorTok, OtherTok, PreprocessorTok, RegionMarkerTok, SourceCode, SpecialCharTok, SpecialStringTok, StringTok, VariableTok, VerbatimStringTok, WarningTok are for syntax highlighting. They are generated and depend on the highlighting style you specify. If you specify --no-highlight, they should not appear.
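
One way to check this is sketched below. It is an illustration only: it builds the command from the original report with --no-highlight appended, and looks for leftover highlighting styles in the result. The helper names `roundtrip_command` and `token_styles` are made up, and treating "ends in Tok, or equals SourceCode" as the test for a highlighting style is an assumption based on the list above. Whether --no-highlight is accepted may depend on your pandoc version, so check `pandoc --help` first.

```python
import subprocess
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def roundtrip_command(input_path, output_path, highlight=False):
    """Build the pandoc command from the report, optionally with --no-highlight."""
    cmd = ["pandoc", input_path, "-f", "docx+styles", "-t", "docx",
           "-o", output_path, "--reference-doc", input_path]
    if not highlight:
        cmd.append("--no-highlight")
    return cmd


def token_styles(docx_path):
    """Return style IDs in word/styles.xml that look like highlighting styles."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/styles.xml"))
    ids = (s.get(f"{{{W}}}styleId") for s in root.iter(f"{{{W}}}style"))
    # Assumption: highlighting styles end in "Tok" or are named "SourceCode".
    return sorted(i for i in ids if i and (i.endswith("Tok") or i == "SourceCode"))


def check(input_path, output_path):
    """Run the conversion (requires pandoc on PATH) and report leftover styles."""
    subprocess.run(roundtrip_command(input_path, output_path), check=True)
    return token_styles(output_path)
```

If `check("input.docx", "output.docx")` returns an empty list, none of the highlighting styles were written to the output.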

jgm commented 1 month ago

PS. Also please state your pandoc version.