iandol closed this issue 4 months ago
The code that generates the opendocument figure output is here:
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Writers/OpenDocument.hs#L647
BUT I can't work out how the ODT writer changes this.
There is already a --embed-resources=true|false option, and I wonder if this could be triggered like:
> pandoc -t odt --embed-resources=false -o out.odt in.md
Which would trigger the images-as-links. This issue is for ODT as I think it should be a simple change, but DOCX also supports images-as-links (with more complex OpenXML changes needed)...
I think it might be confusing if --embed-resources had a default of true for odt but a default of false otherwise...
> I think it might be confusing if --embed-resources had a default of true for odt but a default of false otherwise...
Do I understand correctly, from the message from @iandol above, that OpenDocument is actually written with embedding set to false, and ODT and DOCX with it set to true?
Paolo
> I think it might be confusing if --embed-resources had a default of true for odt but a default of false otherwise...
Right, if used this would need to be clearly documented. The alternative is a new command-line option, which will probably sound similar (--embed-images or maybe --link-images) and, I imagine, increases maintenance a bit more.
> Do I understand correctly, from the message from @iandol above, that OpenDocument is actually written with embedding set to false, and ODT and DOCX with it set to true?
I don't think this is explicitly controlled. The opendocument writer uses links (technically embedding is false, but not through some sort of global switch); this is somehow changed for ODT, and DOCX always embeds. Having looked at the writers, I couldn't see an easy implementation, at least as someone who knows no Haskell.
I will add a reason for implementing this feature (linked/embedded images): while embedding may produce easier-to-handle ODT or DOCX files, it would also prevent them from being used as an intermediate file format for going from Markdown to a page layout program.
Programs like InDesign, Affinity Publisher, QuarkXPress, Scribus, can all import the RTF or DOCX file format. They are unfortunately unable to import Markdown. Apparently, there is no way to make Markdown compatibility a priority. Hence the importance of Pandoc in the process. If image links and names were preserved in the translation, the original aim of the Markdown project would also be preserved.
A page layout program as a last step before generating a PDF or ePub file is very useful, since many details can be finely adjusted in a way that isn't possible when targeting LaTeX or Typst for the output. Pandoc would allow a smooth integration between the ease of authoring offered by Markdown, and the fine control on details and high-quality typographic output offered by page layout programs.
Incidentally: I'm completing a PDF project started in Markdown and finished in Word, because it was impossible to finish it in a page layout program for lack of a reasonable way to go from Markdown to one. I hate Word, I hate the world! Nobody dare talk to me today!
It would be easy enough to modify the ODT writer to optionally skip the step that embeds the images (the transformPicMath function... though presumably we still want the math part of that).
The difficulty is figuring out what should trigger this. As I mentioned, it would be weird to make --embed-resources=false trigger it, because false is the default. In addition, --embed-resources is just for HTML.
One could add another option I suppose.
How about --link-images=true|false? I thought about --embed-images but it sounds confusingly similar to --embed-resources...
I think --link-images makes sense. I suppose that at first we could implement this for ODT only -- maybe it's also possible for docx.
Right, ODT appears straightforward as usual, and LibreOffice can convert to DOCX for anyone who needs DOCX.
While I think DOCX is low priority, out of curiosity I generated a minimal DOCX with a linked image to demonstrate the desired output. In word/document.xml the inline linked image is encoded by this baroque XML:
<w:r w:rsidR="00E10F1E">
<w:rPr>
<w:noProof/>
</w:rPr>
<w:drawing>
<wp:inline distT="0" distB="0" distL="0" distR="0">
<wp:extent cx="1270000" cy="419100"/>
<wp:effectExtent l="0" t="0" r="0" b="0"/>
<wp:docPr id="744031760" name="placeholder.png"/>
<wp:cNvGraphicFramePr>
<a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/>
</wp:cNvGraphicFramePr>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr>
<pic:cNvPr id="744031760" name="placeholder.png"/>
<pic:cNvPicPr/>
</pic:nvPicPr>
<pic:blipFill>
<a:blip r:link="rId4"/>
<a:stretch>
<a:fillRect/>
</a:stretch>
</pic:blipFill>
<pic:spPr>
<a:xfrm>
<a:off x="0" y="0"/>
<a:ext cx="1270000" cy="419100"/>
</a:xfrm>
<a:prstGeom prst="rect">
<a:avLst/>
</a:prstGeom>
</pic:spPr>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:inline>
</w:drawing>
</w:r>
The link to disk is stored in word/_rels/document.xml.rels as Id="rId4":
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings" Target="webSettings.xml"/>
<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/>
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
<Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/>
<Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="file:////Users/ian/placeholder.png" TargetMode="External"/>
</Relationships>
Sorry forgot to add the docx:
placeholder.png:
Note that Word uses an absolute path whereas LibreOffice uses a relative path. I will test whether a relative path works if I manually edit the XML...
EDIT: using Target="file:///placeholder.png" seemed to work (I got a warning on opening, probably because I manually edited the file), and saving to a new Word document kept the relative path.
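To check which images in a DOCX are linked rather than embedded, the relationships part can be scanned for image entries marked TargetMode="External". The following is a minimal Python sketch, not part of Pandoc; the function name external_image_targets and the SAMPLE_RELS constant are made up for illustration, and the sample is trimmed from the document.xml.rels shown above.

```python
import xml.etree.ElementTree as ET

# Namespace and relationship type as they appear in the .rels file above.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)

# Trimmed copy of word/_rels/document.xml.rels from the example above.
SAMPLE_RELS = (
    '<Relationships xmlns="' + RELS_NS + '">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/styles" Target="styles.xml"/>'
    '<Relationship Id="rId4" Type="' + IMAGE_TYPE + '" '
    'Target="file:////Users/ian/placeholder.png" TargetMode="External"/>'
    "</Relationships>"
)


def external_image_targets(rels_xml):
    """Map relationship Id -> Target for images linked as external files."""
    root = ET.fromstring(rels_xml)
    targets = {}
    for rel in root.findall("{%s}Relationship" % RELS_NS):
        is_image = rel.get("Type") == IMAGE_TYPE
        is_external = rel.get("TargetMode") == "External"
        if is_image and is_external:
            targets[rel.get("Id")] = rel.get("Target")
    return targets


print(external_image_targets(SAMPLE_RELS))
```

For a real file, the same function can be fed the bytes returned by zipfile.ZipFile("out.docx").read("word/_rels/document.xml.rels"). An embedded image would instead have a Target inside the package and no TargetMode="External".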
Here's a patch that would add --link-images.
diff --git a/src/Text/Pandoc/App/CommandLineOptions.hs b/src/Text/Pandoc/App/CommandLineOptions.hs
index c3abe1ba1..c50ec6208 100644
--- a/src/Text/Pandoc/App/CommandLineOptions.hs
+++ b/src/Text/Pandoc/App/CommandLineOptions.hs
@@ -601,6 +601,14 @@ options =
"true|false")
"" -- "Make slide shows include all the needed js and css"
+ , Option "" ["link-images"] -- maybe True (\argStr -> argStr == "true") arg
+ (OptArg
+ (\arg opt -> do
+ boolValue <- readBoolFromOptArg "--link-images" arg
+ return opt { optLinkImages = boolValue })
+ "true|false")
+ "" -- "Link images in ODT rather than embedding them"
+
, Option "" ["request-header"]
(ReqArg
(\arg opt -> do
diff --git a/src/Text/Pandoc/App/Opt.hs b/src/Text/Pandoc/App/Opt.hs
index c1f16279c..b6050f117 100644
--- a/src/Text/Pandoc/App/Opt.hs
+++ b/src/Text/Pandoc/App/Opt.hs
@@ -119,6 +119,7 @@ data Opt = Opt
, optIncremental :: Bool -- ^ Use incremental lists in Slidy/Slideous/S5
, optSelfContained :: Bool -- ^ Make HTML accessible offline (deprecated)
, optEmbedResources :: Bool -- ^ Make HTML accessible offline
+ , optLinkImages :: Bool -- ^ Link ODT images rather than embedding
, optHtmlQTags :: Bool -- ^ Use <q> tags in HTML
, optHighlightStyle :: Maybe Text -- ^ Style to use for highlighted code
, optSyntaxDefinitions :: [FilePath] -- ^ xml syntax defs to load
@@ -201,6 +202,7 @@ instance FromJSON Opt where
<*> o .:? "incremental" .!= optIncremental defaultOpts
<*> o .:? "self-contained" .!= optSelfContained defaultOpts
<*> o .:? "embed-resources" .!= optEmbedResources defaultOpts
+ <*> o .:? "link-images" .!= optLinkImages defaultOpts
<*> o .:? "html-q-tags" .!= optHtmlQTags defaultOpts
<*> o .:? "highlight-style"
<*> o .:? "syntax-definitions" .!= optSyntaxDefinitions defaultOpts
@@ -526,6 +528,8 @@ doOpt (k,v) = do
parseJSON v >>= \x -> return (\o -> o{ optSelfContained = x })
"embed-resources" ->
parseJSON v >>= \x -> return (\o -> o{ optEmbedResources = x })
+ "link-images" ->
+ parseJSON v >>= \x -> return (\o -> o{ optLinkImages = x })
"html-q-tags" ->
parseJSON v >>= \x -> return (\o -> o{ optHtmlQTags = x })
"highlight-style" ->
@@ -738,6 +742,7 @@ defaultOpts = Opt
, optIncremental = False
, optSelfContained = False
, optEmbedResources = False
+ , optLinkImages = False
, optHtmlQTags = False
, optHighlightStyle = Just "pygments"
, optSyntaxDefinitions = []
diff --git a/src/Text/Pandoc/App/OutputSettings.hs b/src/Text/Pandoc/App/OutputSettings.hs
index d08cb626b..11d813e5e 100644
--- a/src/Text/Pandoc/App/OutputSettings.hs
+++ b/src/Text/Pandoc/App/OutputSettings.hs
@@ -262,6 +262,7 @@ optToOutputSettings scriptingEngine opts = do
, writerReferenceDoc = optReferenceDoc opts
, writerSyntaxMap = syntaxMap
, writerPreferAscii = optAscii opts
+ , writerLinkImages = optLinkImages opts
}
return $ OutputSettings
{ outputFormat = format
diff --git a/src/Text/Pandoc/Options.hs b/src/Text/Pandoc/Options.hs
index 20aec2624..e4ff56b77 100644
--- a/src/Text/Pandoc/Options.hs
+++ b/src/Text/Pandoc/Options.hs
@@ -325,6 +325,7 @@ data WriterOptions = WriterOptions
, writerReferenceLocation :: ReferenceLocation -- ^ Location of footnotes and references for writing markdown
, writerSyntaxMap :: SyntaxMap
, writerPreferAscii :: Bool -- ^ Prefer ASCII representations of characters when possible
+ , writerLinkImages :: Bool -- ^ Use links rather than embedding ODT images
} deriving (Show, Data, Typeable, Generic)
instance Default WriterOptions where
@@ -363,6 +364,7 @@ instance Default WriterOptions where
, writerReferenceLocation = EndOfDocument
, writerSyntaxMap = defaultSyntaxMap
, writerPreferAscii = False
+ , writerLinkImages = False
}
instance HasSyntaxExtensions WriterOptions where
diff --git a/src/Text/Pandoc/Writers/ODT.hs b/src/Text/Pandoc/Writers/ODT.hs
index 8464a01e0..8eec979d9 100644
--- a/src/Text/Pandoc/Writers/ODT.hs
+++ b/src/Text/Pandoc/Writers/ODT.hs
@@ -272,15 +272,19 @@ transformPicMath opts (Image attr@(id', cls, _) lab (src,t)) = catchError
Just dim -> Just $ Inch $ inInch opts dim
Nothing -> Nothing
let newattr = (id', cls, dims)
- entries <- gets stEntries
- let extension = maybe (takeExtension $ takeWhile (/='?') $ T.unpack src) T.unpack
- (mbMimeType >>= extensionFromMimeType)
- let newsrc = "Pictures/" ++ show (length entries) <.> extension
- let toLazy = B.fromChunks . (:[])
- epochtime <- floor `fmap` lift P.getPOSIXTime
- let entry = toEntry newsrc epochtime $ toLazy img
- modify $ \st -> st{ stEntries = entry : entries }
- return $ Image newattr lab (T.pack newsrc, t))
+ src' <- if writerLinkImages opts
+ then return src
+ else do
+ entries <- gets stEntries
+ let extension = maybe (takeExtension $ takeWhile (/='?') $ T.unpack src) T.unpack
+ (mbMimeType >>= extensionFromMimeType)
+ let newsrc = "Pictures/" ++ show (length entries) <.> extension
+ let toLazy = B.fromChunks . (:[])
+ epochtime <- floor `fmap` lift P.getPOSIXTime
+ let entry = toEntry newsrc epochtime $ toLazy img
+ modify $ \st -> st{ stEntries = entry : entries }
+ return $ T.pack newsrc
+ return $ Image newattr lab (src', t))
(\e -> do
report $ CouldNotFetchResource src $ T.pack (show e)
return $ Emph lab)
However, it doesn't work (at least, LibreOffice raises an error and does not display the image). Had you actually tested ODTs with the linked images?
Yes, ODT definitely supports links. Here is a linked doc (same placeholder.png as above):
Saved as an ODT and flat FODT. Flanked by "Pre." and "Post." paragraphs:
The GUI shows an absolute path but it is saved slightly differently between the ODT (../placeholder.png) and FODT (placeholder.png):
ODT:
<text:p text:style-name="Standard">
<draw:frame draw:style-name="fr1" draw:name="Image1" text:anchor-type="as-char" svg:width="4.759cm" style:rel-width="28%" svg:height="1.549cm" style:rel-height="scale" draw:z-index="0">
<draw:image xlink:href="../placeholder.png" xlink:type="simple" xlink:show="embed" xlink:actuate="onLoad" draw:filter-name="&lt;All images&gt;" draw:mime-type="image/png"/>
</draw:frame>
</text:p>
FODT:
<text:p text:style-name="Standard">
<draw:frame draw:style-name="fr1" draw:name="Image1" text:anchor-type="as-char" svg:width="4.759cm" style:rel-width="28%" svg:height="1.549cm" style:rel-height="scale" draw:z-index="0">
<draw:image xlink:href="placeholder.png" xlink:type="simple" xlink:show="embed" xlink:actuate="onLoad" draw:filter-name="&lt;All images&gt;" draw:mime-type="image/png"/>
</draw:frame>
</text:p>
I wonder if there is something else in the document that is required. OK, here I take a Pandoc-generated ODT:
<office:body>
<office:text>
<text:p text:style-name="Text_20_body">Pre.</text:p>
<text:p text:style-name="Text_20_body">Post.</text:p>
</office:text>
</office:body>
Then I open it and add a linked image (pandoc+link.odt):
<office:body>
<office:text>
<text:sequence-decls>
<text:sequence-decl text:display-outline-level="1" text:separation-character="." text:name="Illustration"/>
<text:sequence-decl text:display-outline-level="0" text:name="Table"/>
<text:sequence-decl text:display-outline-level="0" text:name="Text"/>
<text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
<text:sequence-decl text:display-outline-level="0" text:name="Figure"/>
</text:sequence-decls>
<text:p text:style-name="P2">Pre.</text:p>
<text:p text:style-name="P2">
<draw:frame draw:style-name="fr1" draw:name="Image1" text:anchor-type="as-char" svg:width="16.51cm" style:rel-width="100%" svg:height="5.447cm" style:rel-height="scale" draw:z-index="0">
<draw:image xlink:href="../placeholder.png" xlink:type="simple" xlink:show="embed" xlink:actuate="onLoad" draw:filter-name="&lt;All images&gt;" draw:mime-type="image/png"/>
</draw:frame>
</text:p>
<text:p text:style-name="Text_20_body">Post.</text:p>
</office:text>
</office:body>
<text:sequence-decls> gets added into the office:body. I will try several ablation experiments to see what causes the ODT to fail to load.
Note: importing a linked image in LO wraps it in a caption box and floats it; I am manually removing the caption box and unfloating the image (making it inline) to try to simplify the test case. I will need to test a captioned image later on...
Here's another comparison. I generated an ODT with an image with Pandoc:
<office:body>
<office:text>
<text:p text:style-name="Text_20_body">Pre.</text:p>
<text:p text:style-name="Text_20_body">
<draw:frame draw:name="img1" svg:width="200.0pt" svg:height="66.0pt">
<draw:image xlink:href="Pictures/0.png" xlink:type="simple" xlink:show="embed" xlink:actuate="onLoad" />
</draw:frame>
</text:p>
<text:p text:style-name="Text_20_body">Post.</text:p>
</office:text>
</office:body>
Duplicated it and then converted the image to a link (see the screenshot above: you can add a link filename, which turns an embedded image into a linked image):
<office:body>
<office:text>
<text:sequence-decls><text:sequence-decl text:display-outline-level="0" text:name="Illustration"/><text:sequence-decl text:display-outline-level="0" text:name="Table"/><text:sequence-decl text:display-outline-level="0" text:name="Text"/><text:sequence-decl text:display-outline-level="0" text:name="Drawing"/><text:sequence-decl text:display-outline-level="0" text:name="Figure"/></text:sequence-decls>
<text:p text:style-name="Text_20_body">Pre.</text:p>
<text:p text:style-name="Text_20_body">
<draw:frame draw:style-name="fr1" draw:name="img1" text:anchor-type="as-char" svg:width="7.056cm" svg:height="2.328cm" draw:z-index="0">
<draw:image xlink:href="../placeholder.png" xlink:type="simple" xlink:show="embed" xlink:actuate="onLoad" draw:filter-name="&lt;All images&gt;" draw:mime-type="image/png"/></draw:frame>
</text:p>
<text:p text:style-name="Text_20_body">Post.</text:p>
</office:text>
</office:body>
In the untouched Pandoc ODT there is a META-INF/manifest.xml that does point to the Pictures/0.png image inside the ODT:
<?xml version="1.0" encoding="utf-8"?>
<manifest:manifest xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0" manifest:version="1.3">
<manifest:file-entry manifest:media-type="application/vnd.oasis.opendocument.text" manifest:full-path="/" manifest:version="1.3" />
<manifest:file-entry manifest:media-type="application/xml" manifest:full-path="content.xml" />
<manifest:file-entry manifest:media-type="image/png" manifest:full-path="Pictures/0.png" />
<manifest:file-entry manifest:media-type="application/rdf+xml" manifest:full-path="manifest.rdf" />
<manifest:file-entry manifest:media-type="application/xml" manifest:full-path="styles.xml" />
<manifest:file-entry manifest:media-type="application/xml" manifest:full-path="meta.xml" />
</manifest:manifest>
If you can upload a non-working ODT I can have a better look.
Here is an even more minimal file. I generated an ODT with Pandoc, duplicated it and edited it as follows:
- Removed the Pictures folder.
- Removed <manifest:file-entry manifest:media-type="image/png" manifest:full-path="Pictures/0.png" /> from manifest.xml
- Changed the image link in content.xml to xlink:href="../placeholder.png"
This produces a working ODT with a linked image (placeholder.png from above in the same folder):
Now, if I remove ../ from the href, then ODT complains:
So it seems the link must start with ../ to point to the parent folder. In this case I assume LO treats the zip root as ./, which then makes sense.
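The manual edit described above (drop the embedded picture, drop its manifest entry, point the href at ../placeholder.png) can be scripted. This is a rough Python sketch of those three steps, not what Pandoc does internally; the function name link_embedded_image is hypothetical, and it assumes the picture is referenced in content.xml by an xlink:href equal to its path inside the zip, as in the Pandoc output shown earlier.

```python
import io
import re
import zipfile


def link_embedded_image(odt_bytes, embedded_path, linked_href):
    """Return a copy of an ODT in which one embedded picture becomes a link."""
    # Matches the manifest entry for the embedded picture, e.g.
    # <manifest:file-entry ... manifest:full-path="Pictures/0.png" />
    manifest_entry = re.compile(
        r'<manifest:file-entry[^>]*manifest:full-path="'
        + re.escape(embedded_path)
        + r'"[^>]*/>\s*'
    )
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            name = info.filename
            if name == embedded_path:
                continue  # step 1: leave the embedded picture out
            data = src.read(name)
            if name == "META-INF/manifest.xml":
                # step 2: remove the picture's manifest entry
                text = manifest_entry.sub("", data.decode("utf-8"))
                data = text.encode("utf-8")
            elif name == "content.xml":
                # step 3: point the image at the external file
                text = data.decode("utf-8").replace(
                    'xlink:href="%s"' % embedded_path,
                    'xlink:href="%s"' % linked_href,
                )
                data = text.encode("utf-8")
            # The ODF "mimetype" entry is conventionally stored uncompressed.
            if name == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            dst.writestr(name, data, compress_type=method)
    return out.getvalue()
```

Usage would be along the lines of reading out.odt as bytes, calling link_embedded_image(data, "Pictures/0.png", "../placeholder.png"), and writing the result to a new file next to placeholder.png.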
OK, that was what I was missing: we have to put ../ in front of the relative path in the ODT.
I tried Pandoc 3.2.1 on my Intel Mac, and indeed the image path and name are included in the DOCX file. I did my conversion through Quarto 1.6.1.
However, Word for Mac continues to embed an image with its own RTF name. When imported into InDesign or Affinity Publisher, only the embedded image is considered.
I don't know if this is still something that can be solved on the Pandoc side, or it is something inside Word or the page layout programs that are importing it.
@ptram this patch only affects ODT, not DOCX.
> I tried Pandoc 3.2.1 on my Intel Mac, and indeed the image path and name are included in the DOCX file. I did my conversion through Quarto 1.6.1.
Also this is only testable in a nightly build: https://github.com/jgm/pandoc/actions/workflows/nightly.yml for example: https://github.com/jgm/pandoc/actions/runs/9790558535/artifacts/1667189311 -- it hasn't made it to a release yet.
This is a simple test with an image called placeholder.png in the same folder: pandoc > ODT
Then saved with LO as DOCX:
At least when saved from LO, the DOCX ZIP does not embed the image and shows that the image is external:
<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="file:///Users/ian/placeholder.png" TargetMode="External"/>
Word treats it as an absolute path, but I assume it will adjust based on the loading location? But how that imports I can't test.
LibreOffice is better at a bunch of stuff and often makes a better intermediate than Word itself.
Apologies for doing the usual mess.
In Word for Mac, the above "out.docx" file translates this way:
How the file path translates I can't say, since I've yet to discover a way to show it in Word (I read this feature may have been removed in recent years, for privacy reasons). I'm not even able to see a file name and path inside the DOCX file, when examining it as raw text.
@ptram So, the issue is that you are using --link-images to create an ODT, and then converting the resulting ODT to docx, and getting the wrong link path. I think I understand why. When we add the image as a link to the ODT, we have to include a ../ prefix to the path (so the link points to ../placeholder.png, not ./placeholder.png). I assume the docx fails because it doesn't see the image at ../placeholder.png (it is resolving the path relative to the working directory). You could test this by moving placeholder.png up to ../ and see if the docx then works.
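The path mismatch suspected here can be illustrated with plain path arithmetic. This small Python sketch only models the hypothesis from the thread (that LibreOffice resolves the href relative to the zip root, while a later step resolves it relative to the directory containing the document); it does not reproduce the actual behaviour of LibreOffice or Word, and the example path /Users/ian/out.odt and the helper resolve_href are assumptions for illustration.

```python
import posixpath


def resolve_href(href, base):
    """Resolve a relative href against a base path and normalise the result."""
    return posixpath.normpath(posixpath.join(base, href))


# Hypothetical location of the generated document.
odt_path = "/Users/ian/out.odt"
href = "../placeholder.png"

# Assumption: the zip root (the .odt file itself) acts as "./",
# so "../" leads to the folder that contains the .odt file.
from_package_root = resolve_href(href, odt_path)

# Assumption: a consumer that resolves against the document's directory
# instead ends up one level higher.
from_document_dir = resolve_href(href, posixpath.dirname(odt_path))

print(from_package_root)   # /Users/ian/placeholder.png
print(from_document_dir)   # /Users/placeholder.png
```

Under these assumptions the second result is one directory too high, which would be consistent with the suggestion to move placeholder.png up one level as a test.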
Describe your proposed improvement and the problem it solves.
For many formatting workflows, editors or publishers prefer not to embed figures. ODT allows you to easily embed or link images, and in fact the opendocument writer already supports linking:
BUT odt forces the image to be embedded, so the same markdown becomes something like:
It would be great if there was a command-line option to allow linking to images (i.e. preserve the opendocument way for odt). This way we could generate ODT files with figures that were linked. The same technically applies to DOCX (Word does allow linking, but of course the syntax is much more complex).
Describe alternatives you've considered.
I imagine a Lua filter could do this, and I suspect it is a viable workaround?