jgm / pandoc

Universal markup converter
https://pandoc.org

Allow for inline images in ICML #8398

Closed samatcolumn closed 1 year ago

samatcolumn commented 1 year ago

Describe your proposed improvement and the problem it solves.

Currently pandoc uses <Link> to embed images in ICML files: https://github.com/jgm/pandoc/blob/415550a36a9f5cfb412a812836b835d12ec12cb8/src/Text/Pandoc/Writers/ICML.hs#L633

The ICML looks like this:

<Image Self="ue6" ItemTransform="0.29714 0 0 0.94667 -131.25 -56.25">
    <Properties>
      <Profile type="string">
        $ID/Embedded
      </Profile>
    </Properties>
    <Link Self="ueb" LinkResourceURI="" />
  </Image>

However it's often desirable to embed images directly using the <Contents> tag, which would look like this:

<Image Self="ue6" ItemTransform="0.29714 0 0 0.94667 -131.25 -56.25">
    <Properties>
      <Profile type="string">
        $ID/Embedded
      </Profile>
      <Contents>
        <![CDATA[iVBORw0KGgoAAAANSUhEUgAAAV4AAACWBAMAAABkyf1EAAAAG1BMVEXMzMyWlpacnJyqqqrFxcWxsbGjo6O3t7e+vr6He3KoAAAACXBIWXMAAA7EAAAOxAGV
Kw4bAAAEcElEQVR4nO2aTW/bRhCGh18ij1zKknMkbbf2UXITIEeyMhIfRaF1exQLA/JRclslRykO+rs7s7s0VwytNmhJtsA8gHZEcox9PTs7uysQgGEYhmEYhmEYhmEYhmEYhmEYhmEYhmEYhmE
YhmEYhmGYr2OWRK/ReIKI8Zt7Hb19wTcQ0uTkGh13bQupcw7gPOvdo12/5CzNtNR7xLUtNtT3CGBQ6g3InjY720pvofUec22LJPr8PhEp2OMPyI40PdwWUdronCu9yQpdPx53bQlfLKnfOVhlnD
YRBXve4Ov+IZTeMgdedm0NR+xoXJeQvdJ3CvziykSukwil16W/Oe7aGjIjqc/9ib4jQlJy0uArtN4A0+cvXFvDkmUJ47sJ1Y1ATLDNVXZkNPIepQzxy1ki9fqiwbUj/I+64zxWNzyZnPuhvohJ9
K70VvXBixpcu2SAHU+Xd9EKdEJDNpYP3AQr3bQSpPQ6Y6/4dl1z7ZDbArsszjA7L0g7ibB0CDcidUWVoErvIMKZh2Xs0LUzcLW6V5NfiUgNEbaYmAVL6bXl0nJRc+1S72ua/D/cTjGPlQj7eUqd
7A096rYlRjdPYlhz7VIvxpVG3cemDKF+WAwLY/6XelOZKTXXzsC4xvDjjtSN6kHLhLke6PrwM8h1raf40qjrGO7H9aTEbduucjS04ZrYU/4iuS5Z2Hdt0rvCLFdmLEXcU30AGddST62o+sLcf5l
6k7CP+ru4pLYqX/VFyxbm/utQbx/r22ZEbTb2f5I2kns1Y1OQR8ZyofX+TjJxj1Rz7QQVnf1QzR26Oth0ueJVYcRP6ZUPac/Rx/5M6ixO1dhSrT3Y1DpiYmx3tF4ZUdpz9LD/dSg9PXES0LB71B
wcGjKROuV28lnvnv7HHJsezheBGH5+X2CfSfRbMKW+5aGs3JFjMrjGibJc0S7TJzqjHrh2hDybj9XRXNZa89Aro55XBdbW5wti2c/5WJ7jJ1RolVUn/HWpb0I58Tziup6Rx7Dm2hnbRP1GM9PW/
NFmQ4PtVRVN63Wvxfmu5sowDMMwDMMwDMMwDMMwDMMwDMMwzL+CpT//F/6beoV8zb2Jmt4Qryx6lTUCsENQ75HOkhXAO3EPVgyQtKtUy3C/e+FJg17Zjnew1Xrdb9InbG4WqfUAftG+WhLwPVyf
g536+MU7m4C1CMk4ZznpXZzDYI1PDL2nS1hpvc5cNd7E2sJg05Fe7/7d3Fln8Cvc3bwB616auxsKl4WPghjemHrDqyDWeu1UNW5s2btPnSQ75oOdunEwWazfwgVG0kqluYCM9OIjWOGnfA2b9G4
Ha63XKpvQ8perTvTifJNhi6+WMWmi7smEZf6G8MmhlyGq+NqP8GV84TLuJr7UIQVx+bDEoEpRZIz42gs40OuN4Mv8hXzelV7KX1isH+ewTWckikyVv+CfHuqVF7I16gN0VKypX6wPsE+zFPzkin
olU9UH8OMGvSpnZqKsv13p/RsMun6X5x/y2LeAr8O66lsBwzBMP/wJfyGq8pgBk6IAAAAASUVORK5CYII=]]>
      </Contents>
    </Properties>
</Image>

It would be great if there was a flag or an HTML data-* attribute we could set to have Pandoc choose the latter format.

Describe alternatives you've considered.

Currently we are post-processing the ICML output by Pandoc to achieve this.
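
A post-processing step of this kind could be sketched in Python as follows. This is only an illustration, not the script actually used: it assumes the `<Link>` element carries a base64 `data:` URI in its `LinkResourceURI` attribute, and the function name `embed_linked_images` is hypothetical.

```python
import re
from xml.dom import minidom

# Matches "data:<mime>;base64,<payload>" and captures the payload.
DATA_URI = re.compile(r"^data:[^;,]*;base64,(?P<payload>.*)$", re.DOTALL)


def embed_linked_images(icml: str) -> str:
    """Move base64 data-URI payloads from <Link> into <Properties>/<Contents>."""
    doc = minidom.parseString(icml)
    for image in doc.getElementsByTagName("Image"):
        children = [n for n in image.childNodes if n.nodeType == n.ELEMENT_NODE]
        links = [n for n in children if n.tagName == "Link"]
        props = [n for n in children if n.tagName == "Properties"]
        if not links or not props:
            continue
        match = DATA_URI.match(links[0].getAttribute("LinkResourceURI"))
        if match is None:
            continue  # a real file link (or an empty URI): leave it untouched
        contents = doc.createElement("Contents")
        contents.appendChild(doc.createCDATASection(match.group("payload")))
        props[0].appendChild(contents)
        image.removeChild(links[0])
    return doc.toxml()
```

Links that point at real files are left alone, so the same pass can run over documents that mix linked and embedded images.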

jgm commented 1 year ago

What you're looking for is for --embed-resources to work with ICML. Sorry, I see now that we have an embedded resource in either case; the question is just what syntax is used.

Can you say more about what difference it makes? Why is a Contents element preferable to a Link with a data URI? Is it preferable in all cases? If so, we could just make that the default.

samatcolumn commented 1 year ago

@jgm ICML / IDML are such arcane formats that I can't say it's preferable in all cases, but for our use case it is required in order to get InDesign Server to actually render the image when we load the ICML and then export PDF / JPG / etc.

Sorry I wish that was more helpful! I can do some more digging through the ICML / IDML docs and see if I can understand more.

leohentschker commented 1 year ago

Following up here! From my perspective, using Contents is preferable to using Links, because Contents renders immediately in InDesign and the Link elements do not.

Here is documentation from Adobe that walks through some of the reasons to use linked vs embedded content. The primary benefit of using Links is to decrease overall file size by using URIs as opposed to base64 content. But if the Link is already including the base64 image, the benefit is lost.
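
The size argument can be made concrete with a little arithmetic. Base64 encodes every 3 raw bytes as 4 characters, so an embedded image takes roughly a third more space than the original file. A minimal sketch (the helper name `base64_size` is hypothetical):

```python
import base64


def base64_size(raw_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


# Cross-check the formula against the standard library on a small payload.
assert base64_size(10) == len(base64.b64encode(b"\0" * 10))
```

By this formula a 3,000,000-byte image becomes 4,000,000 characters of base64, whether it sits in a `data:` URI or in a `<Contents>` element.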

jgm commented 1 year ago

OK, I'll just change the default behavior, then.

jgm commented 1 year ago

So we're dropping information that the file is image/png; how does it know? Does it determine this from the contents of the file? (I know that's possible -- just wondering.) What file types can go in Contents in this way?

samatcolumn commented 1 year ago

@jgm I just did a test with InDesign: I created two blank canvases and pasted a PNG into one and a JPG into the other. Neither had a MIME type in the <Image> tag or any other file type information.
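
Detecting the type from the contents is straightforward in principle, because common image formats begin with fixed signature bytes. The sketch below shows the general technique only; how InDesign itself identifies the format is not documented in this thread, and the names `SIGNATURES` and `sniff_image_type` are hypothetical.

```python
import base64

# Leading "magic" bytes of a few common image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(b64_payload: str):
    """Guess the image type from the first bytes of a base64 payload."""
    compact = "".join(b64_payload.split())  # drop line breaks inside the payload
    head = base64.b64decode(compact[:16])   # 16 base64 chars decode to 12 bytes
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None  # unknown or not an image
```

Applied to the payload in the example above, which starts with `iVBORw0KGgo`, this returns `"png"`.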

samatcolumn commented 1 year ago

@jgm wow thanks for closing this so fast! You are a legend.

samatcolumn commented 1 year ago

@jgm do you have any plans for a release soon (2.19.3 or 2.20.0) which would include this feature or should I just keep building from source?

jgm commented 1 year ago

It's going to be 3.0 when it happens. Still waiting on some changes to the Lua subsystem, but we are getting pretty close. For now you can build from source or use a nightly.

samatcolumn commented 1 year ago

@jgm ah a nightly sounds like what I need! I'm likely just being dense but I wasn't able to find nightly releases in this repo or on the pandoc site ... Google points me to https://github.com/pandoc-extras/pandoc-nightly but that seems to have stopped in mid-2020. Would you mind pointing me to the best place to grab a nightly .exe?

jgm commented 1 year ago

https://github.com/jgm/pandoc/actions/workflows/nightly.yml

ptram commented 1 year ago

I'll leave a note to say that, in general, linking images is the preferred solution in InDesign. Embedded resources make the InDesign file too heavy, filling the RAM, the virtual memory, and causing crashes. This is particularly evident with longer documents, but InDesign is mostly meant for long documents.

samatcolumn commented 1 year ago

@ptram agreed! The hard part for me was understanding how to make a portable ICML file with links to images ... seems like it'd need to be a zip or something?

Do you know of a way?

ptram commented 1 year ago

> Do you know of a way?

Yes, you have the page layout file (ICML, IDML or INDD), and a folder containing the images (the one InDesign calls Links when "packaging for print"). You can then zip them and deliver the zip containing the page layout and the linked resources.
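
The packaging step described above can be sketched in Python. The function name `package_layout` and the file names in the usage note are hypothetical, and the archive layout (layout file at the top level, images under the links folder's own name) is an assumption rather than an InDesign requirement.

```python
import zipfile
from pathlib import Path


def package_layout(icml_file: Path, links_dir: Path, archive: Path) -> list:
    """Zip a layout file together with its folder of linked images."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # The layout file goes at the top level of the archive.
        zf.write(icml_file, icml_file.name)
        # Images keep their path relative to the links folder.
        for image in sorted(links_dir.rglob("*")):
            if image.is_file():
                arcname = Path(links_dir.name) / image.relative_to(links_dir)
                zf.write(image, arcname.as_posix())
        return zf.namelist()
```

Calling `package_layout(Path("chapter.icml"), Path("Links"), Path("chapter.zip"))` would produce an archive that unpacks to the layout file next to its `Links` folder.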

If there are different needs, I would suggest having two separate ICML writers: one for linked images, the other for embedded images (what does @jgm think about this?). In any case, the standard practice in InDesign is to use linked images (as happens with markdown). Embedding is considered a beginner's sin.

Apart from the size considerations mentioned above, linked images mean that collaborators working on the images can change them from draft to final, and the page layout document will be able to update them automatically. This is essential in production.

If you break a link when going from Markdown to ICML, the link is gone forever. And rebuilding links on long documents is a nightmare.

This is clearly explained by Adobe themselves (as linked above).

In general, I think that the idea behind using markdown (make the main document simple, and link everything) is a winning one. ICML is nothing more than a glorified ancestor of markdown, created as a bridge between PostScript and XML.

Paolo

ptram commented 1 year ago

I will go a bit more into the details of the InDesign workflow, just to be sure @jgm gets my point of view as clearly as possible. I will try to demonstrate that images should normally be linked in the ICML writer, and not embedded. Or, at least, there should be an option to make them linked.

InDesign (and this applies as well to QuarkXPress, Affinity Publisher or Viva Publisher) is a page layout program intended for creating brochures, leaflets, books, magazines, and anything that blends text and images and is intended for print or virtual print (PDF). There are people using it for writing novels or theses, but this is not the intended use, and it is not the best tool for it.

A typical workflow in InDesign is to have writers and visual artists produce their content, be it a narration, instructions, a series of illustrations, photos, or software screenshots. The original content is usually created with dedicated tools, for example word processors, photo editors, or CAD programs. These contributions are then assembled into an InDesign document. Several InDesign documents can be assembled into a Book, so that they get a common set of styles and formats.

In this workflow, images are linked. The writer and the page layout artist put a placeholder in the early version of the text or page layout document. The placeholder may be an early version of the illustration or screenshot, or a dummy photo with a size similar to the final one. They give the dummy placeholder a name and file path that is the same as the final image.

In something like VS Code, dragging an image file into a text document automatically adds its path. If you want, you can also see a preview. Word processors vary, but some of them can also link images and only show a preview of them. As far as I know, they usually convert complex external files (like TIFF or PDF) into simpler bitmap data, flattening layers and effects.

When the final image is ready, it replaces the dummy placeholder in the linked images folder. InDesign automatically updates it by reading the saved file path contained in the page layout document, with the text and page layout remaining the same as they were with the temporary dummy image.

This couldn't happen with embedded images, which InDesign wouldn't know where to find outside of its document. Updating would mean linking or importing the image again.

In my view, if you use Pandoc, you are after a markdown workflow; otherwise, you could simply import an RTF or DOCX file, including the embedded images it contains. A markdown workflow is essentially based on a separation between elements – text, external resources, style appearance. This is to facilitate reuse and confluence of contributions.

For this reason, I think the ICML writer should allow, or even privilege, linked images. Without linked images, images would have to be reimported after loading the ICML file into InDesign. This would have to be repeated each time the ICML document is generated again.

Paolo

jgm commented 1 year ago

I'd be fine to change to use linked images, unless people have objections we haven't considered. Can you give an example of how the emitted code should change?

ptram commented 1 year ago

> Can you give an example of how the emitted code should change?

Dear John, thank you very much for your willingness to implement this change. I'm attaching the original InDesign (INDD), the exported full document (IDML), and the exported snippet (ICML) from a very simple document, only containing a frame, a paragraph, and a linked image. I'm also attaching the linked image.

I guess the interesting code is this one, where text and the rectangle containing the linked image are declared:

This is a linked image:
$ID/None

and this one, where the file path is declared:

[etc.]

Does this give you some useful information?

Paolo

[icml-export-example.zip](https://github.com/jgm/pandoc/files/12386739/icml-export-example.zip)

jgm commented 1 year ago

I guess the most useful thing would be a diff between pandoc's current output and the output you desire. I know nothing about the format so I'd need very precise instructions.

ptram commented 1 year ago

> I guess the most useful thing would be a diff between pandoc's current output and the output you desire. I know nothing about the format so I'd need very precise instructions.

Oh my, I'm the one who knows nothing about the needed code. What I can say is that I made sure I had the latest version of Pandoc (3.1.6.1) and tried to reconvert an MD file generated by Scrivener.

The result is what I desired, but didn't expect: the linked images were actually linked (image from Affinity Publisher):

image

The ICML code generated by Pandoc indeed includes the file path:

<Image [...]>
    <Properties>
      <Profile type="string">
        $ID/Embedded
      </Profile>
    </Properties>
    <Link Self="ueb" LinkResourceURI="file:/Users/[...]/Documents/Markdown/mmd-from-scriv.md/images/CC3.png" />
  </Image>

The odd thing is that after importing the ICML file into InDesign CS6, the Links pane doesn't allow relinking the images. The pane shows the file path, but the Relink command is not available. Maybe this is caused by the $ID/Embedded instruction included in the generated ICML file. It looks like a linked image has the file type (PNG, JPEG…) instead of the "Embedded" declaration.

After exporting the InDesign file to an IDML file, Affinity Publisher can see the images as linked ones, and allows for relinking. The further export seems to fix the issue.

This discussion seems related to this issue, so maybe you can gather some useful info from it? There is a snippet of code that might contain an example similar to the one we are after:

How to extract image from IDML file using IDMLlib

I must say that at the moment, for reasons that I'm not able to explain, the ICML writer is doing what I think it should do – preserving the links to the original image files.

Paolo

ptram commented 1 year ago

(While doing my tests, I found that the writer doesn't allow relative paths, like "/images/image.png", but it requires the full absolute path. This can probably be worked around in InDesign/Publisher, but it may be an inconvenience. Should I open another issue to report this?).

jgm commented 1 year ago

/images/image.png is an absolute path. Maybe you want ./images/image.png or images/image.png?

ptram commented 1 year ago

> /images/image.png is an absolute path. Maybe you want ./images/image.png or images/image.png?

Sorry, you are perfectly right. Yes, I would like something like ./images/image.png or images/image.png, relative to the position of the MD file to be converted. Apparently, at the moment they are not recognized by the writer (or by Pandoc as a whole?).

jgm commented 1 year ago

Relative paths are interpreted relative to the working directory (which might be different from the directory containing the markdown file). But see the documentation on the rebase_relative_paths extension.
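
The distinction can be illustrated with a small Python sketch. It does not show pandoc's implementation; `resolve_image` is a hypothetical helper that resolves an image reference against whichever base directory is chosen (the working directory, or the directory of the markdown file), and the example paths are made up.

```python
from pathlib import PurePosixPath


def resolve_image(src: str, base_dir: str) -> PurePosixPath:
    """Resolve an image reference against a base directory (POSIX semantics)."""
    path = PurePosixPath(src)
    if path.is_absolute():
        return path  # "/images/image.png" ignores the base directory entirely
    return PurePosixPath(base_dir) / path


# The same relative reference lands in different places depending on the base.
from_cwd = resolve_image("images/image.png", "/home/user")
from_doc = resolve_image("images/image.png", "/home/user/docs/book")
```

Here `from_cwd` is `/home/user/images/image.png` while `from_doc` is `/home/user/docs/book/images/image.png`, which is why a conversion can fail to find images when it is run from a directory other than the one containing the markdown file.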

ptram commented 1 year ago

> Relative paths are interpreted relative to the working directory

I've read a bit about this issue, and I understand it's a complex task that I have to study more in depth.

As for the $ID/Embedded declaration replaced with the file format, it didn't work in my tests. I tried with $ID/PNG without changing anything else, and there is no difference in how InDesign CS6 behaved.

The odd thing is that even the exported IDML file still results in images that can't be relinked in InDesign CS6, while they can be relinked in Affinity Publisher 2. Probably the age difference shows.

ptram commented 1 year ago

Just to be sure: the official reference to IDML/ICML is still available via GitHub:

IDML 7.0 Specification

It is very interesting how InDesign looks for the location of the linked files:

InDesign first looks for the link in the folder containing the IDML file. If the file cannot be located in the folder, InDesign searches for the file by using the file path to the IDML document. If the file is still not found, InDesign goes up one level in the IDML document’s path, and tries again. Finally, InDesign looks for the link in the folders that have been specified by the user when updating file links during the current InDesign session.
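
One possible reading of this search order, sketched in Python. This is an interpretation of the quoted passage, not InDesign's actual code: in particular, the second step (resolving the stored link path relative to the IDML document's location) is an assumption, and the function name `candidate_locations` and all paths are hypothetical.

```python
from pathlib import PurePosixPath


def candidate_locations(idml_file: str, link_path: str, session_folders=()) -> list:
    """List, in order, the places where a linked file might be looked up."""
    idml_dir = PurePosixPath(idml_file).parent
    link = PurePosixPath(link_path)
    candidates = [
        idml_dir / link.name,         # 1. the folder containing the IDML file
        idml_dir / link,              # 2. stored path, taken relative to the IDML file
        idml_dir.parent / link.name,  # 3. one level up from the IDML file's folder
    ]
    # 4. folders the user pointed to while relinking in the current session
    candidates.extend(PurePosixPath(folder) / link.name for folder in session_folders)
    ordered = []
    for candidate in candidates:
        if candidate not in ordered:  # drop duplicates, keep the first occurrence
            ordered.append(candidate)
    return ordered
```

For an IDML file at `/jobs/book/ch1.idml` and a stored link `images/CC3.png`, this yields `/jobs/book/CC3.png`, `/jobs/book/images/CC3.png`, and `/jobs/CC3.png`, followed by any session folders.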

samatcolumn commented 1 year ago

Are there other Pandoc export targets where the file has "links" to other local files? The reason we preferred embedded images was because we send the ICML to a server and wanted an "all in one" format. If we move to linked images (which does sound nice) I'd like to understand if there's a standard I should be using to re-assemble the required file system on the other side.

Also since my use case is HTML -> ICML I'll say that I think there are three distinct things to care about:

I assume that the three cases above have already been well handled by Pandoc for converting HTML -> Any other format that supports images. I'm probably showing my Pandoc ignorance!

ptram commented 1 year ago

> Are there other Pandoc export targets where the file has "links" to other local files?

I'm not a Pandoc expert, but I believe the following formats are written with images as links to separate resources:

Markdown, HTML, ePub, LaTeX, OPML, MediaWiki markup.

Paolo

ptram commented 1 year ago

I've just made a test with a big file, compiled by Scrivener to MMD and converted by Pandoc to ICML. Links to the original image files are preserved.

Paolo