taliesinb opened this issue 4 years ago
Let's split it at least into four separate issues (maybe more):
1. A function (`PackageScope`) that will take a held expression and produce the correctly formatted image with the output, together with the needed metadata (i.e., image width) (see the sketch after this list).
2. A linter that will check (and fix with an `-i` flag) the formatting, e.g., line breaks, extraneous spaces, etc.
3. A linter that will find the code cells (the ones with `In[] :=` in the beginning), and verify the images in the corresponding images directories exist and are consistent. If images don't exist, it will generate them with an `-i` flag.
4. … with an `-i` flag.
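For the first item, a minimal sketch of what such a function might look like; the name `rasterizeCell`, the resolution, and the returned keys are placeholders rather than a settled API:

```wl
(* Hypothetical sketch of the PackageScope'd function from item 1: takes a held
   expression, rasterizes the result of evaluating it at high resolution, and
   returns the image together with the metadata needed for the markdown link. *)
rasterizeCell[HoldComplete[expr_]] := Module[{image},
  image = Rasterize[expr, "Image", ImageResolution -> 144];  (* HiDPI raster *)
  <|"Image" -> image, "Width" -> First[ImageDimensions[image]]|>
]
```

Usage would be something like `rasterizeCell[HoldComplete[Graphics[Circle[]]]]`, returning the rasterized graphic and its pixel width.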
Continue doing what we currently do. Personally I prefer working in semi-WYSIWYG editors since they allow rapid prototyping of the 'flow' of a post, and continuous minor tweaking of visual details and style. The current manual practice does not easily support this.
Yes, we should definitely automate it as much as possible.
- If the image after compression is small enough that it can fit into some small number of base64 characters (e.g. 400), it can be directly embedded in the markdown notebook via the "data" URL scheme (on a single line), avoiding the need for a separate file. This has several advantages, such as allowing the default rendering behavior of .md files within Github's diffing tools to include such images.
I don't think this is going to work. Most images are way over the 400-character limit (we want to have HiDPI images), and I don't think it's worth it to have a whole separate system just for a few below the limit.
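(For reference, if we ever did want this, producing such an inline `data:` URL from WL would be a one-liner; the resulting markdown line would look like `![label](data:image/png;base64,iVBORw0K…)` with the payload inlined:)

```wl
(* Sketch: encode an image as a base64 "data" URL for direct embedding in markdown *)
dataURL[image_Image] :=
  "data:image/png;base64," <> BaseEncode[ExportByteArray[image, "PNG"], "Base64"]
```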
- Pursuant to the earlier point, a heavily compressed, low-resolution version of the image could always be embedded via a "data" URL, to make diffs easier to evaluate and to allow a .md file to be self-contained in a way that makes it easier to share (albeit with lower visual quality for the associated images). Most editors are good at ignoring very long lines that contain embedded data in this way, but a definite limit should be placed to ensure that if line-wrapping is enabled, the image does not contain more than e.g. a half-page of base64.
Again, I don't think this makes sense. We definitely don't want to send low quality md files to people (given how easy it is to send them a link to Github instead). Plus, if we have the linter that will generate these images automatically anyway, we don't need to see them in the diffs at all. (The files under git ideally should not contain any auto-generated data.)
- For code cells that we do not wish to be visible in the rendered markdown, the responsible code could be maintained directly in an 'alt' tag of an embedded markdown image. This should be both readable in text editors and invisible in HTML rendering, as well as being machine-readable by the render script. In effect, WL just becomes a way to compactly and reproducibly embed a diagram in markdown (similar to but more powerful than SVG) when this is desired.
alt tags are for image descriptions. They are not for metadata. Actually, we should start using them for accessibility reasons. The metadata about whether the code should be evaluated or not should go into the markdown comments.
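For concreteness, such a comment could look something like `<!-- no-rasterize -->` placed next to the code block; the exact keywords here are hypothetical and would be defined in the corresponding issue.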
- If the file names for the exported files are based on hashes of the content, there can never be any confusion or inaccuracy about which image was intended to be reproduced.
Should we always seed the random number generator with the document path and the input code before generating the pictures for a particular document? Some images are random by design (e.g., "EventOrderingFunction" -> "Random" in the docs), and we probably don't want to put a seed by hand into each of those inputs.
- All images that are produced in this programmatic fashion become semi-automatic unit tests of the SetReplace project, since if the pixel values of the image change, something is no longer behaving in a way that the author of the markdown document explicitly approved when writing it. Such changes will of course require human review, and small differences in pixel values might need to be ignored due to very minor changes in e.g. font rendering across platforms. Major differences between images (e.g. evaluated with the L2 norm) definitely require human evaluation to decide whether they represent a regression of some kind.
They can actually become fully automated unit tests because the image-generating technology should be in the linter, and CI runs the linter. Performance bothers me, though. Some images take a very long time to generate. We don't want to run them on CI for every commit. If we do the random seeding correctly, we should not need the L2 norm, they should just be the same, or the images should be updated in the PR that has changed them.
- All research documents written in this way are visually reproducible: there is no opportunity for mistakes or omissions, such as can happen when cells depend on the output of other cells and the author gets confused by a particular idiosyncratic evaluation order of cells. (Some particular Wolfram Physics bulletins spring to mind as being able to benefit from this more rigorous approach.) If you can generate a research document with particular visual outputs, so can I, by simply running the same script. This does imply that a particular state must be maintained during evaluation, and cells must (probably) be evaluated top-to-bottom. This can be made smarter in various ways, such as by doing dependency tracking between the code in various cells and doing a topological sort before the global evaluation is performed -- this would make it an error to assign a global variable more than once, which is probably a good idea anyway. Environments like [Observable](https://observablehq.com) already use similar ideas.
I don't think we should overcomplicate things. I think it would be confusing to the reader if the notebook needs to be evaluated in a way other than top to bottom.
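(For reference, the dependency tracking from the quoted point could look roughly like this if we ever decide we need it; it is deliberately simplified to top-level `Set`/`SetDelayed` assignments, and each cell is assumed to hold a single expression:)

```wl
(* Sketch of cell dependency tracking: each cell is a code string; a cell must run
   after every cell that assigns a symbol it uses. Cycles would make
   TopologicalSort fail, which is arguably the right behavior. *)
cellAssignedSymbols[code_String] := Union @ Cases[
  ToExpression[code, InputForm, Hold],
  (Set | SetDelayed)[lhs_Symbol, _] :> HoldForm[lhs], {0, Infinity}];

cellUsedSymbols[code_String] := Union @ Cases[
  ToExpression[code, InputForm, Hold],
  s_Symbol :> HoldForm[s], {0, Infinity}, Heads -> True];

cellEvaluationOrder[cells_List] := Module[{assigned, used, edges},
  assigned = cellAssignedSymbols /@ cells;
  used = cellUsedSymbols /@ cells;
  edges = Flatten @ Table[
    If[i =!= j && Intersection[assigned[[i]], used[[j]]] =!= {},
      DirectedEdge[i, j], Nothing],
    {i, Length[cells]}, {j, Length[cells]}];
  TopologicalSort[Graph[Range[Length[cells]], edges]]  (* cell indices in evaluation order *)
]
```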
- This whole approach can be extended into a wider philosophy about "data artifacts", meaning objects that have the following features:
  - require computation to produce
  - can be produced reproducibly by any machine
  - are too large to embed directly in a document, although their metadata and computed properties can be embedded
  - are used to support scientific conclusions
  - possibly depend on other data artifacts, and on code
A flexible representation of the computation graph that connects such objects and allows opportunistic caching, sharing via distributed protocols such as IPFS, etc, is long overdue and would revolutionize the way we conduct certain kinds of science. I see the image embedding story as the first step up such a ladder, since images are the simplest kind of "data artifacts" that are involved in this kind of work. I have much to say on this topic, but I think what I've written above gives an overview of the main ideas.
To do the smart caching, we will need a deep understanding of the code to know whether a particular PR could have changed the images. I imagine this will be quite difficult to do, but if done, it will allow us to run the entire thing on CI for every commit, which would be quite incredible indeed.
We discussed these same points in the planning meeting, and I agree with most of your proposed alterations of my original proposal. Some differences remain:
Should we always seed the random number generator with the document path and the input code before generating the pictures for a particular document?
I don't see why we shouldn't just always seed with the number 1. If you rename or move the md file, I don't think the images should all suddenly change. We could hash the code and use that as a seed, but I don't know exactly what that buys us. If you want to demonstrate that something is non-deterministic, do it with Row[Table[...]] and the RNG state will change within the loop.
They can actually become fully automated unit tests because the image-generating technology should be in the linter, and CI runs the linter. Performance bothers me, though. Some images take a very long time to generate. We don't want to run them on CI for every commit. If we do the random seeding correctly, we should not need the L2 norm, they should just be the same, or the images should be updated in the PR that has changed them.
We could specify a flag as part of the surrounding markdown comment to indicate that an image is slow to produce and should not be checked by the linter (unless a --slow argument is passed; we can run the slow checks less frequently, e.g. before releases).
As for the L2 norm, I'm not sure if my point was clear: images can differ slightly depending on the platform that produced them, because of minor platform inconsistencies. These differences are almost always very small. Hence, I'm proposing that the linter does not enforce a 100% match between the locally-regenerated image and the existing image, merely a 99.99% match. The 0.01% is quantified via something like image L2 norm.
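A minimal sketch of such a tolerance check, with a placeholder threshold standing in for the "99.99% match":

```wl
(* Sketch: treat regenerated images as consistent with the committed ones if they
   only differ by tiny platform-dependent rendering differences. The 0.01
   per-pixel tolerance is a placeholder that would need tuning. *)
imagesConsistentQ[new_Image, old_Image] :=
  ImageDimensions[new] === ImageDimensions[old] &&
    ImageDistance[new, old] / (Times @@ ImageDimensions[new]) < 0.01
```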
To do the smart caching, we will need a deep understanding of the code to know whether a particular PR could have changed the images. I imagine this will be quite difficult to do, but if done, it will allow us to run the entire thing on CI for every commit, which would be quite incredible indeed.
This is indeed tricky when we have C++ and WL co-existing like this. The most conservative approach is that we hash the entire C++ codebase and build script arguments to produce the key that we use to decide staleness, so that if you change any C++ code, we assume that all images need to be regenerated. Anything finer grained than that is basically impossible for a language like C++. For Mathematica it is much more doable, since `Definitions` in GeneralUtilities will list the full set of values (`OwnValues`, `SubValues`, `UpValues`, `FormatValues`, `Attributes`, `Options`, etc.) of a function. We repeat this process recursively on all symbols that occur within these definitions and that stay within the desired context (SetReplace in this case). This allows us to define a Merkle tree quite straightforwardly. The rest of the opportunistic caching requires a bit of thought about what data structures and metadata we need to store, but the basic idea is pretty obvious I think.
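A rough sketch of that definition hashing, using only documented built-ins (`DownValues`, `Attributes`, etc.) instead of GeneralUtilities, and assuming the symbols of interest are functions without `OwnValues`:

```wl
(* Sketch: Merkle-style hash of a symbol's definition together with the definitions
   of every SetReplace`-context symbol it references. Memoization is omitted. *)
definitionData[symbol_Symbol] := Through[
  {DownValues, SubValues, UpValues, FormatValues, Attributes, Options}[symbol]];

referencedSymbols[symbol_Symbol] := Union @ Cases[
  definitionData[symbol],
  s_Symbol /; Context[s] === "SetReplace`",
  {0, Infinity}, Heads -> True];

(* all SetReplace` symbols reachable from the given ones *)
contextClosure[symbols_List] := FixedPoint[
  Union[#, Flatten[referencedSymbols /@ #]] &, symbols];

definitionHash[symbol_Symbol] := Hash[definitionData /@ contextClosure[{symbol}], "SHA256"];
```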
I don't see why we shouldn't just always seed with the number 1. If you rename or move the md file, I don't think the images should all suddenly change. We could hash the code and use that as a seed, but I don't know exactly what that buys us. If you want to demonstrate that something is non-deterministic, do it with Row[Table[...]] and the RNG state will change within the loop.
I agree about the file path. I still think we should hash the code. If we don't do that, we might get a false appearance of determinism which is not desirable. E.g., if someone has two code cells:
In[] := RandomInteger[10, 10]
Out[] = {5, 0, 1, 4, 2, 0, 3, 6, 8, 6}
and then
In[] := RandomInteger[10, 10]^2
Out[] = {4, 36, 49, 81, 36, 4, 81, 49, 4, 100}
it would be misleading if the numbers were the same. If we hash the input code, this would not happen unless someone has exactly the same code cell appearing multiple times, which would be quite unusual.
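A minimal version of that scheme, assuming each cell's code is available as a string:

```wl
(* Sketch: seed the RNG from the cell's own code before evaluating it, so outputs
   are reproducible but identical seeds are only shared by identical cells. *)
evaluateCell[code_String] := (SeedRandom[Hash[code, "SHA256"]]; ToExpression[code])
```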
We could specify a flag as part of the surrounding markdown comment to indicate that an image is slow to produce and should not be checked by the linter (unless a --slow argument is passed; we can run the slow checks less frequently, e.g. before releases).
Good idea.
As for the L2 norm, I'm not sure if my point was clear: images can differ slightly depending on the platform that produced them, because of minor platform inconsistencies. These differences are almost always very small. Hence, I'm proposing that the linter does not enforce a 100% match between the locally-regenerated image and the existing image, merely a 99.99% match. The 0.01% is quantified via something like image L2 norm.
Sounds like a WL weed, but ok, if that does happen, we need a workaround.
This is indeed tricky when we have C++ and WL co-existing like this. The most conservative approach is that we hash the entire C++ codebase and build script arguments to produce the key that we use to decide staleness, so that if you change any C++ code, we assume that all images need to be regenerated. Anything finer grained than that is basically impossible for a language like C++. For Mathematica it is much more doable, since `Definitions` in GeneralUtilities will list the full set of values (`OwnValues`, `SubValues`, `UpValues`, `FormatValues`, `Attributes`, `Options`, etc.) of a function. We repeat this process recursively on all symbols that occur within these definitions and that stay within the desired context (SetReplace in this case). This allows us to define a Merkle tree quite straightforwardly. The rest of the opportunistic caching requires a bit of thought about what data structures and metadata we need to store, but the basic idea is pretty obvious I think.
That's a fascinating idea! If we can do that reliably, we can do it for tests as well, which would save us a lot of CI time.
The problem
Creating and maintaining markdown files that incorporate images from Mathematica is needlessly tedious, involving manual copy-pasting from a live notebook, exporting to an image file on disk, and then hyperlinking the file into the markdown document. I think we can improve that with automated tooling.
Possible solution
The most basic technology would be:
1. A render script (a wolframscript) that processes a markdown file, finds WL code cells within it that indicate they wish to 'embed their output', and then runs these code cells to produce the desired images. These images are processed, compressed (e.g. using PNG limited-palette compression when appropriate), and automatically placed into the correct folder, and the corresponding hyperlink in the .md file is updated to point to the new file. If the images already exist within the .md file, they are merely replaced or updated. (A rough sketch follows this list.)
2. A more sophisticated technology would involve a bidirectional, lossless conversion between (a subset of) Mathematica notebooks and markdown files. This is not crucial to solve this issue, but is a missing piece of technology that would make it practical to work with markdown-based research in a WYSIWYG fashion for those who wish to do so. This would make cross-publishing to Wolfram Physics bulletins a more friendly experience.
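To make point 1 concrete, here is a very rough sketch of the render script; the cell convention (single lines starting with `In[] :=`), the directory layout, image sizing, and naming scheme are all placeholders rather than a settled format, and updating previously generated links is omitted:

```wl
(* Rough sketch: find WL input cells in a markdown file (assumed here to be single
   lines starting with "In[] := "), rasterize their output, write the images next
   to the document, and append an image link after each cell. *)
renderMarkdownImages[mdPath_String] := Module[{lines, imageDir, rendered},
  lines = StringSplit[Import[mdPath, "Text"], "\n"];
  imageDir = FileNameJoin[{DirectoryName[mdPath], "images"}];
  If[!DirectoryQ[imageDir], CreateDirectory[imageDir]];
  rendered = Map[Function[line,
    If[!StringStartsQ[line, "In[] := "], line,
      Module[{code, image, file},
        code = StringDrop[line, StringLength["In[] := "]];
        SeedRandom[Hash[code]];  (* reproducible randomness *)
        image = Rasterize[ToExpression[code], "Image", ImageResolution -> 144];
        file = FileNameJoin[{imageDir, IntegerString[Hash[image], 16] <> ".png"}];
        Export[file, image];
        line <> "\n\n<img src=\"images/" <> FileNameTake[file] <> "\" width=\"" <>
          ToString[Round[First[ImageDimensions[image]]/2]] <> "\">"  (* HiDPI: display at half size *)
      ]
    ]], lines];
  Export[mdPath, StringRiffle[rendered, "\n"], "Text"]
]
```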
I discuss some small elaborations in the points below.
Alternative solutions
Continue doing what we currently do. Personally I prefer working in semi-WYSIWYG editors since they allow rapid prototyping of the 'flow' of a post, and continuous minor tweaking of visual details and style. The current manual practice does not easily support this.
Additional context
Some small elaborations of point 1 are described here that are not necessary for the minimal viable product:
- An auxiliary script can perform garbage collection, removing image files in the images folder that are no longer referenced by any markdown file (a small sketch is included at the end of this issue).
- If the image after compression is small enough that it can fit into some small number of base64 characters (e.g. 400), it can be directly embedded in the markdown notebook via the "data" URL scheme (on a single line), avoiding the need for a separate file. This has several advantages, such as allowing the default rendering behavior of .md files within Github's diffing tools to include such images.
- Pursuant to the earlier point, a heavily compressed, low-resolution version of the image could always be embedded via a "data" URL, to make diffs easier to evaluate and to allow a .md file to be self-contained in a way that makes it easier to share (albeit with lower visual quality for the associated images). Most editors are good at ignoring very long lines that contain embedded data in this way, but a definite limit should be placed to ensure that if line-wrapping is enabled, the image does not contain more than e.g. a half-page of base64.
- For code cells that we do not wish to be visible in the rendered markdown, the responsible code could be maintained directly in an 'alt' tag of an embedded markdown image. This should be both readable in text editors and invisible in HTML rendering, as well as being machine-readable by the render script. In effect, WL just becomes a way to compactly and reproducibly embed a diagram in markdown (similar to but more powerful than SVG) when this is desired.
- If the file names for the exported files are based on hashes of the content, there can never be any confusion or inaccuracy about which image was intended to be reproduced.
- All images that are produced in this programmatic fashion become semi-automatic unit tests of the SetReplace project, since if the pixel values of the image change, something is no longer behaving in a way that the author of the markdown document explicitly approved when writing it. Such changes will of course require human review, and small differences in pixel values might need to be ignored due to very minor changes in e.g. font rendering across platforms. Major differences between images (e.g. evaluated with the L2 norm) definitely require human evaluation to decide whether they represent a regression of some kind.
- All research documents written in this way are visually reproducible: there is no opportunity for mistakes or omissions, such as can happen when cells depend on the output of other cells and the author gets confused by a particular idiosyncratic evaluation order of cells. (Some particular Wolfram Physics bulletins spring to mind as being able to benefit from this more rigorous approach.) If you can generate a research document with particular visual outputs, so can I, by simply running the same script. This does imply that a particular state must be maintained during evaluation, and cells must (probably) be evaluated top-to-bottom. This can be made smarter in various ways, such as by doing dependency tracking between the code in various cells and doing a topological sort before the global evaluation is performed -- this would make it an error to assign a global variable more than once, which is probably a good idea anyway. Environments like [Observable](https://observablehq.com) already use similar ideas.
- This whole approach can be extended into a wider philosophy about "data artifacts", meaning objects that have the following features:
  - require computation to produce
  - can be produced reproducibly by any machine
  - are too large to embed directly in a document, although their metadata and computed properties can be embedded
  - are used to support scientific conclusions
  - possibly depend on other data artifacts, and on code
A flexible representation of the computation graph that connects such objects and allows opportunistic caching, sharing via distributed protocols such as IPFS, etc, is long overdue and would revolutionize the way we conduct certain kinds of science. I see the image embedding story as the first step up such a ladder, since images are the simplest kind of "data artifacts" that are involved in this kind of work. I have much to say on this topic, but I think what I've written above gives an overview of the main ideas.
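As a sketch of the auxiliary garbage-collection script from the first elaboration above (the directory layout is an assumption):

```wl
(* Sketch: delete image files that are no longer referenced by any markdown file.
   Assumes images live in an "images" directory next to the markdown documents. *)
collectGarbageImages[docsDir_String] := Module[{markdownText, existing, referenced},
  markdownText = StringJoin[Import[#, "Text"] & /@ FileNames["*.md", docsDir, Infinity]];
  existing = FileNames["*.png", FileNameJoin[{docsDir, "images"}]];
  referenced = Select[existing, StringContainsQ[markdownText, FileNameTake[#]] &];
  DeleteFile /@ Complement[existing, referenced]
]
```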