Improve convert image snapshots

jheer commented 2 years ago

There are some lingering issues with snapshots of page elements. The PDF output is inconsistent with the PNG/JPG output. Each has strengths and weaknesses. Ideally we would get consistent output with all the strengths and none of the weaknesses...

The bitmap snapshots do not perform resizing, so they can capture unnecessary white space from parent container elements. The PDF output, in contrast, transforms element sizes by re-styling block elements to use display: inline-block, which ensures the element sizing is driven by child content. It also re-styles margins to avoid undesirable clipping.
The PDF output uses extracted HTML and styles; however, this is not sufficient to capture the page content, as the extracted HTML may not re-generate the correct page state. For example, an extracted HTML canvas will not include the canvas pixel content or rendering code, resulting in a blank canvas. Similarly, extracted form elements may not preserve the current values of the form input elements.
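To make the canvas and form-input problem concrete, here is a rough sketch of how live state could be copied into attributes before HTML is extracted. It is only an illustration, not what the converter currently does: `LiveElement` and `freezeLiveState` are hypothetical names, and `LiveElement` is a stand-in that models only the few element members the sketch touches (in a real page, `toDataURL` is the canvas method of that name and `value` is the input's live value).

```typescript
// Hypothetical stand-in for a DOM element: only the members used below.
interface LiveElement {
  tagName: string;
  attributes: Record<string, string>;
  value?: string;            // live value of a form input, if any
  toDataURL?: () => string;  // present on canvas elements
  children: LiveElement[];
}

// Return a copy of the tree in which live state is stored in attributes,
// so that serializing the copy no longer loses that state.
function freezeLiveState(el: LiveElement): LiveElement {
  const attributes: Record<string, string> = { ...el.attributes };
  let tagName = el.tagName;

  if (tagName === 'canvas' && el.toDataURL) {
    // Swap the canvas for an image that holds its current pixels.
    tagName = 'img';
    attributes['src'] = el.toDataURL();
  } else if (tagName === 'input' && el.value !== undefined) {
    // The value attribute only reflects the initial value; copy the live one.
    attributes['value'] = el.value;
  }

  return { tagName, attributes, children: el.children.map(freezeLiveState) };
}
```

In a browser, `toDataURL` throws for a canvas tainted by cross-origin content, so a real version would need a fallback for that case.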

We could consider an alternative approach that uses the same preparations for both vector and bitmap outputs. We would want to load the page and capture the "live" page state. One idea is to inject JS code into the loaded page to change styles, hide non-snapshot content (e.g., display: none), and perform sizing / margin adjustments. We could then take a (bounding box cropped) PDF or bitmap screenshot. It would be ideal to avoid re-loading the page for each snapshot, so we could look at ways to apply and then undo such styling transformations. Either way, as a subsequent optimization, a conversion plan might also generate a filtered AST with only the elements we want to snapshot, thereby avoiding processing and rendering all the other page contents.
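As a rough sketch of the apply-then-undo idea: the function below hides everything outside the snapshot target, restyles the target, and returns a function that restores every style it changed, so the same loaded page could be reused for the next snapshot. The names `SketchNode`, `setStyle`, and `isolate` are hypothetical, and `SketchNode` is a plain-object stand-in for a page element; injected code would operate on real elements and their inline styles instead.

```typescript
// Hypothetical stand-in for a page element with inline styles.
interface SketchNode {
  style: Record<string, string>;
  children: SketchNode[];
}

// Set one style property and record how to restore its previous state.
function setStyle(node: SketchNode, prop: string, value: string, log: Array<() => void>): void {
  const prev: string | undefined = node.style[prop];
  node.style[prop] = value;
  log.push(() => {
    if (prev !== undefined) node.style[prop] = prev;
    else delete node.style[prop];
  });
}

// True if target is node itself or one of its descendants.
function contains(node: SketchNode, target: SketchNode): boolean {
  return node === target || node.children.some((c) => contains(c, target));
}

// Hide non-snapshot content, size the target by its content, and
// return an undo function that reverts all changes in reverse order.
function isolate(root: SketchNode, target: SketchNode): () => void {
  const log: Array<() => void> = [];
  const visit = (node: SketchNode): void => {
    if (node === target) {
      setStyle(node, 'display', 'inline-block', log);
      setStyle(node, 'margin', '0', log);
      return; // leave the target's own subtree untouched
    }
    if (!contains(node, target)) {
      setStyle(node, 'display', 'none', log);
      return; // hiding a node hides its descendants too
    }
    node.children.forEach(visit); // ancestor of the target: keep visible
  };
  visit(root);
  return () => {
    log.reverse().forEach((restore) => restore());
    log.length = 0;
  };
}
```

Calling `isolate(root, target)`, taking the cropped screenshot, and then calling the returned function would leave the page as it was. This sketch only tracks inline styles, so it says nothing about styles that come from stylesheets.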

@mathisonian Any reactions or other ideas?

mathisonian commented 2 years ago

@jheer hiding all of the non-snapshot content on the full page seems like a reasonable approach, and the ability to avoid page reloads is nice.

Does this solve the issue with canvases, etc. in PDF? Or would we need some logic to preserve that?

mathisonian commented 2 years ago

I've pushed some initial work in https://github.com/uwdata/living-papers-testbed/pull/17, although I'm still thinking through some concerns about preserving element styles when hiding non-snapshot content (more details in the PR).

uwdata / living-papers

Improve convert image snapshots #14