Better `span.markdown-embed` parity with native Obisidan print to pdf

dleeftink commented 5 months ago

This took a while to track down, after noticing my @media print { ... } css files where rendering differently using your plugin compared to Obsidian's native print functionality.

Issue: broken block contexts

A better (but still not ideal) solution to have span.markdown-embeds behave similarly to native Obsidian, requires nested block-level elements to be inserted after the document.body.innerHTML has been set in line 206. Namely, certain block level elements (divs, blockquotes, etc.) force a break in the block context of a parent element (most notably <p>) which messes up the resulting HTML if set using document.body.innerHTML:

<p>
  Hello
  <span class="internal-embed">
    <div>broken</div>
  </span>
  world
</p>

Leads to the following output with orphaned nodes when parsed by .innerHTML:

<p>
  Hello
  <span class="internal-embed"></span>
</p>
<div>broken</div>
world
<p></p>

A possible solution is base64 encoding the content of span.markdown-embed elements starting from the last element by changing

https://github.com/l1xnan/obsidian-better-export-pdf/blob/16229e564fd85a0c3a7764f8ed905e708aba8784/src/render.ts#L234-L246

to this encoding function and then recursively decoding the base64 encoded elements after document.body.innerHTML has been set:

https://github.com/l1xnan/obsidian-better-export-pdf/blob/16229e564fd85a0c3a7764f8ed905e708aba8784/src/modal.ts#L206

using this decoding function.

Workaround

This workaround maintains nested block level elements without forcing breaks in the block context of the parent element and messing up the layout. The following functions are added

encodeEmbeds(doc)
- Starts from the last element using [...document.querySelectorAll('span.markdown-embeds')].reverse();
- Encodes the innerHTML of span.markdown-embed elements using span.innerHTML = btoa(span.innerHTML)
decodeAndReplaceEmbed(element)
- Utility within the appendWebview function
- Decodes the innerHTML of span.markdown-embeds recursively

<??>

Note: the encodeEmbeds(doc) function may need to be rewritten to encode the deepest nested (sibling) elements first instead of starting with the last element. However, simply reversing the order of iteration seems fine in my testing, as deeply nested elements always come later when selected using document.querySelectorAll().

dleeftink commented 5 months ago

Before: broken blocks

betterPDF(24 04 18)nesting(issue)

After: original blocks

betterPDF(24 04 18)nesting(issue)fix

l1xnan commented 5 months ago

Before: broken blocks

After: original blocks

Can you compare it with the official export effect?

dleeftink commented 5 months ago

As, display without the css borders? Or the result from the 'Export/Debug' actions?

l1xnan commented 5 months ago

As, display without the css borders? Or the result from the 'Export/Debug' actions?

Use the official Export to PDF feature to generate the PDF, then compare the results with the current output. We aim to maintain consistency with the official rendering as much as possible.

dleeftink commented 5 months ago

Will do. Here is the prettified HTML from the debugger in the meantime (notice the difference at 'some trailing text'):

Before

<div class="print">
  <div class="markdown-preview-view markdown-rendered show-properties">
    <h3 data-heading="hello">
      hello<a href="af://h3-0" class="md-print-anchor"></a>
    </h3>
    \n
    <p>
      Paragraph 1 with
      <span
        alt="bpdf(nest) > Nested block"
        src="bpdf(nest)#Nested block"
        class="internal-embed markdown-embed inline-embed is-loaded"
      ></span>
    </p>
    <div class="markdown-embed-title"></div>
    <div class="markdown-preview-view markdown-rendered show-indentation-guide">
      <h2 data-heading="Nested block">
        Nested block<a href="af://h2-1" class="md-print-anchor"></a>
      </h2>
      \n
      <p>Embedded paragraph.</p>
    </div>
    some trailing text. <!--broken block-->
    <p></p>
    \n
    <p>Paragraph 2</p>
    \n
    <h2 data-heading="world">
      world<a href="af://h2-2" class="md-print-anchor"></a>
    </h2>
    \n
    <p>Paragraph 3</p>
  </div>
</div>

After

<div class="print">
  <div class="markdown-preview-view markdown-rendered show-properties">
    <h3 data-heading="hello">
      hello<a href="af://h3-0" class="md-print-anchor"></a>
    </h3>
    \n
    <p>
      Paragraph 1 with
      <span
        alt="bpdf(nest) > Nested block"
        src="bpdf(nest)#Nested block"
        class="internal-embed markdown-embed inline-embed is-loaded"
        ><div class="markdown-embed-title"></div>
        <div class="markdown-preview-view markdown-rendered show-indentation-guide">
          <h2 data-heading="Nested block">
            Nested block<a href="af://h2-1" class="md-print-anchor"></a>
          </h2>
          \n
          <p>Embedded paragraph.</p>
        </div></span
      >
      some trailing text. <!--original block-->
    </p>
    \n
    <p>Paragraph 2</p>
    \n
    <h2 data-heading="world">
      world<a href="af://h2-2" class="md-print-anchor"></a>
    </h2>
    \n
    <p>Paragraph 3</p>
  </div>
</div>

l1xnan commented 5 months ago

Great, could you please test non-ASCII characters?

dleeftink commented 5 months ago

Here is the PDF output:

Before

betterPDF(24 04 18)nesting(issue)native-vs-plugin(before)

After

betterPDF(24 04 18)nesting(issue)native-vs-plugin(after)

dleeftink commented 5 months ago

Great, could you please test non-ASCII characters?

I have added two btoa_utf8 and atob_utf8 utility functions to handle unicode characters.

betterPDF(24 04 18)nesting(issue)fix(unicode)

dleeftink commented 5 months ago

In general however, I think it would be better to move away from this.preview.executeJavascript(...) to build the PDF DOM. This would require a more significant rewrite along the lines of:


// Setting up containers
const printEl = document.body.createDiv("print");
const viewEl = printEl.createDiv({
  cls: "markdown-preview-view markdown-rendered " + cssclasses.join(" ")
});

...

// Rendering markdown
await MarkdownRenderer.render(app, lines.join("\n"), viewEl, file.path, comp);
await MarkdownRenderer.postProcess(...);

...

// Cloning elements                                  
const printDoc = document.implementation.createHTMLDocument("document");
printDoc.body.appendChild(printEl.cloneNode(true));

// Copy Obsidian styles
printDoc.head =  document.head.innerHTML;

// Preparing data transfer
const blob = new Blob([printDoc], {
  type: "text/html",
});

// Create object URL
const printUrl = window.URL.createObjectURL(blob);

...

// Transfer HTML to print preview
this.preview.loadUrl(printUrl);
this.preview.webContents.once("did-finish-load", async () => { 

  // Clean up after --> not sure URL's can be revoked from within another webview
  URL.revokeObjectURL(printUrl);
  ... additional logic ...

})

But even then, the issue of .innerHTML breaking the block context of nested blocks may persist. The reason why it works in native Obisidian is that embeds are inserted after the initial DOM has been laid out, which circumvents block context breakage. Whether this method used by the Obsidian devs follows the proper HTML5 spec is a discussion I'll leave for another day...

dleeftink commented 5 months ago

In fact, the way Obsidian handles internal embeds via nested divs is inherently incompatible with how pagedjs lays out pages, as its layouting algorithm requires HTML to adhere to the spec. Even when implementing a mechanism where block level elements are replaced with inline elements, this would result in user css styles likely becoming incompatible with the post-processed DOM.

I've been investigating integrating pagedjs and Obsidian for a while now, and am happy to draw up a roadmap or set of requirements if you're interested.

l1xnan commented 5 months ago

In fact, the way Obsidian handles internal embeds via nested divs is inherently incompatible with how pagedjs lays out pages, as its layouting algorithm requires HTML to adhere to the spec. Even when implementing a mechanism where block level elements are replaced with inline elements, this would result in user css styles likely becoming incompatible with the post-processed DOM.

I've been investigating integrating pagedjs and Obsidian for a while now, and am happy to draw up a roadmap or set of requirements if you're interested.

I initially attempted to integrate Paged.js and Obsidian, but failed (blank pages couldn't display). Do you have any good ideas?

dleeftink commented 5 months ago

I initially attempted to integrate Paged.js and Obsidian, but failed (blank pages couldn't display). Do you have any good ideas?

The latest beta has a lot of pagination fixes, but still has some issues when elements have display: none, overflow: hidden or nested <br> elements inside complex trees (e.g. tables). Generally, pagedjs works best with a 'flat' DOM structure where each element has an intrinsic width and height (hence why Obsidian embeds/code block replacements may cause issues).

However, the devs are quick to respond to your particular issues as I haven't encountered the blank page issue previously. My guess is that the chunker and paginator have not properly finished. It is also good practice to isolate the pagedJSPolyFill inside a webview or <iframe> and have it run unaffected by styles and scripts present in the parent context.

In my view, the biggest hurdle to make pagedJS work with Obsidian is agreeing on how nested markdown is transformed into a single document: how to signal which notes are chapters, sections or paragraphs? How is the table of contents generated? How to best collect footnotes/sidesnotes and multi-column markdown? Resolve inconsistent heading levels? Merge/split list and tables? Dataview elements? The list goes on.

My current approach is generating a DAG and an AST of each note in the DAG and have a list of custom markdown transformers that are able to handle various cases. Making this work for all edge cases is no small feat.

l1xnan / obsidian-better-export-pdf