quarto-dev / quarto-cli

Open-source scientific and technical publishing system built on Pandoc.
https://quarto.org
Other
3.98k stars 328 forks source link

Trim blank line around raw block with table before doing HTML parsing #9867

Open cderv opened 5 months ago

cderv commented 5 months ago

From #9681, issue was fixed by removing the rawblock - though I understand now why there was a new slides I think

In this parsed HTML tables

```{=html}
<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Sepal.Length </th>
   <th style="text-align:right;"> Sepal.Width </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 5.1 </td>
   <td style="text-align:right;"> 3.5 </td>
  </tr>
</tbody>
</table>

There is a blank line in the raw HTML block. This is what caused the insertion of the html RawBlock in our processing 

And it seems it lead to a new slide in Powerpoint.

Just sharing this in case another fix would be to trim blank lines around html content inside a rawblock before parsing the HTML in our processing;

Originally posted by @cderv in https://github.com/quarto-dev/quarto-cli/issues/9681#issuecomment-2117086665

cscheid commented 5 months ago

Unfortunately, I don't think we can generally remove whitespace from HTML documents, because it has layout implications. That is, we can't look at

```{=html}


And choose to remove that.
cderv commented 5 months ago

Let me remind the context of this issue, so that we are on the same page.

The discussion at #9681 ended up at your comment https://github.com/quarto-dev/quarto-cli/issues/9681#issuecomment-2117775452

Just sharing this in case another fix would be to trim blank lines around html content inside a rawblock before parsing the HTML in our processing;

This is a good idea.

I opened this issue to track this idea.

So the idea initially shared by me was that when we parse HTML table with a blank like at top or bottom of the raw block , we end up in the AST with

  - t: "RawBlock"
    format: "html"
    text: "\n"

If we don't remove that (for non HTML format), then we rely on the fact that Pandoc writers do the right thing to ignore the RawBlock.

Though, it seems it was not correctly done in powerpoint, and this is what messed up Powerpoint output in #9681

It seems to me that trimming start and end of the HTML content of our parsed raw block was something we could do quite safely. But this is not something to do for all the RawBlock ! This idea here is only for our table parsing processing (handle_raw_html_as_table) where we do create raw block with the remaining content

In the example above, a new line so

  - t: "RawBlock"
    format: "html"
    text: "\n"

Anyhow, if this is not a good idea, I am happy to close. The issue is fixed for powerpoint. I just opened this following your comment in #9681 to try prevent any other issue like this by us producing this raw html block.

cscheid commented 5 months ago

Sorry, I should have been clearer.

I'm trying to say that, by itself, we can't simply remove a rawblock that has a single newline, because it's possible that someone added that rawblock with a newline for a reason (since it can affect HTML output).

In the context of our own processing, I do think it makes sense to not emit a rawblock with a single newline in the process of splitting rawblocks that we know we want to treat.

In other words, the fix shouldn't be to add a separate filter that drops all rawblocks that only have whitespace; the fix is to change the table processing code directly.

cderv commented 5 months ago

In other words, the fix shouldn't be to add a separate filter that drops all rawblocks that only have whitespace; the fix is to change the table processing code directly.

Oh perfect. We are on the same page ! 💯 agree with that, that was my plan.