quarto-dev / quarto-cli

Open-source scientific and technical publishing system built on Pandoc.
https://quarto.org

Quarto discards complex HTML table elements #4419

Closed petrbouchal closed 1 year ago

petrbouchal commented 1 year ago

Bug description

I am trying to include the output of {pointblank} validation, which is a complex HTML table based on {gt}. A reprex qmd file looks like this:

````md
---
title: "Untitled"
format: html
---

```{r setup}
library(pointblank)
mtcars |>
  create_agent() |>
  interrogate()
```
````

When rendered via `quarto render`, the result looks like this:

![image](https://user-images.githubusercontent.com/1666657/219859499-892ee40c-ebc3-4f39-b822-158f594e9180.png)

This differs from what is rendered in R Markdown and in the inline output of the Quarto document: for example, the complex heading structure is reduced to a single div, and other layout and style features are lost. More complex tables (with steps in the validation output) come out completely garbled.

However, when I set `keep-md: true` like so

````md
---
title: "Untitled"
format: 
  html:
    keep-md: true
---
````

and then render the kept markdown file via `pandoc --to html test.md -o test.html`, the HTML table in the output is correct.

Simple {gt} tables render correctly in the normal quarto rendering workflow.

This leads me to guess that the problem might be in the default filters that quarto applies during the pandoc rendering step.
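To make the comparison between the two outputs concrete, here is a minimal sketch that counts table-related tags in two HTML strings and reports which ones are missing from the second. It only uses Python's standard `html.parser`; the function names `count_table_tags` and `missing_table_tags` are made up for illustration and are not part of Quarto or Pandoc.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that make up an HTML table's structure.
TABLE_TAGS = {"table", "thead", "tbody", "tfoot", "tr", "th", "td"}


class TableTagCounter(HTMLParser):
    """Counts opening tags that belong to HTML tables."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        if tag in TABLE_TAGS:
            self.counts[tag] += 1


def count_table_tags(html):
    """Return a Counter of table-related opening tags found in `html`."""
    parser = TableTagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def missing_table_tags(reference_html, candidate_html):
    """Return {tag: how many fewer} for tags rarer in the candidate."""
    ref = count_table_tags(reference_html)
    cand = count_table_tags(candidate_html)
    return {tag: ref[tag] - cand[tag] for tag in ref if ref[tag] > cand[tag]}


# Hypothetical inputs standing in for the pandoc and quarto outputs.
reference = (
    "<table><thead><tr><th>a</th></tr></thead>"
    "<tbody><tr><td>1</td></tr></tbody></table>"
)
candidate = "<div>a 1</div>"
report = missing_table_tags(reference, candidate)
```

In practice one would pass the contents of the file produced by `pandoc --to html test.md -o test.html` as the reference and the file produced by `quarto render` as the candidate. A non-empty result suggests table structure was dropped somewhere in the Quarto pipeline.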

Versions

Quarto version: 1.3.203

RStudio version 2023.03.0-daily+323 (2023.03.0-daily+323)

Session Info

R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pointblank_0.11.3 reprex_2.0.2      devtools_2.4.3    usethis_2.1.6    

loaded via a namespace (and not attached):
 [1] pillar_1.8.1      compiler_4.2.1    prettyunits_1.1.1 remotes_2.4.2     tools_4.2.1       testthat_3.1.6   
 [7] digest_0.6.31     pkgbuild_1.3.1    pkgload_1.2.4     gtable_0.3.1      tibble_3.1.8      memoise_2.0.1    
[13] lifecycle_1.0.3   pkgconfig_2.0.3   rlang_1.0.6       DBI_1.1.3         cli_3.6.0         rstudioapi_0.14  
[19] xfun_0.36         fastmap_1.1.0     withr_2.5.0       dplyr_1.0.10      knitr_1.42        desc_1.4.2       
[25] generics_0.1.3    fs_1.6.1          vctrs_0.5.2       grid_4.2.1        tidyselect_1.2.0  rprojroot_2.0.3  
[31] prompt_1.0.1      glue_1.6.2        R6_2.5.1          processx_3.6.1    fansi_1.0.3       sessioninfo_1.2.2
[37] blastula_0.3.3    ggplot2_3.4.0     callr_3.7.0       purrr_1.0.1       magrittr_2.0.3    scales_1.2.1     
[43] ps_1.7.1          clisymbols_1.2.0  ellipsis_0.3.2    htmltools_0.5.4   rsconnect_0.8.27  gt_0.8.0         
[49] assertthat_0.2.1  colorspace_2.0-3  utf8_1.2.2        munsell_0.5.0     cachem_1.0.6      crayon_1.5.2     
[55] brio_1.1.3 


cderv commented 1 year ago

@cscheid for this one, I wonder whether we are correctly handling the parsing of this complex HTML table in our parse_html_tables().

I don't know the internals of those recent changes well, but are we parsing every HTML table in a raw block even when outputting an HTML format?

https://github.com/quarto-dev/quarto-cli/blob/f00d8fd7fee31e2bf5d4e30ec81d87bb2e43479d/src/resources/filters/main.lua#L171

https://github.com/quarto-dev/quarto-cli/blob/f00d8fd7fee31e2bf5d4e30ec81d87bb2e43479d/src/resources/filters/normalize/parsehtml.lua#L6-L13

I was looking into this because I was curious about the new processing. When comparing the intermediate markdown with the HTML output, the table part of the gt_table has completely disappeared from the HTML output. It seems we are not parsing this complex raw HTML table as a Table in the AST.

We also have tableRenderRawHtml() at a later stage, which has some filters regarding HTML-like formats: https://github.com/quarto-dev/quarto-cli/blob/f00d8fd7fee31e2bf5d4e30ec81d87bb2e43479d/src/resources/filters/quarto-pre/table-rawhtml.lua#L47-L56

But this is too late: the raw HTML table was already processed in the other pre step.
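As a way to check whether a table reaches Pandoc as raw HTML or as a native Table node, here is a minimal sketch that walks a Pandoc JSON AST (as produced by `pandoc -t json`) and counts both kinds. It does not run any of Quarto's Lua filters, so it only shows what plain Pandoc sees. The function name `classify_tables` and the sample document are hypothetical.

```python
import json


def classify_tables(node, counts=None):
    """Recursively count raw-HTML tables and native Table nodes in a Pandoc JSON AST."""
    if counts is None:
        counts = {"raw_html_tables": 0, "native_tables": 0}
    if isinstance(node, dict):
        tag = node.get("t")
        content = node.get("c")
        # A RawBlock is encoded as {"t": "RawBlock", "c": [format, text]}.
        if (
            tag == "RawBlock"
            and isinstance(content, list)
            and len(content) == 2
            and content[0] == "html"
            and "<table" in content[1].lower()
        ):
            counts["raw_html_tables"] += 1
        elif tag == "Table":
            counts["native_tables"] += 1
        # Keep walking so blocks nested in Divs and similar containers are found.
        for value in node.values():
            classify_tables(value, counts)
    elif isinstance(node, list):
        for item in node:
            classify_tables(item, counts)
    return counts


# Hand-written sample AST, assumed for illustration only.
sample_ast = json.loads("""
{
  "pandoc-api-version": [1, 23],
  "meta": {},
  "blocks": [
    {"t": "RawBlock", "c": ["html", "<table><tr><td>1</td></tr></table>"]},
    {"t": "Div", "c": [["", [], []], [{"t": "RawBlock", "c": ["html", "<p>hi</p>"]}]]},
    {"t": "Para", "c": [{"t": "Str", "c": "x"}]}
  ]
}
""")
result = classify_tables(sample_ast)
```

If the count of raw HTML tables is non-zero and the count of native tables is zero, the table is still opaque raw HTML at that point, and any later conversion to a Table would have to come from a filter.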

Anyway, these are just some hints. @rich-iannone will obviously know more about the complexity of the HTML table from pointblank.

I am just curious to understand this processing better.

rich-iannone commented 1 year ago

@petrbouchal Sorry for the very late reply on this. With newer versions of Quarto, gt tables all render properly. There are some minor style issues in the pointblank reporting tables in particular (screenshot attached of a recent render), but these have more to do with gt itself and some CSS-integration issues that pointblank could handle (mostly to do with the <code> text).

Here is my recent test with a bigger output table (to test more of what might go wrong) in the very latest (main) of Quarto:

````md
---
format: html
---

```{r setup}
library(pointblank)
agent <-
  create_agent(
    tbl = ~ small_table,
    actions = action_levels(stop_at = 0.1)
  ) %>%
  col_vals_gt(
    vars(date_time), vars(date),
    na_pass = TRUE
  ) %>%
  col_vals_gt(
    vars(b), vars(g), na_pass = TRUE,
    label = "b > g"
  ) %>%
  col_is_character(
    vars(b, f),
    label = "Verifying character-type columns"
  ) %>%
  rows_distinct(
    vars(d, e, f),
    label = "Distinct rows across 'd', 'e', and 'f'"
  ) %>%
  col_is_integer(
    vars(a),
    label = "`a` must be an integer",
    active = FALSE
  ) %>%
  interrogate()

agent
```
````


<img width="1017" alt="pointblank-table-quarto" src="https://github.com/quarto-dev/quarto-cli/assets/5612024/64c4784b-b1da-412b-9ebf-fb28f9eee871">

rich-iannone commented 1 year ago

Going to close this as I think this is resolved (i.e., you can render the output, though refinements need to be made by the developer of the {pointblank} and {gt} packages).