quarto-dev / quarto-cli

Open-source scientific and technical publishing system built on Pandoc.
https://quarto.org

gt table docx rendering produces uncorrect output #7151

Open xtimbeau opened 11 months ago

xtimbeau commented 11 months ago

### Bug description

When rendering a qmd with a gt table and a table label, the docx output is faulty: Word reports that there is a problem with the file, and the table appears strangely formatted in the docx. Removing the label from the code chunk brings rendering back to normal.

### Steps to reproduce

Faulty:

````qmd
---
title: "gt table"
author : "moi"
format:
  docx: default
---

```{r, echo=FALSE, message=FALSE, warning=FALSE}
#| label: tbl-table1
library(tidyverse)
library(gt)
table <- tribble(~a, ~b, ~c,
                "text", 1, "Long text aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaargh",
                "others", 2, "Long text aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaargh")
gt(table)
```
````

Working (but with no label) :
````qmd
---
title: "gt table"
author : "moi"
format:
  docx: default
---
```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(gt)
table <- tribble(~a, ~b, ~c,
                "text", 1, "Long text aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaargh",
                "others", 2, "Long text aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaargh") 
gt(table)
```
````

### Expected behavior

Correct formatting of the gt table in the docx, with the label working for cross-references and the caption.

### Actual behavior

Error in rendering, wrong formatting.

### Your environment

IDE RStudio 2023-09
Win 11
Quarto 1.4.398

### Quarto check output

````bash
Quarto 1.4.398
[>] Checking versions of quarto binary dependencies...
      Pandoc version 3.1.8: OK
      Dart Sass version 1.55.0: OK
      Deno version 1.33.4: OK
[>] Checking versions of quarto dependencies......OK
[>] Checking Quarto installation......OK
      Version: 1.4.398
      Path: C:\Users\timbe\AppData\Local\Programs\Quarto\bin
      CodePage: 1252

[>] Checking tools....................OK
      TinyTeX: (external install)
      Chromium: (not installed)

[>] Checking LaTeX....................OK
      Using: TinyTex
      Path: C:\Users\timbe\AppData\Roaming\TinyTeX\bin\windows\
      Version: 2023

[>] Checking basic markdown render....OK

[>] Checking Python 3 installation....OK
      Version: 3.11.2
      Path: C:/Users/timbe/AppData/Local/Programs/Python/Python311/python.exe
      Jupyter: 5.3.0
      Kernels: python3

(/) Checking Jupyter engine render....0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
(/) Checking Jupyter engine render....0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
[>] Checking Jupyter engine render....OK

[>] Checking R installation...........OK
      Version: 4.3.1
      Path: C:/PROGRA~1/R/R-43~1.1
      LibPaths:
        - C:/Users/timbe/RLibs/4.3
        - C:/Program Files/R/R-4.3.1/library
      knitr: 1.44
      rmarkdown: 2.25

[>] Checking Knitr engine render......OK
````

mcanouil commented 11 months ago

Thanks for the report! What's the version of gt you used here? Also, don't mix syntaxes to pass options. Quarto strongly recommends using YAML-style options.

````qmd
---
title: "gt table"
format: docx
---

```{r}
#| label: tbl-table1
#| echo: false
#| message: false
#| warning: false
gt::gt(tibble::tribble(
  ~a, ~b, ~c,
  "text", 1, "Long text aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaargh",
  "others", 2, "Long text aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaargh"
))
```
````

xtimbeau commented 11 months ago

Thanks, the gt version is the GitHub dev version (0.9.0.9000), but I tried with gt 0.9 with the same results. OK for the unmixed syntax. As a matter of fact, I first tried mixing syntaxes to see if the problem persisted (including putting the label in the chunk header `{}`).

mcanouil commented 11 months ago

The issue is not only the table: the whole thing produces a bad Word document, at least using the latest development version.

image
xtimbeau commented 11 months ago

Sure, but if you click Yes, you then get a Word doc with a bad table.


mcanouil commented 11 months ago

Well, if the document is malformed, there is no reason to expect the content to be good.

cderv commented 11 months ago

> Removing the label from the code chunk brings rendering back to normal.

It seems to me the gt output is exactly the same. The only change between the two renders is the `#tbl-table1` identifier on the intermediate cell div. This triggers the crossref code path, where something unexpected seems to happen.

@cscheid I wonder if we have something similar to figures, where the new crossref system expects a certain Markdown structure and the knitr output does not provide it (or it is simply not accounted for by the new crossref).

Let me know if I need to adjust anything on the R side.

cscheid commented 11 months ago

I'm almost certain this is just a bug in the new crossref code. I'll handle it this week.

BMC1986 commented 11 months ago

Getting the same issue with flextable:

````qmd
---
title: "FlextableBug"
format: docx
---

## Quarto

```{r}
#| label: tbl-Biology
#| tbl-cap: A nice caption
#| echo: false

library(flextable)

ft <- flextable(head(mtcars))
ft
```
````

rmflight commented 10 months ago

I'm experiencing the same bug with a recent Quarto daily version (I was hoping everything was fixed after the closing of other issues I saw). In my experience, the contents of both the gt and flextable tables are fine, but something is wrong with the formatting that Word is expecting, so it "fixes" it.

For example, here is a table that is "broken" by Word fixes, from including a label and caption:

mangled_table

And here is what it looks like without a label and caption:

word_fine

I have a repo with testing both flextable and gt here.

cderv commented 10 months ago

@rmflight I believe we are tracking that here

Something we do creates bad Open XML content that triggers Word to "fix" the document. We are still trying to identify what.

jkrumbiegel commented 7 months ago

I've been looking at generating tables for docx from Julia, and also noticed that my docx files became invalid when I added table captions to the generating cells. I looked at the generated XML and found that captions were implemented by nesting the content inside a table. This means my docx table ended up as the only content of a table cell object in the XML code. This led me to a Stack Overflow post (https://stackoverflow.com/questions/4485225/openxml-nested-tables), according to which a nested table object needs to be followed by an empty paragraph. Once I added that, Word opened the docx file correctly. The remaining issue was that the inner table is somehow very narrow and long, which might be caused by some style settings on the outer caption table.
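
The workaround described above can be sketched as a post-processing step on `word/document.xml`, the main body part inside a `.docx` archive. This is only a minimal illustration in Python, not Quarto's or gt's actual fix: the helper name `pad_nested_tables` is hypothetical, and it assumes the only problem is a table cell (`w:tc`) whose last child is a nested table (`w:tbl`) with no paragraph after it.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def pad_nested_tables(document_xml):
    """Append an empty <w:p/> to every table cell that ends with a nested table.

    Hypothetical helper for illustration: returns the patched XML string
    and the number of cells that were changed.
    """
    root = ET.fromstring(document_xml)
    # Collect the cells first so the tree is not modified while iterating.
    cells = list(root.iter(f"{{{W_NS}}}tc"))
    patched = 0
    for cell in cells:
        children = list(cell)
        # A cell whose last block-level child is a table is what Word rejects.
        if children and children[-1].tag == f"{{{W_NS}}}tbl":
            cell.append(ET.Element(f"{{{W_NS}}}p"))
            patched += 1
    return ET.tostring(root, encoding="unicode"), patched
```

A real `document.xml` declares many more namespaces than `w`, and ElementTree may rename their prefixes when writing the tree back out, so a production fix would need a more careful XML round trip. The sketch is mainly useful for checking whether a rendered file contains cells that end with a nested table (a non-zero count).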

mcanouil commented 7 months ago

@jkrumbiegel Please provide a reproducible example of your case. Also, your case might not be the same, since you talked about tables generated with Julia.

Could you share a small self-contained "working" (reproducible) example to work with, i.e., a complete Quarto document or a Git repository? Thanks.

You can share a Quarto document using the following syntax, i.e., using more backticks than you have in your document (usually four ````).

````qmd
---
title: "Reproducible Quarto Document"
format: html
engine: knitr
---

This is a reproducible Quarto document using `format: html`.
It is written in Markdown and contains embedded R code.
When you run the code, it will produce a plot.

```{r}
plot(cars)
```

A placeholder image

The end.
````

jkrumbiegel commented 7 months ago

I was just trying to help the bug finding process in this thread, it could be that tables rendered to docx are invalid with captions because of the empty paragraph issue I mentioned above.

mcanouil commented 7 months ago

Sharing a reproducible example of your case can help to track down the exact root cause if it is a cross-engine issue.

PS: I am not sure a 13-year-old thread is really up to date with Microsoft Word's internal XML specification.

jkrumbiegel commented 7 months ago

Sorry, I don't think you understand what I was trying to do here. I came across this issue trying to find information on a related problem, and I just solved a very similar issue to the one described in this thread for myself (making tables spliced into docx files work with quarto table captions). That I did the generation of the table with Julia is kind of irrelevant. The missing empty paragraph in the openxml markup was the reason that Microsoft Word complained. So if that info helps you here, that's cool, if not, then that's ok as well.

mcanouil commented 7 months ago

I see. Indeed, I did not understand that you had solved an issue and were simply reporting how. Thanks for sharing!

lorenzoFabbri commented 6 months ago

Is there any update on this? Right now I need to upload the docx to Google Docs to be able to read it properly (there, the tables are well formatted).

mcanouil commented 6 months ago

Thank you for your interest in the issue.

There is no need to ask for updates. Updates are provided when there are any to provide; this can be comments or the issue being closed. As you can see, neither of those things has happened.

lorenzoFabbri commented 6 months ago

> Thank you for your interest in the issue.
>
> There is no need to ask for update.
>
> Updates are provided when there are ones to provide, this can be comments or the issue being closed, as you can see neither of those things happened.

Okay thanks. Based also on how you reply to many other users, I see you keep being fairly passive aggressive.

mcanouil commented 6 months ago

> I see you keep being fairly passive aggressive.

I apologise, there is no intent from me there (my intent was simply to be factual nothing more, nothing less), i.e., I am not a native English speaker, so it seems the tone is not correct unfortunately ...

cscheid commented 6 months ago

> I see you keep being fairly passive aggressive.

That's not an acceptable comment. Please refer to our code of conduct https://github.com/quarto-dev/quarto-cli?tab=coc-ov-file#readme

To be specific, the comment you made is a personal attack: the comment refers to the person rather than a specific action.

lorenzoFabbri commented 6 months ago

I will follow "Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience", but many times there was a lack of "Demonstrating empathy and kindness toward other people".

schwa021 commented 6 months ago

This bug still exists. I'm going to try and give a simple reproducible example below. I'm sure I'll do something wrong and get the usual wrist-slap - but I'll try anyway.

````qmd
---
title: "Trying to Make a Nice Table Using gt and MS Word"

format:
  docx:
    toc: false
    number-sections: true
    number-depth: 1
    highlight-style: github
    reference-doc: word-template.docx
    fig-dpi: 600

---

{{< pagebreak >}}

# Background

Make a table. Try to reference it (@tbl-test). Then print it out

```{r}
#| label: tbl-test

library(tidyverse)
library(gt)

tbldat <-
  mtcars %>%
  select(c(cyl, hp, mpg)) %>%
  rownames_to_column() |>
  rename(Car = rowname) |>
  slice_sample(n=10)

tbl <-
  tbldat %>%
  gt() %>%
  tab_header(title = "Propensity Model Performance") %>%
  tab_style(
    style = cell_text(align = "left"),
    locations = cells_title()
  ) %>%
  cols_label(
    cyl = "Cylinder",
    hp = "Horsepower",
    mpg = "*Miles* per Gallon",
    .fn = md
  ) %>%
  cols_align(
    align = "center",
    columns = c(cyl, hp)
  ) %>%
  tab_style(
    style = cell_text(weight = 625),
    locations = cells_column_labels()
  ) %>%
  tab_style(
    style = cell_text(style = "italic"),
    locations = cells_body(columns = Car)
  ) %>%
  tab_source_note(source_note = md("**AUC** - Area under Receiver Operating Characteristic Curve, **PPV** - Positive Predictive Value")) %>%
  opt_stylize(3) |>
  tab_options(
    table_body.hlines.style = "solid",
    table_body.hlines.width = 1,
    table_body.hlines.color = "#cccccc"
  )

tbl
```
````

The file renders, but when you try to open the document you get the following:

"Word found unreadable content in test.docx. Do you want to recover the contents of this document? If you trust the source of this document, click Yes."

image

If you click "Yes" (I trust myself so I did that) - then you get the following message:

"This document contains fields that may refer to other files. Do you want to update the fields in this document?"

image

I do! I do! So I click "Yes" again. This is getting exciting.

Now I get a popup message from Word with the title "Show Repairs". It says "Errors were detected in this file, but Word was able to open the file by making the repairs listed below. Save the file to make the repairs permanent."

Oh, thank you, Word, for making those repairs. But alas, upon looking at the document, the repairs were 💩.

image

The table looks terrible. Unformatted. Squeezed. Etc...

image

UPDATE:

Removing the chunk label lets Word create a table, but it is not properly formatted and cannot be referenced in the text - so, that's not good.

image

Note that it should look like this: image

cscheid commented 6 months ago

No wrist slap, but rather a general reminder that if the issue is open, that means that we are aware that the bug is still happening.

schwa021 commented 6 months ago

> No wrist slap, but rather a general reminder that if the issue is open, that means that we are aware that the bug is still happening.

Apologies. I now see that I should have realized this was still an open issue.

I confess to getting confused in some of the thread's back-and-forth, citing other issues, and so on. Also, it wasn't clear to me which bug we are actually talking about - there seem to be several. Removing the label gets rid of the weird MS Word message, but in neither case do you get a properly formatted table. I guess I'll look for that bug somewhere as well.

In any case, I appreciate the responsiveness and the package. There's no doubt that gt is fantastic in HTML, where I use it all the time. But it really doesn't seem to work well at all in Word. Hopefully soon!!!

cderv commented 6 months ago

> Removing the label gets rid of the weird MS Word message

Removing the label means that cross-reference processing is not applied to this table, as it is not needed, and it is this processing that creates malformed Open XML somewhere. So you don't get the message because no processing happens, but cross-referencing is not possible either.

> but in neither case do you get a properly formatted table.

In the Update part of your post (https://github.com/quarto-dev/quarto-cli/issues/7151#issuecomment-2007979180), you shared "Note that it should look like this". How did you determine that it should look like this? Is this what you expect from your word-template.docx? This could be another issue with Quarto, but it could also be related to the reference doc configuration. You should try with bare pandoc to see if a markdown table is rendered as the table you expect. It is also possible that gt is outputting raw output for docx, so you won't get the style from your reference doc, but the style defined by gt.

Anyhow, for this one, if you think this is a bug, please do open a new issue. Thank you!

schwa021 commented 6 months ago

@cderv, regarding my comment

> Note that it should look like this

I was basing this on what the table looks like in html - which I guess is the "native" format.

Sorry if I don't use the exact right words (e.g. "native", "should look like"). I am just a simple researcher, not a computer scientist/software developer. I have been scolded several times for mentioning this apparently irrelevant fact - but I am trying to explain why I may appear confused at times, and pleading for patience and guidance.

For example, you wrote:

> You should try with bare pandoc to see if a markdown table is rendered as the table you expect. It is also possible that gt is outputting raw output for docx, so you won't get the style from your reference doc, but the style defined by gt.

I don't actually know what a "bare pandoc" is. I am trying to follow the instructions in quarto for creating a docx (i.e., using a word template) and gt (i.e. for making a table look pretty). Maybe the two don't go together?

In any case, as the reprex shows, the table you get in MS Word looks bad (i.e., all formatting is lost). This has been my universal experience with gt --> docx. Basically, it appears to me that output to MS Word simply doesn't work - you get a very basic unformatted table (at best).

If you believe the issue of all formatting being lost is a "new" bug, I am happy to create another thread.

mcanouil commented 6 months ago

FYI @schwa021: there is no "native" format, or more accurately, the native format is "native", which is the AST representation of a document and is agnostic to the actual output format (you can try it for yourself by setting `format: native` to see what it is). Even if the Quarto team is trying to get visually similar output across formats, you should not expect that, as LaTeX/PDF, Typst/PDF, Docx, HTML, etc. are very different technologies/markups.

mcanouil commented 6 months ago

> If you believe the issue of all formatting being lost is a "new" bug, I am happy to create another thread.

Remove all custom options such as `reference-doc: word-template.docx`, which might not be correct, i.e., use only `format: docx`. If you can replicate the issue, then open a new issue with a small reprex and without all the custom stuff.

cderv commented 6 months ago

@schwa021 Don't feel sorry - I am just asking for clarification. Your answers are perfectly fine to me. I don't expect you to talk like a computer software developer. On the contrary, I am trying to understand your feedback as a simple user. So this is all good! It is me who is using terms that are not adequate in this conversation (I should not have expected you to understand "bare pandoc" - let's forget that).

There are a lot of tools in the stack (quarto, markdown, pandoc, R, gt, docx), so it is sometimes hard to follow everything. I believe the confusion may come from how the tools work.

Let me try to clarify.

> This has been my universal experience with gt --> docx. Basically, it appears to me that output to MS Word simply doesn't work - you get a very basic unformatted table (at best).

I don't think gt will by default give you the same output styling in HTML and in docx. So you see a difference in Quarto output because of that. You will get the same difference in R Markdown when using gt.

So this difference and confusion in styling comes from gt, and Quarto can't really do anything, as gt is providing raw output (HTML code, or Open XML code).

If you want exactly the same table as in HTML for docx output I think you will need to export as image and use that exported image in the docx. But this won't be a markdown table.

Otherwise, if docx is your primary format, you may need to consider another table package like flextable, which could have more of the styling options you are looking for.

I believe there are currently some related issues about this at https://github.com/rstudio/gt/issues?q=is%3Aopen+label%3A%22Focus%3A+Word+Output%22+sort%3Aupdated-desc

I hope this helps you understand. Sorry if I am still using terms that are too technical. They should not be needed for such a discussion.

Thanks a lot for your feedback as a simple user, BTW - we need that too!

schwa021 commented 6 months ago

@cderv, Thank you for the quick and clear reply. I think you may have explained a major misunderstanding I had.

When I read the gt documentation, and looked at the examples, there was a LOT of emphasis on formatting to create beautiful tables. In fact, when doing this for output to html - it is terrific. I love the tables that are produced.

Then I read that gt "supported" output to docx. I understood this to mean it would be like output to html (i.e., it would look nice). Apparently, my understanding was wrong. As you wrote below:

> I don't think gt will by default give you the same output styling in HTML and in docx. So you see a difference in Quarto output because of that. You will get the same difference in R Markdown when using gt.
>
> So this difference and confusion in styling comes from gt, and Quarto can't really do anything, as gt is providing raw output (HTML code, or Open XML code).
>
> If you want exactly the same table as in HTML for docx output I think you will need to export as image and use that exported image in the docx. But this won't be a markdown table.
>
> Otherwise, if docx is your primary format, you may need to consider another table package like flextable, which could have more of the styling options you are looking for.

My new understanding is that I simply cannot make a nice-looking docx table using gt without saving it as an image. The issue with that is that many (most) scientific journals will not accept that. BTW - I would much rather be working in LaTeX/PDF, but I work in the medical field where, sadly, docx has 99.99% market penetration.

> Thanks a lot for your feedback as a simple user, BTW - we need that too!

My general feedback is that the Posit products are amazing for me. I really appreciate the opportunity to use powerful and practical software like this for free. I have also found the help on these github pages to be fairly useful - though, as we have discussed here - sometimes I get lost in the technical terminology. But, that is a "me problem". I assume there is a different level/type of support for paying customers.

Perhaps the gt documentation could make it clear somewhere that you should not expect nice looking output in docx format.

lorenzoFabbri commented 6 months ago

@schwa021 To be completely honest, the tables I used to get with gt and docx, while not as nice as the HTML versions, were perfectly fine. And if I upload the docx to Google Drive and open the document, they are also rather acceptable. I'll say more: when opening the docx with LibreOffice on Ubuntu, they also look nice! I found this issue with Word just recently because my co-authors pointed out that the tables were completely messed up. So while I tend to agree that gt and docx might not work out perfectly, it used to work just fine.

schwa021 commented 6 months ago

> @schwa021 To be completely honest, the tables I used to get with gt and docx, while not as nice as the HTML versions, were perfectly fine. And if I upload the docx to Google Drive and open the document, they are also rather acceptable. I'll say more: when opening the docx with LibreOffice on Ubuntu, they also look nice! I found this issue with Word just recently because my co-authors pointed out that the tables were completely messed up. So while I tend to agree that gt and docx might not work out perfectly, it used to work just fine.

I agree.

image

rich-iannone commented 6 months ago

@schwa021 this is getting maybe a bit off topic for Quarto but I invite you to open a new issue or discussion in the gt repo.

We’ve been working on making the non-HTML formats capture and display more of the styles declared but it’s an ongoing process (though we’ve essentially caught up with LaTeX, so this is happening).

sda030 commented 3 months ago

Sorry for crashing the discussion, but after reading between the lines, I think it is worth recognizing that we as enthusiastic users are a bit like early adopters of the first (free) iPhone, nagging that, despite the 100 new features, we wish there was just a better cable connector to a Windows PC, preferably yesterday, because we started bragging about the new phone at work, haha. Kind of the pitfall of success. So thank you very much for your hard work, guys (you too, @rich-iannone).

> I apologise, there is no intent from me there (my intent was simply to be factual nothing more, nothing less), i.e., I am not a native English speaker, so it seems the tone is not correct unfortunately ...

That made me reflect a lot. Easy to get caught up in a "nerdy debate". <3

olivroy commented 1 month ago

@rich-iannone Is there anything that can be done in gt to address this? A gt table saved to docx with gtsave() usually takes up the full width, while when outputting to Quarto, the table is squished.

Maybe we could check for recent quarto and tweak the openxml string as necessary?

This is https://github.com/rstudio/gt/issues/1679

sda030 commented 1 month ago

@olivroy, they are doing their best handling many issues - updates will come when the issue is resolved. These comments take time away from coding.

I do understand your despair though, and I personally am only waiting on this one limitation of gt/quarto. I could have financially rewarded a solution, paid for by our company. But money and pressure are rarely what's needed, only time.

mcanouil commented 1 month ago

@sda030 You misunderstood the intent here. @olivroy contributed a lot lately to gt and is very likely asking how to help by suggesting an approach.

sda030 commented 1 month ago

Oh, my bad, sorry.