tidyverse / tibble

A modern re-imagining of the data frame
https://tibble.tidyverse.org/

tibble multiplication sign is invalid UTF-8 character #216

Closed aalexandersson closed 7 years ago

aalexandersson commented 7 years ago

The tibble multiplication sign is an invalid UTF-8 character. Here is a typical example output from http://readr.tidyverse.org/reference/read_delim.html :

> # A tibble: 32 × 11

The multiplication sign character in read_csv outputs such as above is extended ASCII but it should be either in plain ASCII or in Unicode UTF-8. In UTF-8 encoding, the character is displayed as xD7 but pandoc gives the error message

"Cannot decode byte '\xd7': Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream"

This is a problem for pandoc on Windows only. I tried pandoc version 1.13.1 and 1.18. I mentioned the problem on Statalist and wondered if it was a problem with Stata's user-written program "Markdoc", which is Stata's equivalent program to R Markdown. The user-programmer of MarkDoc concluded that read_csv should have avoided the invalid UTF-8 character, and I agree. The Statalist URL is http://www.statalist.org/forums/forum/general-stata-discussion/general/1355554-markdoc-manual-gui?p=1362612#post1362612
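The byte in pandoc's error message can be inspected from R. The following is only a minimal sketch, assuming an R session of version 3.3.0 or later (for validUTF8()); it is not taken from readr or tibble. It shows that U+00D7 is the single byte d7 in Latin-1 but the two bytes c3 97 in UTF-8, and that a lone d7 byte is not valid UTF-8:

```r
# U+00D7 MULTIPLICATION SIGN, written as an escape so this script stays ASCII.
times <- "\u00d7"

# UTF-8 encodes the character as two bytes (c3 97).
utf8_bytes <- charToRaw(enc2utf8(times))
print(utf8_bytes)

# Latin-1 (and Windows-1252) encode it as the single byte d7.
latin1_bytes <- iconv(times, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]
print(latin1_bytes)

# A lone d7 byte is not a valid UTF-8 sequence; the two-byte form is.
latin1_is_valid <- validUTF8(rawToChar(latin1_bytes))
utf8_is_valid <- validUTF8(rawToChar(utf8_bytes))
print(c(latin1 = latin1_is_valid, utf8 = utf8_is_valid))
```

If this holds, the character itself is fine in UTF-8 and the error would come from the file being stored in a single-byte encoding while pandoc reads it as UTF-8.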

What is the rationale for using extended ASCII instead of plain ASCII or UTF-8 for the tibble multiplication sign? Given (1) the compatibility problems with pandoc on Windows and with dependent programs such as Stata's MarkDoc, (2) there being no need for extended ASCII, and (3) the existence of an obvious, easy fix, I assume this issue was simply overlooked. The problem occurs with R's read_csv(), but in tidyverse/readr#547 hadley closed the bug and suggested that this is a tibble problem instead.

aalexandersson commented 7 years ago

Fixed typo in last sentence: Changed wickham to hadley.

krlmlr commented 7 years ago

Thanks. Could you please post the .Rmd file you use for testing on Windows, just to make sure we're on the same page?

aalexandersson commented 7 years ago

Here is the .Rmd file:

---
title: "Test of tibble in R Markdown"
author: "Anders Alexandersson"
date: "January 25, 2017"
output: html_document
---

knitr::opts_chunk$set(echo = TRUE)

System Information

This is some system information.

sessionInfo()
library(readr)
rmarkdown::pandoc_version()

Create tibble output

This creates some tibble output.

read_csv("auto.txt")

Test tibble output using pandoc

This is a test of tibble using pandoc. How can pandoc be run from R? In Stata's Rcall command it is automated. To reproduce the error in R, I copy-pasted the above output into Notepad, which defaults to ANSI encoding, and saved the file as "output.txt". Then, from the command prompt where pandoc is installed, I typed

pandoc Markdown.txt -o Word.docx

I saved the error message as "error_message.png".
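One possible workaround for the Notepad step is to re-encode the ANSI file as UTF-8 before handing it to pandoc. The sketch below is an illustration only: it uses temporary files rather than the files named above, and it assumes the ANSI code page agrees with Latin-1 for this character (Windows-1252 does, since both use the byte d7).

```r
# Simulate a file saved as ANSI: the header line with the sign as the byte d7.
ansi_file <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("# A tibble: 32 "), as.raw(0xd7), charToRaw(" 11\n")),
  ansi_file
)

# Read the lines as they are, then convert them from Latin-1 to UTF-8.
ansi_lines <- readLines(ansi_file, warn = FALSE)
utf8_lines <- iconv(ansi_lines, from = "latin1", to = "UTF-8")

# Write the converted lines byte-for-byte so no further re-encoding happens.
utf8_file <- tempfile(fileext = ".txt")
writeLines(utf8_lines, utf8_file, useBytes = TRUE)

# Inspect the result: the sign should now be stored as the bytes c3 97.
utf8_bytes <- readBin(utf8_file, what = "raw", n = file.size(utf8_file))
print(utf8_bytes)
```

The converted file could then be passed to pandoc in place of the ANSI one.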

Screenshot of error message

The same problem was also reported by another user on Stack Overflow at http://stackoverflow.com/questions/26492750/using-imported-utf-8-character-in-knitr-with-r

Here is the error message: error_message

aalexandersson commented 7 years ago

I am not allowed to paste HTML output. For you to see my R output, I attach the PDF output. test_tibble.pdf

thibautjombart commented 7 years ago

I can confirm this bug. The 'x' is the culprit. Here is a short Rnw file with a reproducible example:

\documentclass{article}
\usepackage[utf8]{inputenc}

\begin{document}

This will generate an error when compiling the \texttt{tex}.

<<test>>=
library(tibble)
as_tibble(cars)
@ 

The error I get on linux, and some colleagues on Mac, is:
\begin{verbatim}
> knit2pdf("test.Rnw")

processing file: test.Rnw
  |......................                                           |  33%
  ordinary text without R code

  |...........................................                      |  67%
label: test
  |.................................................................| 100%
  ordinary text without R code

output file: test.tex

Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  : 
  Running 'texi2dvi' on 'test.tex' failed.
LaTeX errors:
! Package inputenc Error: Unicode char \u8:× not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...      
\end{verbatim}

\end{document}

NikNakk commented 7 years ago

On Windows 10, R 3.3.3, rmarkdown 1.4, tibble 1.3.0.9000 I am unable to reproduce this with either Rmd or Rnw. However, if I use rmarkdown::render("file", clean = FALSE) and use the non-UTF8 Md file of the two generated, I can get pandoc to produce the error indicated. There doesn't, however, seem to be anything wrong as such with the code in tibble.

krlmlr commented 7 years ago

@yihui: Is there a way to determine the expected encoding for console output for a knitr or rmarkdown run? Or do we just assume UTF-8?

tibble is printing a multiplication sign which requires Unicode and seems to break knitr documents in some cases.
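As a rough illustration of the trade-off, a print method could fall back to a plain ASCII x whenever UTF-8 output is not known to be safe. The helpers below, times_sign() and dim_header(), are hypothetical names and not tibble's actual implementation. Note that l10n_info() reports only the R session's locale, which is not necessarily the encoding that knitr or pandoc expects.

```r
# Hypothetical helper: choose the sign depending on whether UTF-8 is assumed.
# The default looks only at the session locale reported by l10n_info().
times_sign <- function(utf8 = isTRUE(l10n_info()[["UTF-8"]])) {
  if (utf8) "\u00d7" else "x"
}

# Hypothetical helper: build a header line like the one tibble prints.
dim_header <- function(n_rows, n_cols,
                       utf8 = isTRUE(l10n_info()[["UTF-8"]])) {
  paste0("# A tibble: ", n_rows, " ", times_sign(utf8), " ", n_cols)
}

# Force the ASCII fallback regardless of the current locale.
ascii_header <- dim_header(32, 11, utf8 = FALSE)
cat(ascii_header, "\n")
```

Always passing utf8 = FALSE corresponds to reverting to the letter x everywhere, which sidesteps the encoding question entirely.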

thibautjombart commented 7 years ago

The weird thing is that my system is using utf8, and other non-ascii characters seem to do just fine. In the example provided the encoding is declared when loading the inputenc package in the LaTeX header (\usepackage[utf8]{inputenc}).

yihui commented 7 years ago

I received a similar report recently about the multiplication sign: https://github.com/yihui/knitr/issues/1389 but I could not reproduce it on Windows.

I guess @thibautjombart's problem is that he didn't tell knitr the encoding was supposed to be UTF-8 (which is the default on *nix but not Windows): knit2pdf("test.Rnw", encoding = "UTF-8").

I'd recommend that you just use the letter x instead of the fancy Unicode character... Character encoding problems on Windows are forever pain.

krlmlr commented 7 years ago

@hadley: Okay to revert to plain ASCII x?

thibautjombart commented 7 years ago

@yihui nope, my native encoding is utf-8 (I'm on linux). Adding the option hasn't changed the error. I can reproduce the error on the current rocker/verse docker image too:

File toto.Rnw saved
root@0aee4758d237:~# R

R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> knitr::knit2pdf("toto.Rnw")

processing file: toto.Rnw
  |......................                                           |  33%
  ordinary text without R code

  |...........................................                      |  67%
label: test
  |.................................................................| 100%
  ordinary text without R code

output file: toto.tex

Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  : 
  Running 'texi2dvi' on 'toto.tex' failed.
LaTeX errors:
! Package inputenc Error: Unicode char \u8:× not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              
> 

Also note that this character is used in the print method for tibble objects. I am not using it otherwise.

thibautjombart commented 7 years ago

For what it's worth, this is what emacs thinks of this character:

             position: 1 of 2 (0%), column: 0
            character: × (displayed as ×) (codepoint 215, #o327, #xd7)
    preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0xD7
               script: latin
               syntax: _    which means: symbol
             category: .:Base, c:Chinese, h:Korean, j:Japanese, l:Latin
             to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
          buffer code: #xC3 #x97
            file code: #xC3 #x97 (encoded by coding system utf-8-unix)
              display: by this font (glyph code)
    xft:-PfEd-DejaVu Sans Mono-normal-normal-normal-*-19-*-*-*-m-0-iso10646-1 (#x99)

Seems like a valid UTF-8 character to my (naive) eye.

hadley commented 7 years ago

@krlmlr yeah, it's not worth the hassle.
