tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org

read_csv() outputs the tibble multiplication sign as invalid UTF-8 character #547

Closed aalexandersson closed 7 years ago

aalexandersson commented 7 years ago

read_csv() version 1.0 outputs the tibble multiplication sign as an invalid UTF-8 character. A typical example of the output is at http://readr.tidyverse.org/reference/read_delim.html :

> # A tibble: 32 × 11

The multiplication sign character in read_csv output such as the above is extended ASCII, but it should be either plain ASCII or Unicode UTF-8. In UTF-8 encoding, the character is displayed as xD7, and pandoc gives the error message

"Cannot decode byte '\xd7': Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream"
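To illustrate why a lone xD7 byte triggers this error, here is a minimal sketch in Python (chosen only for illustration; it is not part of readr, tibble, or pandoc). It assumes the output was written in a single-byte encoding such as Latin-1, which is a guess about the Windows setup rather than something confirmed in this thread. The helper `is_valid_utf8` is a hypothetical name introduced for the example.

```python
# U+00D7 MULTIPLICATION SIGN, the character printed between tibble dimensions
times = "\u00d7"

# Assumption: a single-byte "extended ASCII" encoding such as Latin-1 was used
latin1_bytes = times.encode("latin-1")  # one byte: 0xD7
utf8_bytes = times.encode("utf-8")      # two bytes: 0xC3 0x97


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


header_latin1 = b"# A tibble: 32 \xd7 11"
header_utf8 = "# A tibble: 32 \u00d7 11".encode("utf-8")

print(latin1_bytes, is_valid_utf8(header_latin1))
print(utf8_bytes, is_valid_utf8(header_utf8))
```

In UTF-8, 0xD7 can only start a two-byte sequence, so a 0xD7 followed by a space is rejected by a strict decoder, while the two-byte form 0xC3 0x97 decodes to the same multiplication sign.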

This is a problem for pandoc on Windows only. I tried pandoc versions 1.13.1 and 1.18. I mentioned the problem on Statalist and wondered whether it was a problem with Stata's user-written program MarkDoc, which is Stata's equivalent of R Markdown. The programmer of MarkDoc concluded that read_csv should have avoided the invalid UTF-8 character, and I agree. The Statalist URL is http://www.statalist.org/forums/forum/general-stata-discussion/general/1355554-markdoc-manual-gui?p=1362612#post1362612

What is the rationale for using extended ASCII instead of plain ASCII or UTF-8 for the tibble multiplication sign? Given (1) the compatibility problems with pandoc on Windows and with dependent programs such as Stata's MarkDoc, (2) the lack of any need for extended ASCII, and (3) the obvious, easy fix, I assume this issue was simply overlooked. The problem does not occur with R's read.csv, only with your read_csv (and partially also with fread).

hadley commented 7 years ago

This seems like a tibble problem?