Errors when parsing markdown files

dieghernan commented 3 years ago

Hi,

Since upgrading to v1.9000.9000.9000 I see an issue on the homepage (parsed from the README):


#> HernangÃ³mez D (2021). _giscoR: Download Map Data from GISCO API

In previous versions it was rendered as "Hernangómez", without escaping. Other characters such as "©" are now rendered as "Â©".
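For context, this pattern is what you typically get when UTF-8 bytes are decoded as Latin-1 or Windows-1252. A minimal base-R sketch of the effect (this is not pkgdown code, and the variable names are only for illustration):

```r
# "ó" (U+00F3) and "©" (U+00A9), forced to UTF-8
x <- enc2utf8("Hernang\u00f3mez, \u00a9")

# In UTF-8 each of these characters takes two bytes:
# "ó" is c3 b3 and "©" is c2 a9
charToRaw(x)

# Decode those same bytes as if they were Latin-1:
# every byte becomes its own character
garbled <- iconv(x, from = "latin1", to = "UTF-8")
print(garbled)

# The damage is reversible as long as no bytes were dropped:
# convert back to single bytes, then mark them as UTF-8 again
bytes <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(bytes) <- "UTF-8"
repaired <- bytes
print(repaired)
```

Here `garbled` shows the same "Ã³" and "Â©" sequences as the rendered page, and `repaired` matches the original string, which suggests the text is being decoded with the wrong encoding rather than corrupted on disk.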

https://ropengov.github.io/giscoR/

On v1.6.1.9001 this was parsed correctly; see https://dieghernan.github.io/nominatimlite/


citation("nominatimlite")
#> 
#> To cite the 'nominatimlite' package in publications use:
#> 
#> Hernangómez D (2021). _nominatimlite: Interface with 'Nominatim' API
#> Service_. doi: 10.5281/zenodo.5113195 (URL:
#> https://doi.org/10.5281/zenodo.5113195), R package version 0.1.1, <URL:
#> https://dieghernan.github.io/nominatimlite/>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {nominatimlite: Interface with 'Nominatim' API Service},
#>     year = {2021},
#>     note = {R package version 0.1.1},
#>     version = {0.1.1},
#>     author = {Diego Hernangómez},
#>     doi = {10.5281/zenodo.5113195},
#>     url = {https://dieghernan.github.io/nominatimlite/},
#>   }

dieghernan commented 3 years ago

It is not restricted to README, see https://ropengov.github.io/giscoR/LICENSE.html:

Version 3, 29 June 2007
Copyright Â© 2007 Free Software Foundation, Inc.Â <http://fsf.org/>

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble
The GNU General Public License is a free, copyleft license for software and other kinds of works.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a programâto make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.

...

For the developersâ and authorsâ protection, the GPL clearly explains that there is no warranty for this free software. For both usersâ and authorsâ sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions.

maelle commented 3 years ago

The error is somewhere in markdown_path_html() :female_detective:

maelle commented 3 years ago

At

https://github.com/r-lib/pkgdown/blob/1803229326669a2734c5f9ad564a39f0012f6ded/R/markdown.R#L63

The characters are still fine.

After https://github.com/r-lib/pkgdown/blob/1803229326669a2734c5f9ad564a39f0012f6ded/R/markdown.R#L64

they are not.

Maybe the HTML should be read like in update_html().

maelle commented 3 years ago

Note that update_html(), via the functions it uses, assumes UTF-8. If I replace xml <- xml2::read_html(html_path) above with a call that passes the same encoding, things look fine.
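A minimal sketch of passing the encoding explicitly, assuming the xml2 package is installed. The file and its content are made up for illustration and are not taken from pkgdown:

```r
library(xml2)

# Write a UTF-8 HTML fragment with no <meta charset> declaration,
# so the parser has nothing in the file that states the encoding.
html_path <- tempfile(fileext = ".html")
fragment <- enc2utf8("<p>Diego Hernang\u00f3mez, \u00a9 Eurostat</p>\n")
writeBin(charToRaw(fragment), html_path)

# Stating the encoding removes the guesswork for the parser.
doc <- read_html(html_path, encoding = "UTF-8")
txt <- xml_text(xml_find_first(doc, "//p"))
print(txt)
```

What `read_html()` does without the `encoding` argument can depend on the platform and the libxml2 version, so the sketch only covers the explicit case.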

maelle commented 3 years ago

And that used to be the case in the Markdown-transforming function:

https://github.com/r-lib/pkgdown/blob/ff06f4fb444ac4c7cc6219177d87986994026124/R/markdown.R#L27

hadley commented 3 years ago

@maelle do you mind sharing the reprex you presumably created so I can turn it into a test?

maelle commented 3 years ago

@hadley sorry, I simply cloned the repo mentioned in the report. :sweat_smile:

dieghernan commented 3 years ago

Hi @hadley, I prepared a reprex. It seems that changing xml2::read_html(html_path) to xml2::read_html(html_path, encoding = "UTF-8"), as @maelle suggested, may solve the issue. However, encodings are tricky, and I don't fully understand the implications of this change (I hate encodings, by the way):

# Create markdown

tmpmd <- file.path(tempdir(), "temp.md")
file.create(tmpmd)
#> [1] TRUE

# Write my name
writeLines("Diego Hernangómez, © Eurostat", tmpmd)
text <- readLines(tmpmd)

# In the markdown file it is OK
text
#> [1] "Diego Hernangómez, © Eurostat"

# Parse with pkgdown
# Now it's wrong
pkgdown:::markdown_to_html(text)
#> {html_document}
#> <html>
#> [1] <body><p>Diego HernangÃ³mez, Â© Eurostat</p></body>

# Step by step: pkgdown:::markdown_to_html
# https://github.com/r-lib/pkgdown/blob/2720abc02fbddbb761104d44d30ce7a3d0c26812/R/markdown.R#L84-L96
# markdown_to_html <- function(text, dedent = 4) {
#   if (dedent) {
#     text <- gsub(paste0("($|\n)", strrep(" ", dedent)), "\\1", text, perl = TRUE)
#   }
#
#   md_path <- withr::local_tempfile()
#   html_path <- withr::local_tempfile()
#
#   write_lines(text, md_path)
#   convert_markdown_to_html(md_path, html_path)
#
#   xml2::read_html(html_path)
# }

dedent <- 4

# markdown_to_html <- function(text, dedent = 4) {
# Note: dedent is numeric, not logical; `if (dedent)` still runs because R treats non-zero numbers as TRUE
# if (dedent) {
text <- gsub(paste0("($|\n)", strrep(" ", dedent)), "\\1", text, perl = TRUE)
# }

text
#> [1] "Diego Hernangómez, © Eurostat"

md_path <- withr::local_tempfile()
#> Setting deferred event(s) on global environment.
#>   * Execute (and clear) with `withr::deferred_run()`.
#>   * Clear (without executing) with `withr::deferred_clear()`.
html_path <- withr::local_tempfile()

pkgdown:::write_lines(text, md_path)

pkgdown:::convert_markdown_to_html(md_path, html_path)

readLines(html_path)
#> [1] "<p>Diego HernangÃ³mez, Â© Eurostat</p>"

xml2::read_html(html_path)
#> {html_document}
#> <html>
#> [1] <body><p>Diego HernangÃ³mez, Â© Eurostat</p></body>

# Switching to this, as Maelle suggested, works
xml2::read_html(html_path, encoding = "UTF-8")
#> {html_document}
#> <html>
#> [1] <body><p>Diego Hernangómez, © Eurostat</p></body>

# }

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252   
#> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] xml2_1.3.2               knitr_1.36               magrittr_2.0.1          
#>  [4] R.cache_0.15.0           rlang_0.4.11             fastmap_1.1.0           
#>  [7] fansi_0.5.0              stringr_1.4.0            styler_1.6.2            
#> [10] highr_0.9                tools_4.1.0              xfun_0.26               
#> [13] R.oo_1.24.0              utf8_1.2.2               withr_2.4.2             
#> [16] htmltools_0.5.2          ellipsis_0.3.2           yaml_2.2.1              
#> [19] digest_0.6.28            tibble_3.1.4             lifecycle_1.0.1         
#> [22] pkgdown_1.9000.9000.9000 crayon_1.4.1             purrr_0.3.4             
#> [25] R.utils_2.11.0           vctrs_0.3.8              fs_1.5.0                
#> [28] cachem_1.0.6             memoise_2.0.0            glue_1.4.2              
#> [31] evaluate_0.14            rmarkdown_2.11           reprex_2.0.1            
#> [34] stringi_1.7.4            compiler_4.1.0           pillar_1.6.3            
#> [37] backports_1.2.1          R.methodsS3_1.8.1        pkgconfig_2.0.3

Created on 2021-10-04 by the reprex package (v2.0.1)

r-lib / pkgdown

Errors when parsing markdown files #1800