r-lib / urlchecker

Run CRAN URL checks from older versions of R
https://urlchecker.r-lib.org/
GNU General Public License v3.0

urlchecker does not check URLs in plain text in Rmd vignettes #4

Closed maelle closed 3 years ago

maelle commented 3 years ago

Example: https://github.com/maelle/example

See https://github.com/maelle/example/blob/8dea0ff13c7ad607d329838ec682e962415c3636/vignettes/lala.Rmd#L10 for a plain-text URL that is converted to an actual link when knitting, but not by the conversion done in urlchecker.

devtools::check(
  "/home/maelle/Documents/R-hub/example",
  manual = TRUE,
  remote = TRUE,
  incoming = TRUE
  )
#> Updating example documentation
#> Loading example
#> Warning: [/home/maelle/Documents/R-hub/example/R/foo.R:8] @examples requires a
#> value
#> Warning: The existing 'NAMESPACE' file was not generated by roxygen2, and will
#> not be overwritten.
#> ── Building ───────────────────────────────────────────────────────── example ──
#> Setting env vars:
#> ● CFLAGS    : -Wall -pedantic
#> ● CXXFLAGS  : -Wall -pedantic
#> ● CXX11FLAGS: -Wall -pedantic
#> ────────────────────────────────────────────────────────────────────────────────
#>   ✓  checking for file ‘/home/maelle/Documents/R-hub/example/DESCRIPTION’
#>   ─  preparing ‘example’:
#>   ✓  checking DESCRIPTION meta-information
#>   ─  installing the package to build vignettes
#>   ✓  creating vignettes (1.2s)
#>   ─  checking for LF line-endings in source and make files and shell scripts
#>   ─  checking for empty or unneeded directories
#>   ─  building ‘example_0.1.0.tar.gz’
#>      
#> ── Checking ───────────────────────────────────────────────────────── example ──
#> Setting env vars:
#> ● _R_CHECK_CRAN_INCOMING_USE_ASPELL_: TRUE
#> ● _R_CHECK_CRAN_INCOMING_REMOTE_    : TRUE
#> ● _R_CHECK_CRAN_INCOMING_           : TRUE
#> ● _R_CHECK_FORCE_SUGGESTS_          : FALSE
#> ● NOT_CRAN                          : true
#> ── R CMD check ─────────────────────────────────────────────────────────────────
#> * using log directory ‘/tmp/RtmpR7Gepa/example.Rcheck’
#> * using R version 4.0.2 (2020-06-22)
#> * using platform: x86_64-pc-linux-gnu (64-bit)
#> * using session charset: UTF-8
#> * using option ‘--as-cran’
#> * checking for file ‘example/DESCRIPTION’ ... OK
#> * checking extension type ... Package
#> * this is package ‘example’ version ‘0.1.0’
#> * package encoding: UTF-8
#> * checking CRAN incoming feasibility ... NOTE
#> Maintainer: ‘The package maintainer <yourself@somewhere.net>’
#> 
#> New submission
#> 
#> Found the following (possibly) invalid URLs:
#>   URL: https://masalmon.eu/405
#>     From: inst/doc/lala.html
#>     Status: 404
#>     Message: Not Found
#>   URL: https://masalmon.eu/406
#>     From: man/foo.Rd
#>     Status: 404
#>     Message: Not Found
#> 
#> DESCRIPTION fields with placeholder content:
#>   Title: what the package does (title case)
#>   Author: Who wrote it
#>   Maintainer: The package maintainer <yourself@somewhere.net>
#>   Description: more about what it does (maybe more than one line) use
#>     four spaces when indenting paragraphs within the description.
#> * checking package namespace information ... OK
#> * checking package dependencies ... OK
#> * checking if this is a source package ... OK
#> * checking if there is a namespace ... OK
#> * checking for executable files ... OK
#> * checking for hidden files and directories ... OK
#> * checking for portable file names ... OK
#> * checking for sufficient/correct file permissions ... OK
#> * checking serialization versions ... OK
#> * checking whether package ‘example’ can be installed ... OK
#> * checking installed package size ... OK
#> * checking package directory ... OK
#> * checking for future file timestamps ... OK
#> * checking ‘build’ directory ... OK
#> * checking DESCRIPTION meta-information ... OK
#> * checking top-level files ... NOTE
#> Non-standard file/directory found at top level:
#>   ‘docs’
#> * checking for left-over files ... OK
#> * checking index information ... OK
#> * checking package subdirectories ... OK
#> * checking R files for non-ASCII characters ... OK
#> * checking R files for syntax errors ... OK
#> * checking whether the package can be loaded ... OK
#> * checking whether the package can be loaded with stated dependencies ... OK
#> * checking whether the package can be unloaded cleanly ... OK
#> * checking whether the namespace can be loaded with stated dependencies ... OK
#> * checking whether the namespace can be unloaded cleanly ... OK
#> * checking loading without being on the library search path ... OK
#> * checking use of S3 registration ... OK
#> * checking dependencies in R code ... OK
#> * checking S3 generic/method consistency ... OK
#> * checking replacement functions ... OK
#> * checking foreign function calls ... OK
#> * checking R code for possible problems ... OK
#> * checking Rd files ... OK
#> * checking Rd metadata ... OK
#> * checking Rd line widths ... OK
#> * checking Rd cross-references ... OK
#> * checking for missing documentation entries ... OK
#> * checking for code/documentation mismatches ... OK
#> * checking Rd \usage sections ... OK
#> * checking Rd contents ... OK
#> * checking for unstated dependencies in examples ... OK
#>  WARNING
#> ‘qpdf’ is needed for checks on size reduction of PDFs
#> * checking installed files from ‘inst/doc’ ... OK
#> * checking files in ‘vignettes’ ... OK
#> * checking examples ... OK
#> * checking for unstated dependencies in vignettes ... OK
#> * checking package vignettes in ‘inst/doc’ ... OK
#> * checking re-building of vignette outputs ... OK
#> * checking PDF version of manual ... WARNING
#> LaTeX errors when creating PDF version.
#> This typically indicates Rd problems.
#> * checking PDF version of manual without hyperrefs or index ... OK
#> * checking for non-standard things in the check directory ... NOTE
#> Found the following files/directories:
#>   ‘example-manual.tex’
#> * checking for detritus in the temp directory ... OK
#> * DONE
#> 
#> Status: 2 WARNINGs, 3 NOTEs
#> See
#>   ‘/tmp/RtmpR7Gepa/example.Rcheck/00check.log’
#> for details.
#> 
#> 
#> ── R CMD check results ────────────────────────────────────── example 0.1.0 ────
#> Duration: 26s
#> 
#> > checking for unstated dependencies in examples ... OK
#>    WARNING
#>   ‘qpdf’ is needed for checks on size reduction of PDFs
#> 
#> > checking PDF version of manual ... WARNING
#>   LaTeX errors when creating PDF version.
#>   This typically indicates Rd problems.
#> 
#> > checking CRAN incoming feasibility ... NOTE
#>   Maintainer: ‘The package maintainer <yourself@somewhere.net>’
#>   
#>   New submission
#>   
#>   Found the following (possibly) invalid URLs:
#>     URL: https://masalmon.eu/405
#>       From: inst/doc/lala.html
#>       Status: 404
#>       Message: Not Found
#>     URL: https://masalmon.eu/406
#>       From: man/foo.Rd
#>       Status: 404
#>       Message: Not Found
#>   
#>   DESCRIPTION fields with placeholder content:
#>     Title: what the package does (title case)
#>     Author: Who wrote it
#>     Maintainer: The package maintainer <yourself@somewhere.net>
#>     Description: more about what it does (maybe more than one line) use
#>       four spaces when indenting paragraphs within the description.
#> 
#> > checking top-level files ... NOTE
#>   Non-standard file/directory found at top level:
#>     ‘docs’
#> 
#> > checking for non-standard things in the check directory ... NOTE
#>   Found the following files/directories:
#>     ‘example-manual.tex’
#> 
#> 0 errors ✓ | 2 warnings x | 3 notes x
#> Error: R CMD check found WARNINGs

urlchecker::url_check("/home/maelle/Documents/R-hub/example")
#> x Error: man/foo.Rd:13:11 404: Not Found
#> See \href{https://masalmon.eu/406}{something} and \samp{localhost:1313}.
#>           ^~~~~~~~~~~~~~~~~~~~~~~

Created on 2020-11-20 by the reprex package (v0.3.0.9001)

maelle commented 3 years ago

I was hoping it might be a Pandoc argument but I can't find any that would transform plain http into links. There's a regex somewhere, but not sure where (Pandoc, rmarkdown).

maelle commented 3 years ago

If you knit an R Markdown file to html_document/pdf_document, plain URLs are linked.

maelle commented 3 years ago

aaaah so the autolinking comes from an option called... autolink in the markdown package.

maelle commented 3 years ago

From https://github.com/rstudio/markdown/blob/5abbfaec56cabf1b59ee5d72640d83b30dd71172/inst/examples/markdownExtensions.R#L41

cat(markdown::markdownToHTML(text = "https://www.r-project.org/", extensions = c()))
#> <!DOCTYPE html>
#> <html>
#> <head>
#> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
#> 
#> <title></title>
#> 
#> <script type="text/javascript">
#> window.onload = function() {
#>   var imgs = document.getElementsByTagName('img'), i, img;
#>   for (i = 0; i < imgs.length; i++) {
#>     img = imgs[i];
#>     // center an image if it is the only element of its parent
#>     if (img.parentElement.childElementCount === 1)
#>       img.parentElement.style.textAlign = 'center';
#>   }
#> };
#> </script>
#> 
#> 
#> 
#> 
#> 
#> <style type="text/css">
#> body, td {
#>    font-family: sans-serif;
#>    background-color: white;
#>    font-size: 13px;
#> }
#> 
#> body {
#>   max-width: 800px;
#>   margin: auto;
#>   padding: 1em;
#>   line-height: 20px;
#> }
#> 
#> tt, code, pre {
#>    font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
#> }
#> 
#> h1 {
#>    font-size:2.2em;
#> }
#> 
#> h2 {
#>    font-size:1.8em;
#> }
#> 
#> h3 {
#>    font-size:1.4em;
#> }
#> 
#> h4 {
#>    font-size:1.0em;
#> }
#> 
#> h5 {
#>    font-size:0.9em;
#> }
#> 
#> h6 {
#>    font-size:0.8em;
#> }
#> 
#> a:visited {
#>    color: rgb(50%, 0%, 50%);
#> }
#> 
#> pre, img {
#>   max-width: 100%;
#> }
#> pre {
#>   overflow-x: auto;
#> }
#> pre code {
#>    display: block; padding: 0.5em;
#> }
#> 
#> code {
#>   font-size: 92%;
#>   border: 1px solid #ccc;
#> }
#> 
#> code[class] {
#>   background-color: #F8F8F8;
#> }
#> 
#> table, td, th {
#>   border: none;
#> }
#> 
#> blockquote {
#>    color:#666666;
#>    margin:0;
#>    padding-left: 1em;
#>    border-left: 0.5em #EEE solid;
#> }
#> 
#> hr {
#>    height: 0px;
#>    border-bottom: none;
#>    border-top-width: thin;
#>    border-top-style: dotted;
#>    border-top-color: #999999;
#> }
#> 
#> @media print {
#>    * {
#>       background: transparent !important;
#>       color: black !important;
#>       filter:none !important;
#>       -ms-filter: none !important;
#>    }
#> 
#>    body {
#>       font-size:12pt;
#>       max-width:100%;
#>    }
#> 
#>    a, a:visited {
#>       text-decoration: underline;
#>    }
#> 
#>    hr {
#>       visibility: hidden;
#>       page-break-before: always;
#>    }
#> 
#>    pre, blockquote {
#>       padding-right: 1em;
#>       page-break-inside: avoid;
#>    }
#> 
#>    tr, img {
#>       page-break-inside: avoid;
#>    }
#> 
#>    img {
#>       max-width: 100% !important;
#>    }
#> 
#>    @page :left {
#>       margin: 15mm 20mm 15mm 10mm;
#>    }
#> 
#>    @page :right {
#>       margin: 15mm 10mm 15mm 20mm;
#>    }
#> 
#>    p, h2, h3 {
#>       orphans: 3; widows: 3;
#>    }
#> 
#>    h2, h3 {
#>       page-break-after: avoid;
#>    }
#> }
#> </style>
#> 
#> 
#> 
#> </head>
#> 
#> <body>
#> <p>https://www.r-project.org/</p>
#> 
#> </body>
#> 
#> </html>
cat(markdown::markdownToHTML(text = "https://www.r-project.org/", extensions = c("autolink")))
#> <!DOCTYPE html>
#> <html>
#> <head>
#> [<meta>, <title>, <script>, and <style> boilerplate identical to the first output above]
#> </head>
#> 
#> <body>
#> <p><a href="https://www.r-project.org/">https://www.r-project.org/</a></p>
#> 
#> </body>
#> 
#> </html>

Created on 2020-11-20 by the reprex package (v0.3.0.9001)

maelle commented 3 years ago

So maybe it'd make sense to use something like commonmark::markdown_html("https://masalmon.eu", extensions = TRUE) instead of the tools conversion?
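To make the suggestion concrete, here is a minimal sketch comparing commonmark's output with and without its extensions, assuming the commonmark package is installed. It only illustrates the autolinking behaviour; it is not how urlchecker converts files.

```r
library(commonmark)

url <- "https://masalmon.eu"

# Default conversion: the bare URL stays plain text inside a paragraph.
plain <- markdown_html(url)

# extensions = TRUE turns on the GFM extensions, including autolink,
# so the bare URL is wrapped in an <a href="..."> element.
linked <- markdown_html(url, extensions = TRUE)

cat(plain)
cat(linked)
```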

maelle commented 3 years ago

that is, if one assumes most vignettes will be knitr+rmarkdown vignettes.

maelle commented 3 years ago

For the record, I said something wrong: the autolinking in built vignettes comes from a Pandoc extension, see https://rmarkdown.rstudio.com/html_fragment_format.html#Markdown_Extensions
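The extension is switched on through the input format string that rmarkdown hands to Pandoc. A small sketch for inspecting that string, assuming the rmarkdown package is installed; the exact string may differ between rmarkdown versions, so no output is shown here.

```r
library(rmarkdown)

# Input format string rmarkdown passes to Pandoc by default.
default_format <- rmarkdown_format()
print(default_format)

# Extra extension strings are appended to the format; in Pandoc's syntax
# a "-ext" suffix switches the extension "ext" off.
without_autolink <- rmarkdown_format("-autolink_bare_uris")
print(without_autolink)
```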

jimhester commented 3 years ago

Do you get CRAN failures with this code?

maelle commented 3 years ago

well the NOTE is the same as for other invalid URLs so I guess it can be problematic? I heard of this via https://github.com/ropensci/dev_guide/issues/281#issuecomment-704338449

maelle commented 3 years ago

To prevent Pandoc from autolinking bare URLs one can use

output:
  rmarkdown::html_vignette:
    md_extensions: [
      "-autolink_bare_uris"
    ]

maelle commented 3 years ago

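But if you don't do that and there is an invalid plain URL in a vignette, R CMD check reports it in the NOTE (see the inst/doc/lala.html entry in the output above), while urlchecker does not.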

jimhester commented 3 years ago

OK, however I don't want to use commonmark to do this. I think we need to try to keep the code as close to the base R code as we can to ensure reproducibility. It sounds like we need to pass -autolink_bare_uris to the pandoc call.
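For reference, a minimal sketch of what toggling this extension does in a Pandoc call, assuming Pandoc and the rmarkdown package are installed. `md_to_html()` is a hypothetical helper written for this illustration, not part of urlchecker or base R. In Pandoc's format syntax, `+autolink_bare_uris` enables the extension and `-autolink_bare_uris` disables it.

```r
library(rmarkdown)

# Hypothetical helper: convert one Markdown string to an HTML fragment
# using the given Pandoc input format.
md_to_html <- function(text, from) {
  input <- tempfile(fileext = ".md")
  output <- tempfile(fileext = ".html")
  on.exit(unlink(c(input, output)), add = TRUE)
  writeLines(text, input)
  pandoc_convert(input, to = "html", from = from, output = output)
  paste(readLines(output, warn = FALSE), collapse = "\n")
}

bare_url <- "https://www.r-project.org/"

# Extension enabled: the bare URL becomes an <a href="..."> link.
html_on <- md_to_html(bare_url, from = "markdown+autolink_bare_uris")

# Extension disabled: the bare URL stays plain text.
html_off <- md_to_html(bare_url, from = "markdown-autolink_bare_uris")

cat(html_on, "\n")
cat(html_off, "\n")
```

Only the conversion with the extension enabled yields a link that a URL checker working on the HTML would pick up.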