r-lib / pkgdown

Generate static html documentation for an R package
https://pkgdown.r-lib.org/

<doi:> references in the DESCRIPTION #499

Closed: bastistician closed this issue 6 years ago

bastistician commented 6 years ago

If the package does not contain an index.[R]md or README.[R]md file, build_home() falls back to the description field in the package's DESCRIPTION file:

https://github.com/r-lib/pkgdown/blob/c94e6b2d6ecc69c1a2e4093cd5736aa97dac94f9/R/build-home.R#L59

If this text contains <doi:...> (or <arXiv:...>) references, build_site() currently fails with the following error from xml2::read_html() in update_homepage_html():

Name doi:10.[...] is not XML Namespace compliant [202]

Dozens of scientific packages on CRAN use this DOI referencing feature, for example the party package (for a use case of <arXiv:...>, see for example palmtree).

Ideally, build_home() should replace such DOIs and arXiv identifiers by hyperlinks just as on CRAN.

Similarly, the description may contain direct weblinks given in angle brackets (<http://...>, as in, e.g., partykit), which are not parsed correctly (but do not break build_site()).
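
To make the three bracketed forms concrete, here is a minimal sketch that pulls them out of a Description text using base R only. The sample text, the DOI, the arXiv identifier, and the URL are made-up placeholders, and the combined pattern is an assumption for illustration, not what CRAN or pkgdown actually uses.

```r
# Made-up Description text; the DOI, arXiv id, and URL are placeholders.
desc <- paste(
  "Methods described in Doe (2018) <doi:10.1000/xyz123>,",
  "see also <arXiv:1234.56789> and <http://example.org/pkg>.",
  "Requires x < 5 & y > 2."
)

# One illustrative pattern covering the three bracketed forms:
# <doi:...>, <arXiv:...>, and <http://...> / <https://...> / <ftp://...>.
pattern <- "<(doi:|arXiv:|https?://|ftp://)[^<>[:space:]]+>"

# gregexpr() finds every match; regmatches() extracts the matched text.
refs <- regmatches(desc, gregexpr(pattern, desc, ignore.case = TRUE))[[1]]
print(refs)
```

The stray `<`, `&`, and `>` in the last sentence are deliberately left unmatched, since they are ordinary text rather than references.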

hadley commented 6 years ago

Would you be interested in implementing this? I suspect it mostly involves some spelunking to find the function that does this for CRAN.

hadley commented 6 years ago

Ok, I've done a bit of spelunking and it doesn't look like the code is publicly available anywhere. However, it should be fairly straightforward to implement from first principles using regular expressions.

bastistician commented 6 years ago

Here's my humble draft of a function to convert bracketed links in the description text to html:

linkify <- function(text) {
  text <- gsub("<doi:([^>]+)>",
               "&lt;<a href='https://doi.org/\\1'>doi:\\1</a>&gt;",
               text, ignore.case = TRUE)
  text <- gsub("<arXiv:([^>]+)>",
               "&lt;<a href='https://arxiv.org/abs/\\1'>arXiv:\\1</a>&gt;",
               text, ignore.case = TRUE)
  text <- gsub("<((http|ftp)[^>]+)>",
               "&lt;<a href='\\1'>\\1</a>&gt;",
               text)
  text
}

However, the result may still contain special html characters (<, >, &) which need to be escaped... Maybe htmltools can help here?

A more sophisticated implementation, which takes care of links and DOI (but not arXiv) references is htmlify() as defined within the recently added tools:::toHTML.citation() (https://github.com/wch/r-source/blob/565bc1896fe1c971c65348dce2bc6d8412136c92/src/library/tools/R/toHTML.R#L282).

bastistician commented 6 years ago

I just found out that there is no need to separately escape <, >, and & symbols that occur on their own in the description text. This is done by the subsequent update_homepage_html() call in build_home() (probably with xml2 magic). So the above linkify() should suffice. What do you think?

hadley commented 6 years ago

I think you should consider the fix by update_homepage_html() to be incidental to its purpose, and it would be better to explicitly call escape_html() (at the start of linkify()). Otherwise, looks good.
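
For illustration, a sketch of what "escape first, then linkify" could look like. Once the text is escaped, the angle brackets appear as `&lt;` and `&gt;`, so the patterns have to match those entities rather than literal `<` and `>`. Note that `escape_html_sketch()` and `linkify_escaped()` are hypothetical names: the first is a base-R stand-in for pkgdown's internal escape_html(), whose actual implementation is not shown in this thread, and the second is not the code that was eventually merged.

```r
# Stand-in for pkgdown's internal escape_html(); an assumption for this
# sketch, not the package's actual implementation.
escape_html_sketch <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)  # must run first
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x
}

# Variant of linkify() that escapes up front and then matches the
# escaped brackets (&lt; ... &gt;) instead of literal < and >.
linkify_escaped <- function(text) {
  text <- escape_html_sketch(text)
  text <- gsub("&lt;doi:(\\S+?)&gt;",
               "&lt;<a href='https://doi.org/\\1'>doi:\\1</a>&gt;",
               text, ignore.case = TRUE, perl = TRUE)
  text <- gsub("&lt;arXiv:(\\S+?)&gt;",
               "&lt;<a href='https://arxiv.org/abs/\\1'>arXiv:\\1</a>&gt;",
               text, ignore.case = TRUE, perl = TRUE)
  text <- gsub("&lt;((http|ftp)\\S+?)&gt;",
               "&lt;<a href='\\1'>\\1</a>&gt;",
               text, perl = TRUE)
  text
}

# Placeholder DOI and URL, plus stray special characters.
out <- linkify_escaped(
  "See <doi:10.1000/xyz123> & <http://example.org/a?b=1&c=2>, x < 5."
)
cat(out, "\n")
```

A side effect of escaping first is that an `&` inside a URL ends up as `&amp;` in both the href attribute and the link text, which is what HTML expects. This sketch does not URL-encode DOIs containing unusual characters.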