r-lib / urlchecker

Run CRAN URL checks from older versions of R
https://urlchecker.r-lib.org/
GNU General Public License v3.0

<DOI:10.1023/A:1005082925477> is not accepted for conversion #30

Closed cthombor closed 10 months ago

cthombor commented 1 year ago

Please consider adjusting url_check()'s handling of <DOI:...> so that it accepts arbitrary strings in the suffix field, rather than throwing an "Invalid URI scheme" error when the suffix contains a colon or looks unusual in any other way.

Screenshot: Screenshot_20230118_040543

At https://link.springer.com/article/10.1023/A:1005082925477, the publisher Springer asserts that the DOI of this article is "10.1023/A:1005082925477".

dx.doi.org has no trouble with this (rather bizarre looking) suffix, with https://dx.doi.org/10.1023/A:1005082925477 redirecting to the publisher's page.

As noted in section 2.2 of the DOI handbook:

In use, the DOI name is an "opaque string" or "dumb number" — nothing at all can or should be inferred from the number in respect of its use in the DOI system. The only secure way of knowing anything about the entity that a particular DOI name identifies is by looking at the metadata that the Registrant of the DOI name declares at the time of registration. This means, for example, that even when the ownership of a particular item changes, its identifier remains the same — in perpetuity. This is why the DOI name is called a "persistent identifier".

The DOI syntax shall be made up of a DOI prefix and a DOI suffix separated by a forward slash.

There is no defined limit on the length of the DOI name, or of the DOI prefix or DOI suffix.

The DOI name is case-insensitive and can incorporate any printable characters from the legal graphic characters of Unicode.
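The handbook rules quoted above can be sketched in a few lines. This is a Python illustration only, not urlchecker code. The helper names `split_doi` and `same_doi` are hypothetical, and `str.casefold()` is assumed to be an adequate stand-in for the handbook's case-insensitivity rule.

```python
def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI name into (prefix, suffix) at the first forward slash.

    The suffix is treated as an opaque string: colons, further slashes,
    and other printable characters are passed through untouched.
    """
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a DOI name (expected prefix/suffix): {doi!r}")
    return prefix, suffix


def same_doi(a: str, b: str) -> bool:
    """Compare two DOI names case-insensitively (assumption: casefold suffices)."""
    return a.casefold() == b.casefold()


prefix, suffix = split_doi("10.1023/A:1005082925477")
print(prefix)  # the registrant prefix
print(suffix)  # the opaque suffix, colon included
```

The point of the sketch is that nothing after the first slash is inspected, so a colon in the suffix needs no special case.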

<DOI:10.1023/A&#58;1005082925477> is not a workaround, giving the same error message.

Springer could use arbitrary characters in the suffixes of the other DOIs in their domain. So I think there's little to be gained by adding a special case to handle a "Springer-style" colon in the suffix of a DOI, although I'd guess that some escaping will be necessary, given the DOI: syntax of the callout.
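One generic form of escaping is to percent-encode the DOI before appending it to a resolver URL. Here is a minimal Python sketch, again not urlchecker code. The helper name `doi_to_url` is hypothetical, and it is an assumption that the resolver accepts the percent-encoded form in the same way as the raw one.

```python
from urllib.parse import quote

# Resolver base taken from the URL used earlier in this report.
RESOLVER = "https://dx.doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver URL, percent-encoding everything except '/'.

    Treating the DOI as an opaque string means no character gets a
    special case; ':' is simply encoded along with any other
    reserved character.
    """
    return RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.1023/A:1005082925477"))
```

Encoding the whole name uniformly avoids having to enumerate which publishers use which characters in their suffixes.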

I'm guessing that a <DOI:10.1023/A:1005082925477> callout in the @description field of a 'data.R' element is what caused 'devtools::check_win_release()' to throw a fatal error (with a not-very-helpful message, reproduced below) on a package that I'm currently attempting to submit to CRAN:

[HttpException (0x80004005): A potentially dangerous Request.Path value was detected from the client (<).] System.Web.HttpRequest.ValidateInputIfRequiredByConfig() +11790877 System.Web.PipelineStepManager.ValidateHelper(HttpContext context) +54

'devtools::check_rhub(platforms="windows-x86_64-devel")' on my minimal 'testdoit' package does not throw an HttpException, although I think MiKTeX might have thrown an exception:

image

My glance at your 'urlchecker' codebase suggests to me that 'url_check()' is relying on something in Pandoc to handle the callouts.

So... perhaps this issue should be "kicked upstairs" to the Pandoc team? But that would require an MRE, which would in turn require some knowledge of Pandoc, and I'm a complete duffer in that regard.

Please understand that I'm not much more than a duffer in R, as this is my first experience with trying to create a package in R for submission to CRAN. I did hack pretty extensively in S, in the early 1990s, when using it to interpret test results from a multistream PRNG package I had developed in C/C++... but R has moved a long way from that codebase. In very impressive ways!

I'd like to take this opportunity to thank you and all the other volunteers for your work over the decades. It's still a very quirky language IMHO; but its packages are in amazingly good nick, with analysis and presentation features far more advanced than what I remember of S in the 1990s!

gaborcsardi commented 1 year ago

Thanks for the report! Can you please link to a package that reproduces the behavior you want to change, and also include the output of urlchecker::url_check() on that package? Thanks!

gaborcsardi commented 10 months ago

I am closing this for lack of information. Please reopen with more info if you still have this issue. Thanks!