ropensci-archive / crminer

:no_entry: ARCHIVED :no_entry: Fetch 'Scholarly' Full Text from 'Crossref'

Crossref API reports the MIME type 'unspecified' for some Wiley full-text PDFs, and therefore crm_pdf fails #9

Closed behrica closed 7 years ago

behrica commented 7 years ago

This does work:

l <- crm_links("10.2903/j.efsa.2016.4556",type="all")
crm_pdf(l)

while this does not:

l <- crm_links("10.2903/j.efsa.2014.3550",type="all")
crm_pdf(l)

The root cause of the error is that the Crossref API returns content-type 'unspecified' in the second case.

"message" : {
      "link" : [
         {
            "URL" : "https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.2903%2Fj.efsa.2014.3550",
            "content-version" : "vor",
            "content-type" : "unspecified",
            "intended-application" : "text-mining"
         }
      ],
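To make the check concrete, here is a minimal sketch in base R that flags links whose declared content type is 'unspecified'. The `links` list is a hand-built stand-in for the parsed "link" array above (an assumption for illustration, not what crm_links() returns internally), and the variable names are hypothetical.

```r
# Stand-in for the parsed "link" array from the Crossref response above.
# Keys containing hyphens need backticks when used as list names.
links <- list(
  list(
    URL = "https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.2903%2Fj.efsa.2014.3550",
    `content-version` = "vor",
    `content-type` = "unspecified",
    `intended-application` = "text-mining"
  )
)

# Pull out the declared content type of every link.
types <- vapply(links, function(x) x[["content-type"]], character(1))

# TRUE for each link whose type Crossref did not specify.
unspecified <- types == "unspecified"
```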

I can get it working by manually overriding the content type, like this:

l <- setNames(l, "pdf")
attr(l, "type") <- "pdf"
text <- crm_text(l, type = "pdf")

but this is of course a hack.

After investigating the code, I think there is an inconsistency between crm_links(), which returns the mime-type 'unspecified', and the crm_text() method, which cannot handle 'unspecified'.

I think crm_text() should be changed so that it can handle 'unspecified' by simply downloading the file and writing it to disk, the same way a download with 'curl' does. With curl, a mime-type of 'unspecified' does not stop me from downloading.
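One way such a fallback could look, as a rough sketch only: once the file is on disk, the declared type is trusted unless it is 'unspecified', in which case the first bytes of the file decide. `sniff_type()` and `resolve_type()` are hypothetical helper names, not crminer functions, and the temp file below stands in for a downloaded article so the example runs without network access.

```r
# Hypothetical helper, not part of crminer: infer a type from the first
# bytes of an already-downloaded file.
sniff_type <- function(path) {
  magic <- readBin(path, what = "raw", n = 5L)
  # PDF files start with the signature "%PDF-".
  if (length(magic) == 5L && identical(magic, charToRaw("%PDF-"))) {
    return("pdf")
  }
  # Crude guess: a leading "<" suggests XML (or HTML).
  if (length(magic) >= 1L && identical(magic[1], charToRaw("<"))) {
    return("xml")
  }
  "plain"
}

# Hypothetical helper: trust the declared type unless it is "unspecified".
resolve_type <- function(declared, path) {
  if (identical(declared, "unspecified")) sniff_type(path) else declared
}

# Demo on a local temp file standing in for a downloaded article.
tmp <- tempfile()
writeBin(charToRaw("%PDF-1.4\n"), tmp)
resolved <- resolve_type("unspecified", tmp)
```

With this approach the manual override above would not be needed, because the type is derived from the file itself rather than from the Crossref metadata.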

I will take a look at this and send a PR if I find a solution.

sckott commented 7 years ago

Thanks for this. Right, unspecified is a common mime type, unfortunately.

And it's entirely possible it's not handled well right now. I do think I've accounted for this in the Ruby version of this package, but not here yet.

sckott commented 7 years ago

Fixed via #10.