ropensci-archive / rromeo

:package: An R client for SHERPA/RoMEO API V1
GNU General Public License v3.0

Recursive queries fail when SHERPA/RoMEO doesn't know the ISSN #10

Closed Bisaloo closed 6 years ago

Bisaloo commented 6 years ago

For example:

rr_journal_name("Evolutionary", qtype = "contains", multiple = FALSE)

will return a data frame of journals, some of which do not have an ISSN. So multiple = TRUE fails because validate_issn() complains that "" is not a valid ISSN.

Maybe we should perform the search using the title in this case? We would need to check what happens with XML- or HTTP-encoded characters.

Also, some journals with a missing ISSN still have an ESSN, so maybe we can do something with this.
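
To make the failure mode concrete, here is a minimal sketch of how the rows could be split before any recursive query. The helper `is_valid_issn()` is hypothetical (it is not rromeo's `validate_issn()`), and the two-row data frame is only illustrative; the check uses the standard ISSN check character (weights 8 to 2, modulo 11).

```r
# Hypothetical helper, not part of rromeo: TRUE when an element of `x` is a
# well-formed ISSN (NNNN-NNNC) whose check character matches its first 7 digits.
is_valid_issn <- function(x) {
  vapply(x, function(issn) {
    if (is.na(issn) || !grepl("^[0-9]{4}-[0-9]{3}[0-9Xx]$", issn)) {
      return(FALSE)
    }
    chars  <- strsplit(gsub("-", "", toupper(issn)), "")[[1]]
    digits <- as.integer(chars[1:7])
    # Weighted sum with weights 8..2, then the check value modulo 11
    check    <- (11 - sum(digits * 8:2) %% 11) %% 11
    expected <- if (check == 10) "X" else as.character(check)
    chars[8] == expected
  }, logical(1), USE.NAMES = FALSE)
}

# Illustrative data only: one journal with an ISSN, one without
journals <- data.frame(
  title = c("BMC Evolutionary Biology", "BIOINFO Evolutionary Biology"),
  issn  = c("1471-2148", ""),
  stringsAsFactors = FALSE
)

# Rows that can safely be passed on to an ISSN-based query
queryable <- journals[is_valid_issn(journals$issn), ]
# Rows that need another strategy (title or ESSN search, or skipping)
leftover  <- journals[!is_valid_issn(journals$issn), ]
```

Splitting first means the empty string never reaches the ISSN validation, whatever we then decide to do with the leftover rows.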

Rekyt commented 6 years ago

Then maybe the easiest thing to do would be to search for all exact titles if we found them at first. However, it seems that sometimes a title that is returned by the first query cannot be found in RoMEO when queried exactly:

library("rromeo")
rr_journal_name("Evolutionary", qtype = "contains", multiple = FALSE) -> h
#> Warning in parse_answer(api_answer, multiple = multiple): 43 journals match your query terms.
#> Warning in parse_answer(api_answer, multiple = multiple): Select one
#> journal from the provided list or enable multiple = TRUE
lapply(h$title, function(x) rr_journal_name(x, qtype = "exact"))
#> Error in parse_answer(api_answer, multiple = multiple): No journal matches your query terms. Please try another query.

Created on 2018-11-03 by the reprex package (v0.2.1)

Rekyt commented 6 years ago

We can find out which journals are problematic using the following snippet:

Code Snippet

```r
library("rromeo")
rr_journal_name("Evolutionary", qtype = "contains", multiple = FALSE) -> h
#> Warning in parse_answer(api_answer, multiple = multiple): 43 journals match your query terms.
#> Warning in parse_answer(api_answer, multiple = multiple): Select one
#> journal from the provided list or enable multiple = TRUE
g = lapply(h$title, function(x) tryCatch({
  rr_journal_name(x, qtype = "exact")
}, error = function(error) {
  data.frame(title = x)
}))
suppressWarnings(dplyr::bind_rows(g))
#>    title
#> 1  Advances in evolutionary biology
#> 2  Anatomical Record: Advances in Integrative Anatomy and Evolutionary Biology
#> 3  Anatomical Record Part a Discoveries in Molecular Cellular and Evolutionary Biology
#> 4  BIOINFO Evolutionary Biology
#> 5  BMC Evolutionary Biology
#> 6  Cambridge Studies in Biological and Evolutionary Anthropology
#> 7  Evolutionary and Institutional Economics Review
#> 8  Evolutionary Anthropology
#> 9  Evolutionary Applications
#> 10 Evolutionary Behavioral Sciences
#> 11 Evolutionary Bioinformatics
#> 12 Evolutionary Biology
#> 13 Evolutionary Computation
#> 14 Evolutionary computation, machine learning and data mining in bioinformatics. EvoBIO (Conference), author
#> 15 Evolutionary Ecology
#> 16 Evolutionary Ecology Research
#> 17 Evolutionary Intelligence
#> 18 Evolutionary Psychological Science
#> 19 Evolutionary Psychology
#> 20 Evolutionary Systematics
#> 21 Frontiers in Evolutionary Neuroscience
#> 22 Genetic and Evolutionary Computation Conference : [proceedings] / sponsored by ACM SIGEVO. Genetic and Evolutionary Computation Conference
#> 23 IEEE Transactions on Evolutionary Computation
#> 24 International Journal of Applied Evolutionary Computation
#> 25 International Journal of Evolutionary Biology
#> 26 International Journal of Systematic and Evolutionary Microbiology
#> 27 Journal of Cultural and Evolutionary Psychology
#> 28 Journal of Evolutionary Biochemistry and Physiology / Zhurnal Evolyutsionnoi Biokhimii i Fiziologii
#> 29 Journal of Evolutionary Biology
#> 30 Journal of Evolutionary Biology Research
#> 31 Journal of Evolutionary Economics
#> 32 Journal of Evolutionary Psychology
#> 33 Journal of Evolutionary Studies in Business
#> 34 Journal of Phylogenetics and Evolutionary Biology
#> 35 Journal of Social and Evolutionary Systems
#> 36 Journal of social, evolutionary & cultural psychology : JSEC
#> 37 Journal of Zoological Systematics and Evolutionary Research
#> 38 Proceedings. Consortium on Revolutionary Europe, 1750-1850
#> 39 Proceedings of the ... Congress on Evolutionary Computation. Congress on Evolutionary Computation
#> 40 Proceedings of the Genetic and Evolutionary Computation Conference / GECCO. Genetic and Evolutionary Computation Conference
#> 41 Revolutionary Russia
#> 42 Swarm and Evolutionary Computation
#> 43 Trends in Evolutionary Biology
#>    issn      preprint postprint  pdf        romeocolour
#> 1  2356-671X
#> 2  1932-8486 can      restricted cannot     yellow
#> 3  1552-4884 unknown  unknown    unknown    gray
#> 4            can      can        can        green
#> 5  1471-2148 can      can        can        green
#> 6  1746-2266 can      can        cannot     green
#> 7  1349-4961 can      can        cannot     green
#> 8  1060-1538 can      restricted cannot     yellow
#> 9  1752-4563 can      can        can        green
#> 10 2330-2925 can      can        cannot     green
#> 11 1176-9343 can      can        can        green
#> 12 0071-3260 can      can        cannot     green
#> 13 1063-6560 can      can        restricted green
#> 14
#> 15 0269-7653 can      can        cannot     green
#> 16 1522-0613 cannot   restricted restricted white
#> 17 1864-5909 can      can        cannot     green
#> 18 2198-9885 can      can        cannot     green
#> 19 1474-7049 can      can        can        green
#> 20 2535-0730 unclear  can        can        blue
#> 21 1663-070X can      can        can        green
#> 22
#> 23 1089-778X can      can        cannot     green
#> 24 1942-3594 cannot   cannot     can        blue
#> 25 2090-8032 can      can        can        green
#> 26 1466-5026 can      can        cannot     green
#> 27 1589-5254 can      can        cannot     green
#> 28
#> 29 1010-061X can      restricted cannot     yellow
#> 30           can      can        can        green
#> 31 0936-9937 can      can        cannot     green
#> 32 1789-2082 can      can        cannot     green
#> 33 2385-7137 unknown  unknown    unknown    gray
#> 34 2329-9002 unclear  can        can        blue
#> 35 1061-7361
#> 36
#> 37 0947-5745 can      restricted cannot     yellow
#> 38 0093-2574
#> 39
#> 40
#> 41 0954-6545 can      can        cannot     green
#> 42 2210-6502 can      can        cannot     green
#> 43 2036-265X can      can        can        green
```

Created on 2018-11-03 by the [reprex package](https://reprex.tidyverse.org) (v0.2.1)

Giving the following result:

structure(list(title = c("Evolutionary computation, machine learning and data mining in bioinformatics. EvoBIO (Conference), author", 
"Genetic and Evolutionary Computation Conference : [proceedings] / sponsored by ACM SIGEVO. Genetic and Evolutionary Computation Conference", 
"Journal of Evolutionary Biochemistry and Physiology / Zhurnal Evolyutsionnoi Biokhimii i Fiziologii", 
"Journal of social, evolutionary & cultural psychology : JSEC", 
"Proceedings of the Genetic and Evolutionary Computation Conference / GECCO. Genetic and Evolutionary Computation Conference"
), issn = c(NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_), preprint = c(NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_), postprint = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_), pdf = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    romeocolour = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_)), class = "data.frame", row.names = c(NA, 
-5L))

Journals with "/" in their names include a translated title. For example, there is "Journal of Evolutionary Biochemistry and Physiology / Zhurnal Evolyutsionnoi Biokhimii i Fiziologii". If we query the full name, we get no results:

rr_journal_name("Journal of Evolutionary Biochemistry and Physiology / Zhurnal Evolyutsionnoi Biokhimii i Fiziologii", qtype = "exact")
#> Error in parse_answer(api_answer, multiple = multiple) : 
#>  No journal matches your query terms. Please try another query

While querying only the English name returns results:

rr_journal_name("Journal of Evolutionary Biochemistry and Physiology", qtype = "exact")
#>                                                                                                 title
#> 1 Journal of Evolutionary Biochemistry and Physiology / Zhurnal Evolyutsionnoi Biokhimii i Fiziologii
#>        issn preprint postprint     pdf romeocolour
#> 1 0022-0930  unclear       can unknown        blue
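
One possible workaround, sketched here under the assumption that translated titles are always separated by " / ": build a list of candidate titles (the full title first, then each part) and try them in turn with qtype = "exact". The helper `title_candidates()` is hypothetical and does not exist in rromeo. It has only been reasoned through for the example above, so other "/" titles in the list may behave differently.

```r
# Hypothetical helper: candidate titles to try in order, starting with the
# full title, followed by each part found around a " / " separator.
title_candidates <- function(title) {
  parts <- trimws(strsplit(title, " / ", fixed = TRUE)[[1]])
  unique(c(title, parts[nzchar(parts)]))
}

full_title <- paste(
  "Journal of Evolutionary Biochemistry and Physiology /",
  "Zhurnal Evolyutsionnoi Biokhimii i Fiziologii"
)
candidates <- title_candidates(full_title)
```

Each element of `candidates` could then be passed to rr_journal_name() inside a tryCatch(), stopping at the first one that returns a result.
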
Rekyt commented 6 years ago

So there doesn't seem to be a quick and easy solution to date... From the API docs, it seems possible to query the API with the ESSN. Do we get the ESSN back when performing multiple queries?

Bisaloo commented 6 years ago

Do we get the ESSN back when performing multiple queries?

Hum, no, we don't :confused:

Rekyt commented 6 years ago

For the moment, we could drop the journals that don't have an ISSN, with a warning. That would avoid the problems when using multiple = TRUE.
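
A minimal sketch of that idea, assuming the intermediate result is a data frame with `title` and `issn` columns as in the outputs above. The function name `drop_missing_issn()` is made up for illustration and is not an existing rromeo function.

```r
# Hypothetical helper: remove rows whose ISSN is NA or empty and tell the
# user which journals were left out.
drop_missing_issn <- function(journals) {
  missing <- is.na(journals$issn) | !nzchar(journals$issn)
  if (any(missing)) {
    warning(
      sum(missing), " journal(s) without an ISSN were dropped: ",
      paste(journals$title[missing], collapse = "; "),
      call. = FALSE
    )
  }
  journals[!missing, , drop = FALSE]
}

# Illustrative data only
journals <- data.frame(
  title = c("Evolutionary Ecology", "BIOINFO Evolutionary Biology"),
  issn  = c("0269-7653", ""),
  stringsAsFactors = FALSE
)
kept <- suppressWarnings(drop_missing_issn(journals))
```

Listing the dropped titles in the warning lets the user look those journals up by hand.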

Rekyt commented 6 years ago

Even with a warning, there are still problems because some journals have two entries in the database, like the following: http://www.sherpa.ac.uk/romeo/search.php?jtitle=evolution+psychiatrique&issn=0014-3855&zetocpub=Elsevier+Masson&romeopub=Elsevier&fIDnum=|&mode=simple&la=en&version=&source=journal&sourceid=10528

So when querying we get different warnings:

library("rromeo")
rr_journal_name("Évolution Psychiatrique", multiple = FALSE, qtype = "exact")
#> Warning in parse_answer(api_answer, multiple = multiple): 2 journals match
#> your query terms.
#> Warning in parse_answer(api_answer, multiple = multiple): Select one
#> journal from the provided list or enable multiple = TRUE
#>                     title      issn
#> 1 Évolution Psychiatrique 0014-3855
rr_journal_name("Évolution Psychiatrique", multiple = TRUE, qtype = "exact")
#> Warning in parse_answer(api_answer, multiple = multiple): 2 journals match
#> your query terms.
#> Recursively fetching data from each journal. This may take some time...
#> Warning in parse_answer(api_answer, multiple = FALSE): 2 journals match
#> your query terms.
#> Warning in parse_answer(api_answer, multiple = FALSE): Select one journal
#> from the provided list or enable multiple = TRUE
#>                     title      issn
#> 1 Évolution Psychiatrique 0014-3855

Created on 2018-11-05 by the reprex package (v0.2.1)

Bisaloo commented 6 years ago

Nice catch!

We can actually find those edge cases by parsing the outcome field. In the case of issn=0014-3855, it returns uniqueZetoc. For a "normal" single journal, it returns singleJournal and for multiple journals, it returns manyJournals.

Now, we need to ensure that xml_find_first will return the correct policy in those cases.
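
As a rough sketch of the branching this implies, assuming the outcome value has already been extracted from the answer as a single string. The function `describe_outcome()` and its labels are hypothetical; only the three outcome values named above are handled, and anything else falls through to a default.

```r
# Hypothetical helper: map the three outcome values seen so far to the
# situation they seem to describe.
describe_outcome <- function(outcome) {
  switch(outcome,
    singleJournal = "single journal",
    uniqueZetoc   = "single ISSN with several entries",
    manyJournals  = "multiple journals",
    "unhandled outcome"  # default for any value not listed above
  )
}

describe_outcome("uniqueZetoc")
```
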

Bisaloo commented 6 years ago

http://www.sherpa.ac.uk/romeo/publishertypes.php?fIDnum=|&mode=simple&la=en&version=

I'm not really sure what that means in the case of 0014-3855 for example...

Bisaloo commented 6 years ago

FYI: I'm going to prepare and push some commits for this.

Bisaloo commented 6 years ago

Should we add a warning in this case :thinking:?

Rekyt commented 6 years ago

Well, if the user reached the limit that would still be important to know, right?

Bisaloo commented 6 years ago

Hum, I'm not sure what you mean. I was thinking of a warning here:

https://github.com/Rekyt/rromeo/blob/494d453a4615ff2c196a023d0b7c7a3d15ed41d3/R/utils.R#L33-L37

saying something like: "this journal has multiple publishers with different policies. We tried to return the most relevant one but you should also check the detailed policy."
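
Something along these lines, as a sketch only: the function name `notify_single_policy()` is hypothetical, it assumes the outcome string is available at that point in the code, and it uses message() where warning() would work equally well.

```r
# Hypothetical helper: tell the user when only one of several policies is
# returned for a journal.
notify_single_policy <- function(outcome) {
  if (identical(outcome, "uniqueZetoc")) {
    message(
      "This journal has multiple publishers with different policies. ",
      "We tried to return the most relevant one but you should also ",
      "check the detailed policy."
    )
  }
  invisible(outcome)
}

notify_single_policy("uniqueZetoc")
```
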

Rekyt commented 6 years ago

Whoops, I misunderstood! I hadn't seen the second commit. Yep, at least a message saying that we chose to return a single policy.

Rekyt commented 6 years ago

Fixed by PR #16