ropensci / rcrossref

R client for various CrossRef APIs
https://docs.ropensci.org/rcrossref

Score system #207

Closed Adafede closed 4 years ago

Adafede commented 4 years ago
Session Info

```r
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] fr_CH.UTF-8/fr_CH.UTF-8/fr_CH.UTF-8/C/fr_CH.UTF-8/fr_CH.UTF-8

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] zoo_1.8-8 XML_3.99-0.3 webchem_1.0.0 UpSetR_1.4.0 forcats_0.5.0
[6] tidyr_1.1.0 tibble_3.0.1 tidyverse_1.3.0 taxize_0.9.96 stringr_1.4.0
[11] stringi_1.4.6 splitstackshape_1.4.8 rvest_0.3.5 xml2_1.3.2 reticulate_1.16
[16] rentrez_1.2.2 readxl_1.3.1 readr_1.3.1 rcrossref_1.0.0 RColorBrewer_1.1-2
[21] purrr_0.3.4 pbmcapply_1.5.0 jsonlite_1.6.1 igraph_1.2.5 ggraph_2.0.3
[26] eulerr_6.1.0 dplyr_1.0.0 digest_0.6.25 data.table_1.12.8 collapsibleTree_0.1.7
[31] chorddiag_0.1.2 ChemmineR_3.40.0 plotly_4.9.2.1 Hmisc_4.4-0 ggplot2_3.3.1
[36] Formula_1.2-3 survival_3.1-12 lattice_0.20-41

loaded via a namespace (and not attached):
[1] colorspace_1.4-1 rjson_0.2.20 ellipsis_0.3.1 htmlTable_1.13.3 fs_1.4.1 base64enc_0.1-3
[7] httpcode_0.3.0 rstudioapi_0.11 farver_2.0.3 urltools_1.7.3 graphlayouts_0.7.0 ggrepel_0.8.2
[13] DT_0.13 lubridate_1.7.8 fansi_0.4.1 codetools_0.2-16 splines_4.0.0 bold_1.0.0
[19] knitr_1.28 polyclip_1.10-0 broom_0.5.6 dbplyr_1.4.4 cluster_2.1.0 png_0.1-7
[25] ggforce_0.3.1 shiny_1.4.0.2 data.tree_0.7.11 compiler_4.0.0 httr_1.4.1 backports_1.1.7
[31] assertthat_0.2.1 Matrix_1.2-18 fastmap_1.0.1 lazyeval_0.2.2 cli_2.0.2 later_1.1.0.1
[37] tweenr_1.0.1 acepack_1.4.1 htmltools_0.4.0 tools_4.0.0 gtable_0.3.0 glue_1.4.1
[43] rsvg_2.1 tinytex_0.23 Rcpp_1.0.4.6 cellranger_1.1.0 vctrs_0.3.1 crul_0.9.0
[49] ape_5.4 nlme_3.1-148 iterators_1.0.12 xfun_0.14 mime_0.9 miniUI_0.1.1.1
[55] lifecycle_0.2.0 MASS_7.3-51.6 scales_1.1.1 tidygraph_1.2.0 hms_0.5.3 promises_1.1.0
[61] curl_4.3 gridExtra_2.3 triebeard_0.3.0 rpart_4.1-15 reshape_0.8.8 latticeExtra_0.6-29
[67] foreach_1.5.0 checkmate_2.0.0 bibtex_0.4.2.2 rlang_0.4.6 pkgconfig_2.0.3 bitops_1.0-6
[73] htmlwidgets_1.5.1 tidyselect_1.1.0 plyr_1.8.6 magrittr_1.5 R6_2.4.1 generics_0.0.2
[79] DBI_1.1.0 haven_2.3.1 pillar_1.4.4 foreign_0.8-80 withr_2.2.0 RCurl_1.98-1.2
[85] nnet_7.3-14 modelr_0.1.8 crayon_1.3.4 viridis_0.5.1 jpeg_0.1-8.1 grid_4.0.0
[91] blob_1.2.1 reprex_0.3.0 xtable_1.8-4 httpuv_1.5.4 munsell_0.5.0 viridisLite_0.3.0
```

Hi,

Another issue, or question this time:

Let's take the following entry (X) as an example:

Luesch, Hendrik; Yoshida, Wesley Y.; Moore, Richard E.; Paul, Valerie J.; Journal of Natural Products; vol. 63; 10; (2000); p. 1437 - 1439.

After running reference retrieval with `cr_works(query = X, sort = 'score', order = "desc")`, I obtain at rank 1:

Apramides A−G, Novel Lipopeptides from the Marine CyanobacteriumLyngbya majuscula

with a score of 72.79652. This result is WRONG.

When I give only the journal part of the entry (Y):

Journal of Natural Products; vol. 63; 10; (2000); p. 1437 - 1439

and run `cr_works(query = Y, sort = 'score', order = "desc")`, I obtain at rank 1:

Isolation and Structure of the Cytotoxin Lyngbyabellin B and Absolute Configuration of Lyngbyapeptin A from the Marine CyanobacteriumLyngbya majuscula

with a score of 26.922556. This result is CORRECT.

Conclusion: more information gives a higher score but the wrong result, while less information gives a lower score but the correct result (I am deliberately putting it this bluntly).

How would you judge this? Is there an option I missed that might help with this kind of problem?

Many thanks

sckott commented 4 years ago

thanks for the issue @Adafede

in your use case, do you have the ability to split up the components of the citation, eg., to authors, title, volume, issue, year, etc. ?

Adafede commented 4 years ago

That is precisely my problem. I work with heterogeneous data. Some of it is extremely clean, with every field atomic; the rest has everything mixed together, with no identifiable delimiter to split on. (The example I posted is far from being the worst one.)

When I can split, I do, but sadly I sometimes cannot do so with confidence.

sckott commented 4 years ago

Have you tried field queries?

I think this may work better:

```r
library(rcrossref)

# Pass the whole raw citation as a bibliographic field query instead of a plain `query`
x <- "Luesch, Hendrik; Yoshida, Wesley Y.; Moore, Richard E.; Paul, Valerie J.; Journal of Natural Products; vol. 63; 10; (2000); p. 1437 - 1439."
z <- cr_works(flq = c(query.bibliographic = x), sort = 'score', order = "desc")
z$data$title[1:2]
#> [1] "Isolation and Structure of the Cytotoxin Lyngbyabellin B and Absolute Configuration of Lyngbyapeptin A from the Marine CyanobacteriumLyngbya majuscula"
#> [2] "Apramides A−G, Novel Lipopeptides from the Marine CyanobacteriumLyngbya majuscula"
```

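For a whole column of unsplit citations, the same field query could be wrapped in a small helper. The following is only a sketch: `top_bibliographic_hit()` and `match_citations()` are hypothetical names, not part of rcrossref, and the code assumes that `cr_works()` returns its hits in `$data` with `title`, `doi` and `score` columns. The `search` argument is there only so that the API call can be replaced by a stub when testing offline.

```r
# Hypothetical helpers (not part of rcrossref).
# Assumption: the search result is a list whose `data` element is a data frame
# with `title`, `doi` and `score` columns.

# Return the top-ranked hit for one free-text citation.
top_bibliographic_hit <- function(citation, search = rcrossref::cr_works) {
  res <- search(
    flq = c(query.bibliographic = citation),  # field query, as suggested above
    sort = "score", order = "desc", limit = 1
  )
  hits <- res$data
  # First value of a column, or NA if the column is missing or empty.
  first <- function(x) if (is.null(x) || length(x) == 0) NA else x[1]
  if (is.null(hits) || nrow(hits) == 0) hits <- list()
  data.frame(
    citation = citation,
    title = as.character(first(hits$title)),
    doi = as.character(first(hits$doi)),
    score = as.numeric(first(hits$score)),
    stringsAsFactors = FALSE
  )
}

# Apply the helper to a character vector of citations and stack the rows.
match_citations <- function(citations, search = rcrossref::cr_works) {
  do.call(rbind, lapply(citations, top_bibliographic_hit, search = search))
}
```

Calling `match_citations(x)` with the `x` defined above should then give one row per citation. The score is kept for inspection only: as the two plain queries earlier in this thread show, a higher score on one query does not mean a more reliable match than a lower score on another.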
Adafede commented 4 years ago

I had a quick look at field queries, thinking they could indeed be a good option, but I have not tested them yet!

I'll try with your suggestion and come back to you.

Thank you very much :)

Adafede commented 4 years ago

Hi, coming back to you again: it does indeed work much better!

Thank you very much!

sckott commented 4 years ago

great, glad it works!