ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Open access #135

Closed trangdata closed 1 year ago

trangdata commented 1 year ago

Closes #118, #134

yjunechoe commented 1 year ago

Good to merge as is, but one thing for us to think about is that sometimes the is_oa column doesn't match the open_access$is_oa nested column.

Our hands are of course kinda tied in these cases, but there are some false negatives (is_oa is FALSE but the nested open_access$is_oa column is TRUE) which we could catch and fix (by also setting is_oa to TRUE in those cases):

false_negative_OAs <- c(
  "https://openalex.org/W2165545766",
  "https://openalex.org/W2104143313",
  "https://openalex.org/W1533806699",
  "https://openalex.org/W2005616914",
  "https://openalex.org/W106193068",
  "https://openalex.org/W2626261620",
  "https://openalex.org/W2219574487",
  "https://openalex.org/W4236254570",
  "https://openalex.org/W2800319334",
  "https://openalex.org/W1947610836",
  "https://openalex.org/W3168388584",
  "https://openalex.org/W2325295993",
  "https://openalex.org/W4235001864"
)

oa_fetch(identifier = false_negative_OAs) |> 
  dplyr::select(id, is_oa, open_access) |> 
  tidyr::unnest_longer(open_access)
#> # A tibble: 13 × 3
#>    id                               is_oa open_access$is_oa $oa_status $oa_url                                                                                   $any_repository_has_fulltext
#>    <chr>                            <lgl> <lgl>             <chr>      <chr>                                                                                     <lgl>                       
#>  1 https://openalex.org/W2165545766 FALSE TRUE              green      http://www.mapageweb.umontreal.ca/tuitekj/cours/chomsky/Hauser-Chomsky-Fitch.pdf          TRUE                        
#>  2 https://openalex.org/W2104143313 FALSE TRUE              green      https://dash.harvard.edu/bitstream/1/3117935/1/Hauser_EvolutionLanguageFaculty.pdf        TRUE                        
#>  3 https://openalex.org/W1533806699 FALSE TRUE              green      https://dspace.mit.edu/bitstream/1721.1/86586/2/48125267-MIT.pdf                          TRUE                        
#>  4 https://openalex.org/W2005616914 FALSE TRUE              green      https://dspace.mit.edu/bitstream/1721.1/103525/2/10936_2014_9331_ReferencePDF.pdf         TRUE                        
#>  5 https://openalex.org/W106193068  FALSE TRUE              green      http://www.mapageweb.umontreal.ca/tuitekj/cours/chomsky/Hauser-Chomsky-Fitch.pdf          TRUE                        
#>  6 https://openalex.org/W2626261620 FALSE TRUE              green      https://dspace.mit.edu/bitstream/1721.1/128683/2/EveraertTICS2017online.pdf               TRUE                        
#>  7 https://openalex.org/W2219574487 FALSE TRUE              green      https://europepmc.org/articles/pmc2755450?pdf=render                                      TRUE                        
#>  8 https://openalex.org/W4236254570 FALSE TRUE              green      https://dspace.mit.edu/bitstream/1721.1/127796/2/10767_2015_9206_ReferencePDF.pdf         TRUE                        
#>  9 https://openalex.org/W2800319334 FALSE TRUE              green      https://openyls.law.yale.edu/bitstream/20.500.13051/15438/2/51_80YaleLJ1456_June1971_.pdf TRUE                        
#> 10 https://openalex.org/W1947610836 FALSE TRUE              green      http://www.sfu.ca/~jeffpell/papers/Lepore-Pelletier.pdf                                   TRUE                        
#> 11 https://openalex.org/W3168388584 FALSE TRUE              green      http://escholarship.mcgill.ca/downloads/3j333404c                                         TRUE                        
#> 12 https://openalex.org/W2325295993 FALSE TRUE              green      https://journals.openedition.org/lettre-cdf/pdf/898                                       TRUE                        
#> 13 https://openalex.org/W4235001864 FALSE TRUE              green      https://ejop.psychopen.eu/index.php/ejop/article/download/574/574.pdf                     TRUE

trangdata commented 1 year ago

Nice catch, June 👀. It's interesting that primary_location$is_oa is not the same as open_access$is_oa. So the closest-to-version-of-record copy of this work is not open but some other version of it is open elsewhere.

I think I'll do this: unnest this open_access column into 4 columns, and rename open_access$is_oa to is_oa_anywhere. Another advantage of doing this is that it brings oa_url up one level and makes it more obvious to the user. What do you think? @yjunechoe
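
For readers following along, here is a minimal base-R sketch of the hoisting idea described above. It is not the package's implementation: the toy `works` data, the `pluck_field()` helper, and the `hoist_open_access()` function are all hypothetical, and it assumes `open_access` arrives as a list column with one named list per work.

```r
# Toy stand-in for an oa_fetch() result (hypothetical data, not real records).
works <- data.frame(
  id = c("W1", "W2"),
  is_oa = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)
# Assumption: `open_access` is a list column holding one named list per work.
works$open_access <- list(
  list(is_oa = TRUE, oa_status = "green",
       oa_url = "https://example.org/a.pdf",
       any_repository_has_fulltext = TRUE),
  list(is_oa = TRUE, oa_status = "gold",
       oa_url = NULL,
       any_repository_has_fulltext = FALSE)
)

# Hypothetical helper: extract one field from each nested list,
# substituting `missing` when the field is NULL or absent.
pluck_field <- function(col, name, missing) {
  vapply(col, function(x) {
    value <- x[[name]]
    if (is.null(value)) missing else value
  }, FUN.VALUE = missing)
}

# Hypothetical function: replace the nested column with four flat columns,
# renaming the nested is_oa to is_oa_anywhere to avoid a name clash.
hoist_open_access <- function(works) {
  oa <- works$open_access
  works$is_oa_anywhere <- pluck_field(oa, "is_oa", NA)
  works$oa_status <- pluck_field(oa, "oa_status", NA_character_)
  works$oa_url <- pluck_field(oa, "oa_url", NA_character_)
  works$any_repository_has_fulltext <-
    pluck_field(oa, "any_repository_has_fulltext", NA)
  works$open_access <- NULL
  works
}

flat <- hoist_open_access(works)
names(flat)
```

The rename matters because the top-level `is_oa` and the nested `is_oa` can disagree, so the two values need distinct column names once they sit side by side.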

yjunechoe commented 1 year ago

I like the is_oa_anywhere renaming/hoisting - good idea!

I'm also on board with bringing oa_url up, but curious whether that's guaranteed to be length-1 (I didn't encounter any longer ones in my quick search but 👀).

In any case, I agree that a top-level oa_url column would be super useful!

trangdata commented 1 year ago

curious whether that's guaranteed to be length-1

You're thorough as always! I just checked and it looks like this is the "best" URL, so we're expecting length-1 for oa_url! https://docs.openalex.org/api-entities/works/work-object#oa_url
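
A quick empirical check of the length-1 expectation could look like the sketch below. The `open_access` list here is made-up data shaped like the nested column, and `n_urls` and `safe_to_flatten` are illustrative names only.

```r
# Hypothetical nested records, shaped like the open_access list column.
open_access <- list(
  list(is_oa = TRUE, oa_url = "https://example.org/a.pdf"),
  list(is_oa = FALSE, oa_url = NULL)
)

# Count URLs per work: 0 when oa_url is NULL, 1 for a single string.
n_urls <- vapply(open_access, function(x) length(x[["oa_url"]]), integer(1))

# TRUE when no work carries more than one URL, in which case oa_url
# can be stored as a plain character column (with NA for missing values).
safe_to_flatten <- all(n_urls <= 1L)
safe_to_flatten
```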

trangdata commented 1 year ago

What it looks like now:

library(openalexR)
false_negative_OAs <- c(
  "https://openalex.org/W2165545766",
  "https://openalex.org/W2104143313",
  "https://openalex.org/W1533806699",
  "https://openalex.org/W2005616914",
  "https://openalex.org/W106193068",
  "https://openalex.org/W2626261620",
  "https://openalex.org/W2219574487",
  "https://openalex.org/W4236254570",
  "https://openalex.org/W2800319334",
  "https://openalex.org/W1947610836",
  "https://openalex.org/W3168388584",
  "https://openalex.org/W2325295993",
  "https://openalex.org/W4235001864"
)
oa_fetch(identifier = false_negative_OAs) |> 
  dplyr::select(id, dplyr::contains("oa"), pdf_url)
#> # A tibble: 13 × 6
#>    id                               is_oa is_oa_anywhere oa_sta…¹ oa_url pdf_url
#>    <chr>                            <lgl> <lgl>          <chr>    <chr>  <lgl>  
#>  1 https://openalex.org/W2165545766 FALSE TRUE           green    http:… NA     
#>  2 https://openalex.org/W2104143313 FALSE TRUE           green    https… NA     
#>  3 https://openalex.org/W1533806699 FALSE TRUE           green    https… NA     
#>  4 https://openalex.org/W2005616914 FALSE TRUE           green    https… NA     
#>  5 https://openalex.org/W106193068  FALSE TRUE           green    http:… NA     
#>  6 https://openalex.org/W2626261620 FALSE TRUE           green    https… NA     
#>  7 https://openalex.org/W2219574487 FALSE TRUE           green    https… NA     
#>  8 https://openalex.org/W4236254570 FALSE TRUE           green    https… NA     
#>  9 https://openalex.org/W2800319334 FALSE TRUE           green    https… NA     
#> 10 https://openalex.org/W1947610836 FALSE TRUE           green    http:… NA     
#> 11 https://openalex.org/W3168388584 FALSE TRUE           green    http:… NA     
#> 12 https://openalex.org/W2325295993 FALSE TRUE           green    https… NA     
#> 13 https://openalex.org/W4235001864 FALSE TRUE           green    https… NA     
#> # … with abbreviated variable name ¹​oa_status

Created on 2023-07-21 with reprex v2.0.2
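
With the flattened columns, picking out works that are closed at the primary location but open elsewhere becomes a plain row filter. A minimal sketch, using the column names from the output above on made-up rows (the `works` data frame and `open_elsewhere` are illustrative only):

```r
# Hypothetical flattened result with the columns discussed in this thread.
works <- data.frame(
  id = c("W1", "W2", "W3"),
  is_oa = c(FALSE, TRUE, FALSE),
  is_oa_anywhere = c(TRUE, TRUE, FALSE),
  oa_url = c("https://example.org/a.pdf", "https://example.org/b.pdf", NA),
  stringsAsFactors = FALSE
)

# Keep works whose primary location is not open but which have an
# open copy somewhere else; oa_url then points at that copy.
open_elsewhere <- works[!works$is_oa & works$is_oa_anywhere, ]
open_elsewhere$id
```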

yjunechoe commented 1 year ago

this is the "best" URL, so we're expecting length-1

Oh this is neat!

That's it from me then - everything looks good!