ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

topics in df? #213

Closed rkrug closed 7 months ago

rkrug commented 7 months ago

Hi

Is there a timeline for when topics will be included in works2df()? I am asking because I need the topics rather soon, and I would put together an additional fetch as a list if they will not be included in the next 1-2 weeks. Just asking to possibly save me some work.

Thanks

yjunechoe commented 7 months ago

It's on the radar but no set date for implementing this. I'll just note here (for my future self) that topics looks like it's got a pretty complicated nested list structure. We'll need to think through the best tidy-data representation for this.


oa_fetch(identifier = "W1964141474", output = "list")$topics
#> [[1]]
#> [[1]]$id
#> [1] "https://openalex.org/T10103"
#> 
#> [[1]]$display_name
#> [1] "Development of Reading Skills and Dyslexia"
#> 
#> [[1]]$score
#> [1] 1
#> 
#> [[1]]$subfield
#> [[1]]$subfield$id
#> [1] "https://openalex.org/subfields/3204"
#> 
#> [[1]]$subfield$display_name
#> [1] "Developmental and Educational Psychology"
#> 
#> 
#> [[1]]$field
#> [[1]]$field$id
#> [1] "https://openalex.org/fields/32"
#> 
#> [[1]]$field$display_name
#> [1] "Psychology"
#> 
#> 
#> [[1]]$domain
#> [[1]]$domain$id
#> [1] "https://openalex.org/domains/2"
#> 
#> [[1]]$domain$display_name
#> [1] "Social Sciences"
#> 
#> 
#> 
#> [[2]]
#> [[2]]$id
#> [1] "https://openalex.org/T11345"
#> 
#> [[2]]$display_name
#> [1] "Development of Numerical Cognition and Mathematics Abilities"
#> 
#> [[2]]$score
#> [1] 0.9989
#> 
#> [[2]]$subfield
#> [[2]]$subfield$id
#> [1] "https://openalex.org/subfields/2613"
#> 
#> [[2]]$subfield$display_name
#> [1] "Statistics and Probability"
#> 
#> 
#> [[2]]$field
#> [[2]]$field$id
#> [1] "https://openalex.org/fields/26"
#> 
#> [[2]]$field$display_name
#> [1] "Mathematics"
#> 
#> 
#> [[2]]$domain
#> [[2]]$domain$id
#> [1] "https://openalex.org/domains/3"
#> 
#> [[2]]$domain$display_name
#> [1] "Physical Sciences"
#> 
#> 
#> 
#> [[3]]
#> [[3]]$id
#> [1] "https://openalex.org/T13106"
#> 
#> [[3]]$display_name
#> [1] "Neuroscience and Education: Bridging Research and Practice"
#> 
#> [[3]]$score
#> [1] 0.9905
#> 
#> [[3]]$subfield
#> [[3]]$subfield$id
#> [1] "https://openalex.org/subfields/2805"
#> 
#> [[3]]$subfield$display_name
#> [1] "Cognitive Neuroscience"
#> 
#> 
#> [[3]]$field
#> [[3]]$field$id
#> [1] "https://openalex.org/fields/28"
#> 
#> [[3]]$field$display_name
#> [1] "Neuroscience"
#> 
#> 
#> [[3]]$domain
#> [[3]]$domain$id
#> [1] "https://openalex.org/domains/1"
#> 
#> [[3]]$domain$display_name
#> [1] "Life Sciences"
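
For anyone reading along without API access, here is a minimal sketch of the same structure as a hand-built list, with values copied from the first two topics printed above. It only illustrates where the scalar entries (id, display_name, score) and the nested entities (subfield, field, domain) sit; the variable names are made up for this example.

```r
# Hand-built stand-in for the first two topics printed above (no API call).
topics <- list(
  list(
    id = "https://openalex.org/T10103",
    display_name = "Development of Reading Skills and Dyslexia",
    score = 1,
    subfield = list(id = "https://openalex.org/subfields/3204",
                    display_name = "Developmental and Educational Psychology"),
    field = list(id = "https://openalex.org/fields/32",
                 display_name = "Psychology"),
    domain = list(id = "https://openalex.org/domains/2",
                  display_name = "Social Sciences")
  ),
  list(
    id = "https://openalex.org/T11345",
    display_name = "Development of Numerical Cognition and Mathematics Abilities",
    score = 0.9989,
    subfield = list(id = "https://openalex.org/subfields/2613",
                    display_name = "Statistics and Probability"),
    field = list(id = "https://openalex.org/fields/26",
                 display_name = "Mathematics"),
    domain = list(id = "https://openalex.org/domains/3",
                  display_name = "Physical Sciences")
  )
)

# Scalar entries sit at the top level of each topic ...
topic_names <- vapply(topics, function(x) x$display_name, character(1))
scores <- vapply(topics, function(x) x$score, numeric(1))

# ... while subfield, field and domain are lists with their own id and display_name.
field_names <- vapply(topics, function(x) x$field$display_name, character(1))

# Flag which entries of a topic are nested entities rather than scalars.
is_nested <- vapply(topics[[1]], is.list, logical(1))
```

The mix of scalar and nested entries at the same level is what makes a single tidy representation awkward: the topic's own id and display_name have to be lifted to the same shape as the three nested entities before everything can be stacked.
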
yjunechoe commented 7 months ago

Pardon the messy code, but this is one proposal for tidying the topics list-of-lists. There are roughly two options here, a wide one and a long one:

library(openalexR)
library(tidyverse)
topics <- oa_fetch(identifier = "W1964141474", output = "list")$topics

wide <- lapply(topics, function(x) {
  c(
    list(topic = x[c("id", "display_name")]),
    x["subfield"], x["field"], x["domain"])
}) %>% 
  enframe(name = "i") %>% 
  unnest_longer(value) %>% 
  pivot_wider(id_cols = i, names_from = value_id, values_from = value) %>% 
  unnest(cols = c(topic, subfield, field, domain)) %>% 
  unnest(cols = c(topic, subfield, field, domain)) %>% 
  mutate(value = c("id", "display_name"), .by = i, .after = 1)
wide
#> # A tibble: 6 × 6
#>       i value        topic                                 subfield field domain
#>   <int> <chr>        <chr>                                 <chr>    <chr> <chr> 
#> 1     1 id           https://openalex.org/T10103           https:/… http… https…
#> 2     1 display_name Development of Reading Skills and Dy… Develop… Psyc… Socia…
#> 3     2 id           https://openalex.org/T11345           https:/… http… https…
#> 4     2 display_name Development of Numerical Cognition a… Statist… Math… Physi…
#> 5     3 id           https://openalex.org/T13106           https:/… http… https…
#> 6     3 display_name Neuroscience and Education: Bridging… Cogniti… Neur… Life …

long <- wide %>% 
  pivot_longer(cols = topic:domain, values_to = "x") %>% 
  rename(field = value, value = x)
long
#> # A tibble: 24 × 4
#>        i field        name     value                                     
#>    <int> <chr>        <chr>    <chr>                                     
#>  1     1 id           topic    https://openalex.org/T10103               
#>  2     1 id           subfield https://openalex.org/subfields/3204       
#>  3     1 id           field    https://openalex.org/fields/32            
#>  4     1 id           domain   https://openalex.org/domains/2            
#>  5     1 display_name topic    Development of Reading Skills and Dyslexia
#>  6     1 display_name subfield Developmental and Educational Psychology  
#>  7     1 display_name field    Psychology                                
#>  8     1 display_name domain   Social Sciences                           
#>  9     2 id           topic    https://openalex.org/T11345               
#> 10     2 id           subfield https://openalex.org/subfields/2613       
#> # ℹ 14 more rows
trangdata commented 7 months ago

Thank you for putting this together @yjunechoe. The nested structure is quite cumbersome indeed! If I may add another option for unnesting this, following your code (I find it a bit more intuitive to have id and display_name as columns right next to each other):

library(openalexR)
library(tidyverse)
topics <- oa_fetch(identifier = "W1964141474", output = "list")$topics

wide <- lapply(topics, function(x) {
  c(
    list(topic = x[c("id", "display_name")]),
    x["subfield"], x["field"], x["domain"])
}) %>% 
  enframe(name = "i") %>% 
  unnest_longer(value) %>% 
  pivot_wider(id_cols = i, names_from = value_id, values_from = value) %>% 
  unnest(cols = c(topic, subfield, field, domain)) %>% 
  unnest(cols = c(topic, subfield, field, domain)) %>% 
  mutate(value = c("id", "display_name"), .by = i, .after = 1)

long <- wide %>% 
  pivot_longer(cols = topic:domain, values_to = "x") %>% 
  rename(field = value, value = x)

long %>%
  pivot_wider(
    names_from = field, 
    values_from = value, 
    id_cols = c(i, name)
  )
#> # A tibble: 12 × 4
#>        i name     id                                  display_name              
#>    <int> <chr>    <chr>                               <chr>                     
#>  1     1 topic    https://openalex.org/T10103         Development of Reading Sk…
#>  2     1 subfield https://openalex.org/subfields/3204 Developmental and Educati…
#>  3     1 field    https://openalex.org/fields/32      Psychology                
#>  4     1 domain   https://openalex.org/domains/2      Social Sciences           
#>  5     2 topic    https://openalex.org/T11345         Development of Numerical …
#>  6     2 subfield https://openalex.org/subfields/2613 Statistics and Probability
#>  7     2 field    https://openalex.org/fields/26      Mathematics               
#>  8     2 domain   https://openalex.org/domains/3      Physical Sciences         
#>  9     3 topic    https://openalex.org/T13106         Neuroscience and Educatio…
#> 10     3 subfield https://openalex.org/subfields/2805 Cognitive Neuroscience    
#> 11     3 field    https://openalex.org/fields/28      Neuroscience              
#> 12     3 domain   https://openalex.org/domains/1      Life Sciences

Created on 2024-02-27 with reprex v2.0.2

yjunechoe commented 7 months ago

Thanks @trangdata - I like this a lot! Also I realize that I lost the score information in the processing. So for absolute completeness:

long %>%
  pivot_wider(
    names_from = field, 
    values_from = value, 
    id_cols = c(i, name)
  ) %>% 
  mutate(
    score = sapply(topics, `[[`, "score")[i],
    .after = 1
  )
#> # A tibble: 12 × 5
#>        i score name     id                                  display_name                             
#>    <int> <dbl> <chr>    <chr>                               <chr>                                    
#>  1     1 1     topic    https://openalex.org/T10103         Development of Reading Skills and Dyslex…
#>  2     1 1     subfield https://openalex.org/subfields/3204 Developmental and Educational Psychology 
#>  3     1 1     field    https://openalex.org/fields/32      Psychology                               
#>  4     1 1     domain   https://openalex.org/domains/2      Social Sciences                          
#>  5     2 0.999 topic    https://openalex.org/T11345         Development of Numerical Cognition and M…
#>  6     2 0.999 subfield https://openalex.org/subfields/2613 Statistics and Probability               
#>  7     2 0.999 field    https://openalex.org/fields/26      Mathematics                              
#>  8     2 0.999 domain   https://openalex.org/domains/3      Physical Sciences                        
#>  9     3 0.990 topic    https://openalex.org/T13106         Neuroscience and Education: Bridging Res…
#> 10     3 0.990 subfield https://openalex.org/subfields/2805 Cognitive Neuroscience                   
#> 11     3 0.990 field    https://openalex.org/fields/28      Neuroscience                             
#> 12     3 0.990 domain   https://openalex.org/domains/1      Life Sciences

Now to write this in base R 😅

rkrug commented 7 months ago

"It's on the radar but no set date for implementing this."

Thanks a lot.

trangdata commented 7 months ago

Base feels a little clunky but maybe not too bad?

library(openalexR)
paper <- oa_fetch(identifier = "W1964141474", output = "list")
topics_ls <- list()
for (i in seq_along(paper$topics)){
  topic <- paper$topics[[i]]
  # relevel the nested structure, then combine
  relev <- c(list(topic = topic[c("id", "display_name")]), tail(topic, -3))
  topics_ls[[i]] <- cbind(
    i = i, 
    score = topic$score,
    tibble::rownames_to_column(openalexR:::subs_na(relev, "rbind_df")[[1]], "name")
  )
}
topics <- do.call(rbind.data.frame, topics_ls)
topics
#>    i  score     name                                  id
#> 1  1 1.0000    topic         https://openalex.org/T10103
#> 2  1 1.0000 subfield https://openalex.org/subfields/3204
#> 3  1 1.0000    field      https://openalex.org/fields/32
#> 4  1 1.0000   domain      https://openalex.org/domains/2
#> 5  2 0.9989    topic         https://openalex.org/T11345
#> 6  2 0.9989 subfield https://openalex.org/subfields/2613
#> 7  2 0.9989    field      https://openalex.org/fields/26
#> 8  2 0.9989   domain      https://openalex.org/domains/3
#> 9  3 0.9905    topic         https://openalex.org/T13106
#> 10 3 0.9905 subfield https://openalex.org/subfields/2805
#> 11 3 0.9905    field      https://openalex.org/fields/28
#> 12 3 0.9905   domain      https://openalex.org/domains/1
#>                                                    display_name
#> 1                    Development of Reading Skills and Dyslexia
#> 2                      Developmental and Educational Psychology
#> 3                                                    Psychology
#> 4                                               Social Sciences
#> 5  Development of Numerical Cognition and Mathematics Abilities
#> 6                                    Statistics and Probability
#> 7                                                   Mathematics
#> 8                                             Physical Sciences
#> 9    Neuroscience and Education: Bridging Research and Practice
#> 10                                       Cognitive Neuroscience
#> 11                                                 Neuroscience
#> 12                                                Life Sciences

Created on 2024-02-27 with reprex v2.0.2

yjunechoe commented 7 months ago

Looks good! I tweaked it slightly and wrapped it into a function. Shall we move this to a PR?

library(openalexR)

process_paper_topics <- function(paper) {
  topics <- paper$topics
  topics_ls <- lapply(seq_along(topics), function(i) {
    topic <- topics[[i]]
    relev <- c(
      # Hoist fields for the topic entity
      list(topic = topic[c("id", "display_name")]),
      # Keep info about other entities as-is
      Filter(is.list, topic)
    )
    relev_df <- openalexR:::subs_na(relev, "rbind_df")[[1]]
    relev_df <- tibble::rownames_to_column(relev_df, "name")
    cbind(i = i, score = topic$score, relev_df)
  })
  topics_df <- do.call(rbind.data.frame, topics_ls)
  tibble::as_tibble(topics_df)
}

papers <- oa_fetch(identifier = c("W1964141474", "W2741809807"), output = "list")
lapply(papers, process_paper_topics)
#> [[1]]
#> # A tibble: 12 × 5
#>        i score name     id                                  display_name        
#>    <int> <dbl> <chr>    <chr>                               <chr>               
#>  1     1 0.997 topic    https://openalex.org/T10102         Bibliometric Analys…
#>  2     1 0.997 subfield https://openalex.org/subfields/1804 Statistics, Probabi…
#>  3     1 0.997 field    https://openalex.org/fields/18      Decision Sciences   
#>  4     1 0.997 domain   https://openalex.org/domains/2      Social Sciences     
#>  5     2 0.981 topic    https://openalex.org/T13607         Preprints in Scient…
#>  6     2 0.981 subfield https://openalex.org/subfields/1802 Information Systems…
#>  7     2 0.981 field    https://openalex.org/fields/18      Decision Sciences   
#>  8     2 0.981 domain   https://openalex.org/domains/2      Social Sciences     
#>  9     3 0.918 topic    https://openalex.org/T11937         Data Sharing and St…
#> 10     3 0.918 subfield https://openalex.org/subfields/1710 Information Systems 
#> 11     3 0.918 field    https://openalex.org/fields/17      Computer Science    
#> 12     3 0.918 domain   https://openalex.org/domains/3      Physical Sciences   
#> 
#> [[2]]
#> # A tibble: 12 × 5
#>        i score name     id                                  display_name        
#>    <int> <dbl> <chr>    <chr>                               <chr>               
#>  1     1 1     topic    https://openalex.org/T10103         Development of Read…
#>  2     1 1     subfield https://openalex.org/subfields/3204 Developmental and E…
#>  3     1 1     field    https://openalex.org/fields/32      Psychology          
#>  4     1 1     domain   https://openalex.org/domains/2      Social Sciences     
#>  5     2 0.999 topic    https://openalex.org/T11345         Development of Nume…
#>  6     2 0.999 subfield https://openalex.org/subfields/2613 Statistics and Prob…
#>  7     2 0.999 field    https://openalex.org/fields/26      Mathematics         
#>  8     2 0.999 domain   https://openalex.org/domains/3      Physical Sciences   
#>  9     3 0.990 topic    https://openalex.org/T13106         Neuroscience and Ed…
#> 10     3 0.990 subfield https://openalex.org/subfields/2805 Cognitive Neuroscie…
#> 11     3 0.990 field    https://openalex.org/fields/28      Neuroscience        
#> 12     3 0.990 domain   https://openalex.org/domains/1      Life Sciences
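
Since lapply() returns one table per paper, a possible next step is to stack them into a single table keyed by work. The sketch below is base R only and runs on hand-built input, so it does not use openalexR:::subs_na() or tibble. The names flatten_topics, make_topic, entity and the work_id column are hypothetical, it assumes each work in the list output carries an id entry, and the second work is a placeholder record, not a real one.

```r
# Small constructors for the hand-built example data (hypothetical helpers).
entity <- function(id, display_name) list(id = id, display_name = display_name)
make_topic <- function(id, display_name, score, subfield, field, domain) {
  list(id = id, display_name = display_name, score = score,
       subfield = subfield, field = field, domain = domain)
}

# Hypothetical base R variant: flatten one work's topics into a long data frame.
flatten_topics <- function(paper) {
  rows <- lapply(seq_along(paper$topics), function(i) {
    topic <- paper$topics[[i]]
    # Treat the topic itself as one more entity next to subfield/field/domain
    entities <- c(
      list(topic = topic[c("id", "display_name")]),
      Filter(is.list, topic)
    )
    data.frame(
      work_id = paper$id,  # assumes the work's id is present in the list output
      i = i,
      score = topic$score,
      name = names(entities),
      id = vapply(entities, function(e) e$id, character(1), USE.NAMES = FALSE),
      display_name = vapply(entities, function(e) e$display_name, character(1),
                            USE.NAMES = FALSE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

papers <- list(
  list(
    id = "https://openalex.org/W1964141474",
    topics = list(make_topic(
      "https://openalex.org/T10103", "Development of Reading Skills and Dyslexia", 1,
      entity("https://openalex.org/subfields/3204", "Developmental and Educational Psychology"),
      entity("https://openalex.org/fields/32", "Psychology"),
      entity("https://openalex.org/domains/2", "Social Sciences")
    ))
  ),
  list(
    id = "https://openalex.org/W0000000000",  # placeholder work, not a real record
    topics = list(make_topic(
      "https://openalex.org/T11345",
      "Development of Numerical Cognition and Mathematics Abilities", 0.9989,
      entity("https://openalex.org/subfields/2613", "Statistics and Probability"),
      entity("https://openalex.org/fields/26", "Mathematics"),
      entity("https://openalex.org/domains/3", "Physical Sciences")
    ))
  )
)

# One row per (work, topic, entity): 2 works x 1 topic x 4 entities
all_topics <- do.call(rbind, lapply(papers, flatten_topics))
```

Carrying work_id in every row keeps the stacked table joinable back to the works data frame, at the cost of repeating the id and score for each of the four entities.
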
trangdata commented 7 months ago

@yjunechoe looks great! This is a lot cleaner! Thank you!! Do you want to make a PR for this? I can as well, just curious how this would interfere with your existing PRs.

yjunechoe commented 7 months ago

I can do it! Will ping you in the PR when it's ready

trangdata commented 7 months ago

@yjunechoe Awesome! 🙏🏽 Oh and feel free to refactor if you see anything that needs it, or we can do it in a separate PR. A process_paper_authors function would make it more modular for example.
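
As a rough idea of what such a helper might look like, here is a hypothetical sketch in the same style as process_paper_topics(). It is not the openalexR implementation: the column names are chosen for illustration, the input is a hand-built list with placeholder authors, and it assumes each authorship has an author_position entry and a nested author list with id, display_name and orcid.

```r
# Hypothetical sketch; not the openalexR implementation.
process_paper_authors <- function(paper) {
  # Missing (NULL) entries become NA so that every row has the same columns
  null_to_na <- function(x) if (is.null(x)) NA_character_ else x
  rows <- lapply(paper$authorships, function(a) {
    data.frame(
      au_id = null_to_na(a$author$id),
      au_display_name = null_to_na(a$author$display_name),
      au_orcid = null_to_na(a$author$orcid),
      author_position = null_to_na(a$author_position),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Hand-built input with placeholder values (not real records)
paper <- list(
  authorships = list(
    list(
      author_position = "first",
      author = list(id = "https://openalex.org/A0000000001",
                    display_name = "Placeholder Author One",
                    orcid = NULL)
    ),
    list(
      author_position = "last",
      author = list(id = "https://openalex.org/A0000000002",
                    display_name = "Placeholder Author Two",
                    orcid = "https://orcid.org/0000-0000-0000-0000")
    )
  )
)

authors <- process_paper_authors(paper)
```

Note that this sketch returns NULL for a paper with no authorships, so a real version would need to decide what an empty result should look like.
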