ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

oa_fetch does not parse fields into tibbles #301

Open giuliogcantone opened 3 days ago

giuliogcantone commented 3 days ago

oa_fetch( entity = "topic", id = "https://openalex.org/fields/11" )

gives a lexical error.

yjunechoe commented 3 days ago

fields is not one of the OpenAlex entities that we currently support:

oa_entities()
#> [1] "works"        "authors"      "institutions" "concepts"     "keywords"     "funders"     
#> [7] "sources"      "publishers"   "topics"

Could you describe your use case? For the time being, you can get the contents of that field with:

httr::content(httr::GET("https://api.openalex.org/fields/11"))

Some dev notes (do you have thoughts, @trangdata?)

I'm actually a bit confused about how much fields stands on its own as an entity - I don't see it as an entry in the API docs, but it does appear as a property of Topic objects: https://docs.openalex.org/api-entities/topics/topic-object#field

There's a diagram in their classification whitepaper explaining more

[image: diagram from the OpenAlex classification whitepaper]

yjunechoe commented 3 days ago

After looking into fields a bit more, my impression is that it's not interesting in and of itself, but it may be useful for finding topics.

If that's your use case, you can use field as a filter on a search for topics:

oa_fetch("topics", field.id = 11)
#> # A tibble: 235 × 16
#>    id           display_name description keywords ids   subfield_id subfield_display_name field_id field_display_name
#>    <chr>        <chr>        <chr>       <list>   <lis> <chr>       <chr>                 <chr>    <chr>             
#>  1 https://ope… Evolution a… This clust… <chr>    <chr> https://op… Ecology, Evolution, … https:/… Agricultural and …
#>  2 https://ope… Diversity a… This clust… <chr>    <chr> https://op… Plant Science         https:/… Agricultural and …
#>  3 https://ope… Impact of P… This clust… <chr>    <chr> https://op… Ecology, Evolution, … https:/… Agricultural and …
#>  4 https://ope… Physiology … This clust… <chr>    <chr> https://op… Plant Science         https:/… Agricultural and …
#>  5 https://ope… Animal Nutr… This clust… <chr>    <chr> https://op… Animal Science and Z… https:/… Agricultural and …
#>  6 https://ope… Genetic and… This clust… <chr>    <chr> https://op… Plant Science         https:/… Agricultural and …
#>  7 https://ope… Factors Aff… This clust… <chr>    <chr> https://op… Animal Science and Z… https:/… Agricultural and …
#>  8 https://ope… Vascular Fl… This clust… <chr>    <chr> https://op… Plant Science         https:/… Agricultural and …
#>  9 https://ope… Metabolism … This clust… <chr>    <chr> https://op… Aquatic Science       https:/… Agricultural and …
#> 10 https://ope… Viral RNA S… This clust… <chr>    <chr> https://op… Plant Science         https:/… Agricultural and …
#> # ℹ 225 more rows
#> # ℹ 7 more variables: domain_id <chr>, domain_display_name <chr>, siblings <list>, works_count <int>,
#> #   cited_by_count <int>, updated_date <chr>, created_date <chr>

giuliogcantone commented 3 days ago

"domains", "fields", and "subfields" are the hierarchical levels above topics. Currently openalexR only fetches the lowest level, topics (arguably the least interesting one). Fields are not a different kind of entity from topics, but their structure is different, so they may still need to be coded as separate entities.

You could probably code them as topics fairly easily, just leaving the columns that do not apply as NA.
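
A minimal sketch of that idea in base R: stack a field-level record onto a topic-level table and let the columns that only exist for topics become NA. The helper `bind_fill_na()` and the two toy data frames are hypothetical and not part of {openalexR}; only the column names, IDs, and display names are taken from output shown elsewhere in this thread.

```r
# Hypothetical helper (not part of {openalexR}): row-bind two data frames,
# filling columns that are missing from either one with NA.
bind_fill_na <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- rep(NA, nrow(a))
  for (col in setdiff(all_cols, names(b))) b[[col]] <- rep(NA, nrow(b))
  rbind(a[all_cols], b[all_cols])
}

# Toy topic-level row; column names follow the oa_fetch("topics") output.
topics <- data.frame(
  id = "https://openalex.org/T10263",
  display_name = "Eating Disorders and Body Image Concerns",
  subfield_id = "https://openalex.org/subfields/3203",
  field_id = "https://openalex.org/fields/32",
  stringsAsFactors = FALSE
)

# Toy field-level row: a field has no parent subfield or field of its own.
fields <- data.frame(
  id = "https://openalex.org/fields/32",
  display_name = "Psychology",
  stringsAsFactors = FALSE
)

# The field row keeps id and display_name; the topic-only columns are NA.
combined <- bind_fill_na(topics, fields)
```

Whether a shared table like this is preferable to separate entities is a design choice; the sketch only shows that the NA-padding itself is cheap.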

yjunechoe commented 3 days ago

We'd be happy to consider adding support for these higher levels, but I also want to take this as an opportunity to learn more about the use case. Could you let me know what you plan to do with the information in the fields object once you have it in a data frame?

giuliogcantone commented 2 days ago

> We'd be happy to consider adding support for these higher levels, but I also want to take this as an opportunity to learn more about the use case. Could you let me know what you plan to do with the information in the fields object once you have it in a data frame?

The field level coincides with the 26 disciplinary areas of Scopus, which is remarkable in itself since it maps two different sources onto each other. In addition, high-level topics connect works to careers better. An author is better described by "He is a dentist" than by "He has published on remedies against caries", so in general, when working with authors (instead of works), one wants to fetch information at a high level, since the low levels are often uninformative or vague for evaluating authors.

yjunechoe commented 2 days ago

> In addition, high-level topics connect better works with careers. Authors are more qualified by "He is a dentist" than "He has published on remedies against caries"; so in general working with authors (instead of works) one wants to fetch information at high level, since often low levels are uninformative or vague in the evaluation of authors.

I appreciate this, and I'm trying to translate research questions like that ("what career/field does author X work in?") into a workflow in code.

The crucial question is whether such workflows require {openalexR} to be able to directly query a higher-level object.

And here's my hesitation on that. For example, if I query an author, it comes with Topic information attached:

x <- oa_random("authors")
x$topics
#> [[1]]
#> # A tibble: 60 × 5
#>        i count id                                  display_name                                         type  
#>    <int> <int> <chr>                               <chr>                                                <chr> 
#>  1     1     4 https://openalex.org/T10263         Eating Disorders and Body Image Concerns             topic 
#>  2     1     4 https://openalex.org/subfields/3203 Clinical Psychology                                  subfi…
#>  3     1     4 https://openalex.org/fields/32      Psychology                                           field 
#>  4     1     4 https://openalex.org/domains/2      Social Sciences                                      domain
#>  5     2     3 https://openalex.org/T11123         Obsessive-Compulsive Disorder and Related Conditions topic 
#>  6     2     3 https://openalex.org/subfields/3203 Clinical Psychology                                  subfi…
#>  7     2     3 https://openalex.org/fields/32      Psychology                                           field 
#>  8     2     3 https://openalex.org/domains/2      Social Sciences                                      domain
#>  9     3     1 https://openalex.org/T10853         Cognitive Mechanisms of Anxiety and Depression       topic 
#> 10     3     1 https://openalex.org/subfields/3205 Experimental and Cognitive Psychology                subfi…
#> # ℹ 50 more rows

As you can see, {openalexR} (specifically, topics2df()) already breaks down topics such that their higher-level categorization also becomes available.

So at a quick glance, I can say something like "this author is a researcher in the Social Sciences who works in Psychology, specifically Clinical Psychology, studying various Disorders, especially Eating Disorders":

library(tidyverse)
x$topics[[1]] %>% 
  count(type, display_name, wt = count, name = "total_count", sort = TRUE) %>% 
  split(~ type)
#> $domain
#> # A tibble: 3 × 3
#>   type   display_name    total_count
#>   <chr>  <chr>                 <int>
#> 1 domain Social Sciences          14
#> 2 domain Health Sciences           5
#> 3 domain Life Sciences             1
#> 
#> $field
#> # A tibble: 6 × 3
#>   type  display_name                        total_count
#>   <chr> <chr>                                     <int>
#> 1 field Psychology                                   12
#> 2 field Medicine                                      4
#> 3 field Business, Management and Accounting           1
#> 4 field Neuroscience                                  1
#> 5 field Nursing                                       1
#> 6 field Social Sciences                               1
#> 
#> $subfield
#> # A tibble: 10 × 3
#>    type     display_name                                         total_count
#>    <chr>    <chr>                                                      <int>
#>  1 subfield Clinical Psychology                                           10
#>  2 subfield Psychiatry and Mental health                                   2
#>  3 subfield Applied Psychology                                             1
#>  4 subfield Cognitive Neuroscience                                         1
#>  5 subfield Experimental and Cognitive Psychology                          1
#>  6 subfield Marketing                                                      1
#>  7 subfield Nutrition and Dietetics                                        1
#>  8 subfield Pharmacology                                                   1
#>  9 subfield Public Health, Environmental and Occupational Health           1
#> 10 subfield Sociology and Political Science                                1
#> 
#> $topic
#> # A tibble: 15 × 3
#>    type  display_name                                                    total_count
#>    <chr> <chr>                                                                 <int>
#>  1 topic Eating Disorders and Body Image Concerns                                  4
#>  2 topic Obsessive-Compulsive Disorder and Related Conditions                      3
#>  3 topic Borderline Personality Disorder: Psychopathology and Treatment            1
#>  4 topic Cognitive Mechanisms of Anxiety and Depression                            1
#>  5 topic Epidemiology and Management of Sexual Dysfunction                         1
#>  6 topic Global Trends in Obesity and Overweight Research                          1
#>  7 topic Impact of Nutrition and Eating Habits on Health                           1
#>  8 topic Impact of Social Media on Well-being and Behavior                         1
#>  9 topic Influence of Appearance Management Behavior in Consumer Choices           1
#> 10 topic Interoception and Somatic Symptoms                                        1
#> 11 topic Molecular Mechanisms of Depression Treatment Strategies                   1
#> 12 topic Neurobiological Mechanisms of Placebo and Nocebo Effects                  1
#> 13 topic Pathological Gambling and Comorbid Disorders                              1
#> 14 topic Psychological Effects of Perfectionism                                    1
#> 15 topic Theories of Behavior Change and Self-Regulation                           1

If the analysis is serious about mapping an author's topics/subfields/fields/domains/etc., you can write functions that consume this data in various ways, e.g., a function to graph out an author's research areas. And as far as I can tell, this workflow doesn't require querying fields/subfields/domains directly, as you can get to that information via the topics object (even if topics itself isn't interesting). So I think I'm still looking for a good, solid use case for your feature request - am I missing anything here?
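
As one illustration of a function that consumes this data, here is a rough base R sketch that picks the most frequent category at each level. The helper `top_by_level()` is hypothetical and not part of {openalexR}; the toy input reuses the first eight rows of the x$topics[[1]] output above and assumes only the `count`, `display_name`, and `type` columns.

```r
# Hypothetical helper (not part of {openalexR}): for each level in `type`,
# return the display_name with the largest summed `count`.
top_by_level <- function(topics) {
  totals <- aggregate(count ~ type + display_name, data = topics, FUN = sum)
  by_level <- split(totals, totals$type)
  vapply(
    by_level,
    function(d) as.character(d$display_name[which.max(d$count)]),
    character(1)
  )
}

# Toy input: the first eight rows of the author's topics table shown above.
author_topics <- data.frame(
  count = c(4, 4, 4, 4, 3, 3, 3, 3),
  display_name = c(
    "Eating Disorders and Body Image Concerns",
    "Clinical Psychology", "Psychology", "Social Sciences",
    "Obsessive-Compulsive Disorder and Related Conditions",
    "Clinical Psychology", "Psychology", "Social Sciences"
  ),
  type = rep(c("topic", "subfield", "field", "domain"), 2),
  stringsAsFactors = FALSE
)

# Named character vector with one entry per level.
author_profile <- top_by_level(author_topics)
```

Applied to a full author record, the same helper would take x$topics[[1]] as its argument; ties are resolved by whichever category `which.max()` sees first.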