ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

`oa_fetch()`: Filter by journal #249

Closed: rempsyc closed this issue 1 month ago

rempsyc commented 1 month ago

I would like to do an openalex query for papers (works) while filtering for a list of specific journals. I can fetch the info for entity = sources with no problem:

library(openalexR)

journals <- c("Collabra. Psychology", "Personality & Social Psychology Bulletin")

sources <- oa_fetch(
  entity = "sources",
  display_name.search = journals
)

sources$display_name
#> [1] "Personality & social psychology bulletin"
#> [2] "Collabra. Psychology"

However, I am not able to filter by journal when using entity = "works". Unlike in the sources query above, display_name.search here searches the titles of the papers rather than the journal names. The best I can do is a general search, but it does not match exactly, and I have been unable to force an exact match.

works <- oa_fetch(
  entity = "works",
  search = "Collabra. Psychology"
)

works$so |> unique() |> head()
#> [1] "Collabra. Psychology"                                       
#> [2] NA                                                           
#> [3] "Social Science Research Network"                            
#> [4] "\u0098The \u009cSocial service review/Social service review"
#> [5] "American journal of sociology"                              
#> [6] "Nature"

The openalex documentation specifies:

Surrounding a phrase with quotation marks will search for an exact match of that phrase, after stemming and stop-word removal (be sure to use double quotation marks — “).

So we can try again with double quotes, but they seem to be ignored:

works <- oa_fetch(
  entity = "works",
  search = '"Collabra. Psychology"'
)

works$so |> unique() |> head()
#> [1] "Collabra. Psychology"                                       
#> [2] NA                                                           
#> [3] "Social Science Research Network"                            
#> [4] "\u0098The \u009cSocial service review/Social service review"
#> [5] "American journal of sociology"                              
#> [6] "Nature"

Even specifying journal= doesn’t seem to produce the expected result:

works <- oa_fetch(
  entity = "works",
  search = '"journal=Collabra. Psychology"'
)

works$so |> unique() |> head()
#> [1] "PloS one"    "Commonplace" NA

I do not find an explicit parameter to specify journal in the documented list of filters: https://docs.ropensci.org/openalexR/articles/Filters.html. The official openalex documentation does document a journal argument:

journal Value: the OpenAlex ID for a given source, where the source is type: journal Returns: works where the chosen source ID is the primary_location.source.

But I am not able to make it work:

works <- oa_fetch(
  entity = "works",
  journal.search = journals
)
#> Error: OpenAlex API request failed [403]
#> Invalid query parameters error.
#> <journal.search is not a valid field. Valid fields are underscore or hyphenated versions of: abstract.search, apc_list.currency, apc_list.provenance, apc_list.value, apc_list.value_usd, apc_paid.currency, apc_paid.provenance, apc_paid.value, apc_paid.value_usd, author.id, author.orcid, authors_count, authorships.author.id, authorships.author.orcid, authorships.countries, authorships.institutions.continent, authorships.institutions.country_code, authorships.institutions.id, authorships.institutions.is_global_south, authorships.institutions.lineage, authorships.institutions.ror, authorships.institutions.type, authorships.is_corresponding, best_oa_location.is_accepted, best_oa_location.is_oa, best_oa_location.is_published, best_oa_location.landing_page_url, best_oa_location.license, best_oa_location.source.host_organization, best_oa_location.source.host_organization_lineage, best_oa_location.source.id, best_oa_location.source.is_in_doaj, best_oa_location.source.is_oa, best_oa_location.source.issn, best_oa_location.source.type, best_oa_location.version, best_open_version, biblio.first_page, biblio.issue, biblio.last_page, biblio.volume, cited_by, cited_by_count, cited_by_percentile_year.max, cited_by_percentile_year.min, cites, concept.id, concepts.id, concepts.wikidata, concepts_count, corresponding_author_ids, corresponding_institution_ids, countries_distinct_count, default.search, display_name, display_name.search, doi, doi_starts_with, from_created_date, from_publication_date, fulltext.search, fulltext_origin, grants.award_id, grants.funder, has_abstract, has_doi, has_embeddings, has_fulltext, has_ngrams, has_oa_accepted_or_published_version, has_oa_submitted_version, has_old_authors, has_orcid, has_pdf_url, has_pmcid, has_pmid, has_raw_affiliation_strings, has_references, ids.mag, ids.openalex, ids.pmcid, ids.pmid, indexed_in, institution.id, institutions.continent, institutions.country_code, institutions.id, institutions.is_global_south, institutions.ror, 
institutions.type, institutions_distinct_count, is_corresponding, is_oa, is_paratext, is_retracted, journal, keyword.search, keywords.id, keywords.keyword, language, locations.is_accepted, locations.is_oa, locations.is_published, locations.landing_page_url, locations.license, locations.license_id, locations.source.has_issn, locations.source.host_institution_lineage, locations.source.host_organization, locations.source.host_organization_lineage, locations.source.id, locations.source.is_in_doaj, locations.source.is_oa, locations.source.issn, locations.source.publisher_lineage, locations.source.type, locations.version, locations_count, mag, mag_only, oa_status, open_access.any_repository_has_fulltext, open_access.is_oa, open_access.oa_status, openalex, openalex_id, pmcid, pmid, primary_location.is_accepted, primary_location.is_oa, primary_location.is_published, primary_location.landing_page_url, primary_location.license, primary_location.license_id, primary_location.source.has_issn, primary_location.source.host_institution_lineage, primary_location.source.host_organization, primary_location.source.host_organization_lineage, primary_location.source.id, primary_location.source.is_in_doaj, primary_location.source.is_oa, primary_location.source.issn, primary_location.source.publisher_lineage, primary_location.source.type, primary_location.version, primary_topic.domain.id, primary_topic.field.id, primary_topic.id, primary_topic.subfield.id, publication_date, publication_year, raw_affiliation_strings.search, referenced_works, referenced_works_count, related_to, repository, semantic.search, sustainable_development_goals.id, sustainable_development_goals.score, title.search, title_and_abstract.search, to_publication_date, to_updated_date, topics.domain.id, topics.field.id, topics.id, topics.subfield.id, topics_count, type, type_crossref, version>

works <- oa_fetch(
  entity = "works",
  journal = journals
)
#> Error: OpenAlex API request failed [403]
#> Invalid query parameters error.
#> <'Collabra. Psychology' is not a valid OpenAlex ID.>

So it seems that this argument expects an OpenAlex ID. This is a bit troublesome because the workaround then requires two steps: one to fetch the correct ID from a sources search, and another to run the actual works query.

works <- oa_fetch(
  entity = "works",
  journal = sources$id
)

works$so |> unique() |> head()
#> [1] "Personality & social psychology bulletin"
#> [2] "Collabra. Psychology"

Is there a way to do this in one step?

Created on 2024-05-12 with reprex v2.1.0

trangdata commented 1 month ago

Unfortunately, currently OpenAlex doesn't allow search for works by journal name. In the openalex documentation you linked to:

journal Value: the OpenAlex ID for a given source, where the source is type: journal Returns: works where the chosen source ID is the primary_location.source.

So yes, this argument needs to be an OpenAlex ID. Doing it in the two steps you have done is the best way to search in this case. Also, using primary_location.source.id as a filter is the same as journal. I prefer primary_location.source.id to guard against future changes made by OpenAlex.

library(openalexR)

journals <- c("Collabra. Psychology", "Personality & Social Psychology Bulletin")
sources <- oa_fetch(
  entity = "sources",
  display_name.search = journals
)
sources$display_name
#> [1] "Personality & social psychology bulletin"
#> [2] "Collabra. Psychology"
sources$id
#> [1] "https://openalex.org/S187348256"  "https://openalex.org/S4210175756"
works <- oa_fetch(
  entity = "works",
  # journal = sources$id,
  primary_location.source.id = sources$id,
  options = list(sample = 10)
)

works$so |> unique() |> head()
#> [1] "Personality & social psychology bulletin"
#> [2] "Collabra. Psychology"

Created on 2024-05-12 with reprex v2.0.2
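
For anyone who wants a single call in their own scripts, the two steps above can be wrapped in a small function. This is only a minimal sketch: `match_sources_exact()` and `fetch_works_by_journal()` are hypothetical helper names, not part of openalexR, and the case-insensitive exact match on `display_name` is an added assumption to guard against the sources search returning loosely related journals.

```r
# Hypothetical helpers (not part of openalexR): wrap the two-step lookup.

# Keep only sources whose display name equals one of the requested
# journal names, ignoring case and surrounding whitespace.
match_sources_exact <- function(sources, journals) {
  wanted <- tolower(trimws(journals))
  found <- tolower(trimws(sources$display_name))
  sources[found %in% wanted, , drop = FALSE]
}

# Step 1: resolve journal names to OpenAlex source IDs.
# Step 2: fetch works whose primary location is one of those sources.
# Extra arguments (e.g. options) are passed on to the works query.
fetch_works_by_journal <- function(journals, ...) {
  sources <- openalexR::oa_fetch(
    entity = "sources",
    display_name.search = journals
  )
  if (NROW(sources) > 0) {
    sources <- match_sources_exact(sources, journals)
  }
  if (NROW(sources) == 0) {
    stop("No source matched the requested journal names exactly.")
  }
  openalexR::oa_fetch(
    entity = "works",
    primary_location.source.id = sources$id,
    ...
  )
}
```

Called as `fetch_works_by_journal(journals, options = list(sample = 10))`, this still makes two API requests under the hood; it only hides the intermediate `sources$id` step.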

rempsyc commented 1 month ago

I see, interesting, thanks! Closing this issue then :)

trangdata commented 2 weeks ago

Adding OpenAlex's reasoning for future reference: https://docs.openalex.org/api-entities/works/search-works#why-cant-i-search-by-name-of-related-entity-author-name-institution-name-etc

Why can't you do this in just one step? Well, if you use the search term, "NYU," you might end up missing the ones that use the full name "New York University," rather than the initials. Sure, you could try to think of all possible variants and search for all of them, but you might miss some, and you risk putting in search terms that let in works that you're not interested in. Figuring out which works are actually associated with the "NYU" you're interested shouldn't be your responsibility—that's our job! We've done that work for you, so all the relevant works should be associated with one unique ID.
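
In the same spirit of filtering on a unique identifier rather than a name: the list of valid fields in the error message earlier in this thread includes `primary_location.source.issn`, so if the ISSNs of the journals are already known, a single works query may be enough. The sketch below assumes that filter behaves like the other source filters; `is_issn()` and `fetch_works_by_issn()` are hypothetical helper names, not part of openalexR, and the format check only looks at the shape of the string, not the check digit.

```r
# Hypothetical helpers (not part of openalexR).

# TRUE for strings shaped like an ISSN: four digits, a hyphen,
# three digits, then a digit or X as the final character.
is_issn <- function(x) {
  grepl("^[0-9]{4}-[0-9]{3}[0-9Xx]$", x)
}

# Fetch works whose primary location is a source with one of the given
# ISSNs. Extra arguments (e.g. options) are passed on to oa_fetch().
fetch_works_by_issn <- function(issns, ...) {
  bad <- issns[!is_issn(issns)]
  if (length(bad) > 0) {
    stop("Not ISSN-shaped: ", paste(bad, collapse = ", "))
  }
  openalexR::oa_fetch(
    entity = "works",
    primary_location.source.issn = issns,
    ...
  )
}
```

The ISSNs themselves have to come from a trusted place, for example the journal's own website or the result of the sources query shown above, so this avoids the second request only when those identifiers are already on hand.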