ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Clarifying oa_fetch() implementation: searching for phrases v single words; stability; and concise code. #251

Closed DebsKing closed 3 weeks ago

DebsKing commented 1 month ago

Hello.

Thank you for the brilliant package. I have three questions:

  1. In the example search string below, I search for "Biomed" OR "Biomed engineering" (as a mock example). Having run quality checks on the results, I am not convinced that it is treating 'biomed engineering' as a phrase, rather than individual words. Is my coding incorrect?

  2. When I repeat a given search string on the same day, it returns an identical number of publications. But when I repeat the search on consecutive days, it returns a few more publications each time. One might expect a small number of historical publications to be added to the Open Alex database, but some of my searches that go from 2019-2023 are returning 6% more publications when I run the code this month, compared to last month. I would like to clarify if this is due to on-going changes to the database, or the function oa_fetch(), or my code implementation.
    I did see some discussion on stability between the Open Alex database and R package https://github.com/ropensci/openalexR/issues/247.

  3. I want to search for words and phrases in the title OR abstract. I currently run this in two code chunks. Can I combine these for efficient code?

Thank you again for the package, and for any help or guidance. It is really appreciated! Deborah

My r code:

# Following a recent issue with the package, I now install via GitHub.
remotes::install_github("ropensci/openalexR")
packageVersion("openalexR") # 1.3.1
library(openalexR)

1. Search based on title

works_title <- oa_fetch(
  entity = "works",
  title.search = c("Biomed", "Biomed engineering"), # mock example
  from_publication_date = "2019-01-01",
  to_publication_date = "2022-12-31", # mock example
  cited_by_count = ">1",
  options = list(sort = "cited_by_count:desc"),
  verbose = TRUE
)

2. Search based on abstract

works_abstract <- oa_fetch(
  entity = "works",
  abstract.search = c("Biomed", "Biomed engineering"),
  from_publication_date = "2019-01-01",
  to_publication_date = "2022-12-31",
  cited_by_count = ">1",
  options = list(sort = "cited_by_count:desc"),
  verbose = TRUE
)

3. Quality checks:

library(dplyr) # provides count(), filter(), bind_rows() and %>% used below

# Are there duplicates within a dataframe?
count(works_abstract[duplicated(works_abstract$id), ]) # no
count(works_title[duplicated(works_title$id), ]) # no

# Are there duplicates across the 'title' and 'abstract' dataframes?
common_publications <- intersect(works_title$id, works_abstract$id)
length(common_publications) # yes, as one would expect.

4. Combine abstract and title dataframes:

# Filter rows in works_title where id is not in works_abstract
works_title_filtered <- works_title %>%
  filter(!(id %in% works_abstract$id))

# Combine the original works_abstract with the filtered works_title
works_combined <- bind_rows(works_abstract, works_title_filtered)

# Check no duplicates
count(works_combined[duplicated(works_combined$id), ])

5. put into bibliometrix format

works_combined <- oa2bibliometrix(works_combined)
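
The combine-and-deduplicate steps above can also be tried out on mock data without calling the API. The following is only a sketch in base R: the two data frames and their ids are made up, and real oa_fetch() output contains list columns, for which bind_rows() as used above may well be the safer choice.

```r
# Mock stand-ins for works_title and works_abstract (ids and titles are made up).
works_title <- data.frame(
  id = c("W1", "W2", "W3"),
  display_name = c("Paper A", "Paper B", "Paper C"),
  stringsAsFactors = FALSE
)
works_abstract <- data.frame(
  id = c("W2", "W3", "W4"),
  display_name = c("Paper B", "Paper C", "Paper D"),
  stringsAsFactors = FALSE
)

# How many works appear in both result sets?
n_common <- length(intersect(works_title$id, works_abstract$id))

# Stack both result sets, then keep only the first occurrence of each id.
stacked <- rbind(works_abstract, works_title)
works_combined <- stacked[!duplicated(stacked$id), ]

# Check that no duplicate ids remain (anyDuplicated() returns 0 when there are none).
stopifnot(anyDuplicated(works_combined$id) == 0)
```

Stacking first and dropping duplicated ids afterwards gives the same set of works as filtering one frame against the other before binding.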

rkrug commented 1 month ago

Hi Deborah

Am also a very happy user of openalexR and I use it daily for title and abstract searches, for long search terms which include individual words and terms combined by OR.

My comments are inline

> Hello. Thank you for the brilliant package.

Cannot agree more!

> I have three questions:
>
> In the example search string below, I search for "Biomed" OR "Biomed engineering" (as a mock example). Having run quality checks on the results, I am not convinced that it is treating 'biomed engineering' as a phrase, rather than individual words. Is my coding incorrect?

When you search for "Biomed" OR "Biomed engineering", the result is all results from "Biomed" and all results from the search for "Biomed engineering". In other words, the second set is contained in the first one, so it is redundant and you should get the same results as when searching for "Biomed" only.

When you search in OpenAlex for 'X Y' (without the inverted commas), it automatically assumes that there is an AND between the terms. This is also true when you look at the API call your command is issuing:

https://api.openalex.org/works?filter=title.search%3ABiomed%7CBiomed%20engineering%2Cfrom_publication_date%3A2019-01-01%2Cto_publication_date%3A2022-12-31%2Ccited_by_count%3A%3E1&sort=cited_by_count%3Adesc

You see the term Biomed%7CBiomed%20engineering%2C, which has a %7C, which is the escaped hex code for "|", which stands for an AND. So your search is "Biomed" AND "Biomed engineering", which is only "Biomed engineering".

Therefore you have to use "Biomed" OR "Biomed engineering" as the search term.
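
To see exactly what is sent to the API, the percent-encoded filter from the URL above can be decoded with base R. This is only a sketch: the string is copied (shortened) from the URL quoted above, and it shows the raw filter text only. How the API combines terms separated by "|" is best checked against the OpenAlex documentation.

```r
# The start of the filter part of the request URL quoted above.
encoded <- "title.search%3ABiomed%7CBiomed%20engineering%2Cfrom_publication_date%3A2019-01-01"

# utils::URLdecode() turns percent-escapes (%3A, %7C, %20, %2C) back into characters.
decoded <- utils::URLdecode(encoded)
cat(decoded, "\n")

# Split the filter list on commas and take the title.search filter.
filters <- strsplit(decoded, ",", fixed = TRUE)[[1]]
title_filter <- sub("^title\\.search:", "", filters[1])

# Split the title filter on the "|" separator to list the individual terms.
terms <- strsplit(title_filter, "|", fixed = TRUE)[[1]]
print(terms)
```

Note that "Biomed engineering" reaches the API without inverted commas around it, which matters for the phrase-versus-words question.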

Also, I have never used a vector of length greater than one for a search string, and if I had, I would have expected either an OR, or even a vectorised version returning two results (but this is a different discussion).

> When I repeat a given search string on the same day, it returns an identical number of publications. But when I repeat the search on consecutive days, it returns a few more publications each time. One might expect a small number of historical publications to be added to the Open Alex database, but some of my searches that go from 2019-2023 are returning 6% more publications when I run the code this month, compared to last month. I would like to clarify if this is due to on-going changes to the database, or the function oa_fetch(), or my code implementation. I did see some discussion on stability between the Open Alex database and R package #247 https://github.com/ropensci/openalexR/issues/247.

OpenAlex is growing and continuously ingesting sources. So if new works (and I use 'works' on purpose here, as they also include datasets and not only articles) appear in any of the sources, they will be added. So an increase is to be expected. I usually download the results of a search on OpenAlex and store them as an element in a list, where the second element is the timestamp of when the OpenAlex access took place.

> I want to search for words and phrases in the title OR abstract. I currently run this in two code chunks. Can I combine these for efficient code?

Yes - I do this regularly. You have to use title_and_abstract.search to do this:

openalexR::oa_fetch(
  title_and_abstract.search = 'Biomed OR "Biomed engineering"',
  output = "list",
  verbose = TRUE
)

One other point to consider is stemming. One example where stemming is misleading is the search for "Researcher" in title and abstract. This returns, unexpectedly, the same results as "Research". One has to be aware of this when building search queries. Stemming is also applied inside the inverted commas ("Researcher bias" returns the same results as "research bias").

Cheers,

Rainer
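
The habit described above, of storing search results in a list together with the time of access, could look roughly like this. It is a sketch only: fetch_with_timestamp() and mock_fetch() are hypothetical names, and the stub stands in for a real call such as openalexR::oa_fetch().

```r
# Hypothetical helper: run a search function and keep the access time with the results.
fetch_with_timestamp <- function(fetch_fun, ...) {
  list(
    results = fetch_fun(...), # first element: whatever the search returned
    accessed = Sys.time()     # second element: when the access took place
  )
}

# Stub standing in for a real search call (made-up ids, no network access).
mock_fetch <- function(...) data.frame(id = c("W1", "W2"))

snapshot <- fetch_with_timestamp(mock_fetch)

# Saving the snapshot allows later runs to be compared against this one.
snapshot_file <- tempfile(fileext = ".rds")
saveRDS(snapshot, snapshot_file)
restored <- readRDS(snapshot_file)
```

With dated snapshots, a month-on-month increase in the number of results can be traced to specific added ids, for example with setdiff() on the id columns of two snapshots.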



DebsKing commented 1 month ago

Thanks for your time and help!

My search string above – “biomed” OR “biomed engineering” – was a bad example due to the word repetition. Apologies. A closer example to my search string is: "medicine" OR "biomed engineering". Importantly, "biomed engineering" needs to be treated as a single phrase and not as two words. Using inverted commas appears to fix this: ' medicine OR "biomed engineering" '
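
For longer term lists, the quoting described above can be automated. The helper below is a hypothetical sketch in base R (build_search_string() is not part of openalexR): it wraps every term containing whitespace in inverted commas and joins all terms with OR.

```r
# Hypothetical helper: quote multi-word terms so they are written as phrases,
# then join all terms with OR.
build_search_string <- function(terms) {
  is_phrase <- grepl("\\s", terms) # TRUE where a term contains whitespace
  quoted <- ifelse(is_phrase, paste0('"', terms, '"'), terms)
  paste(quoted, collapse = " OR ")
}

query <- build_search_string(c("medicine", "biomed engineering"))
cat(query, "\n")
# medicine OR "biomed engineering"
```

The resulting string can then be passed as a single search term, for example as the value of title_and_abstract.search in oa_fetch().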

Thank you for suggesting use of ‘title_and_abstract.search’. I want to search titles OR abstracts, and cannot implement ‘title_or_abstract.search’. Are you aware of any options to encode this, please? I cannot see any in the help / documentation.

Thank you for highlighting the stemming issue:

> One other point to consider is stemming. One example where stemming is misleading is the search for "Researcher" in title and abstract. This returns, unexpectedly, the same results as "Research". One has to be aware of this when building search queries. Stemming is also applied inside the inverted commas ("Researcher bias" returns the same results as "research bias").

This seems an important limitation, if it applies to all instances of: -er, -ing, -ies, etc. Is there any workaround?

Thanks again.

rkrug commented 1 month ago

> Thanks for your time and help!

Pleasure.

> My search string above – "biomed" OR "biomed engineering" – was a bad example due to the word repetition. Apologies. A closer example to my search string is: "medicine" OR "biomed engineering". Importantly, "biomed engineering" needs to be treated as a single phrase and not as two words. Using inverted commas appears to fix this: ' medicine OR "biomed engineering" '

Good.

> Thank you for suggesting use of 'title_and_abstract.search'. I want to search titles OR abstracts, and cannot implement 'title_or_abstract.search'. Are you aware of any options to encode this, please? I cannot see any in the help / documentation.

title_and_abstract.search searches the title and the abstract for the term, so if either has the term, the work will be returned. This is not the logical AND: it effectively searches in both abstract and title, and when one of them matches, it returns the work.

> Thank you for highlighting the stemming issue: One other point to consider is stemming. One example where stemming is misleading is the search for "Researcher" in title and abstract. This returns, unexpectedly, the same results as "Research". One has to be aware of this when building search queries. Stemming is also applied inside the inverted commas ("Researcher bias" returns the same results as "research bias"). This seems an important limitation, if it applies to all instances of: -er, -ing, -ies, etc. Is there any workaround?
>
> Thanks again.

Glad that I could help.