rempsyc / busara_dashboard

The Missing Majority in Behavioural Science Dashboard
https://remi-theriault.com/dashboards/missing_majority

Investigation of reasons for lack of economics journal coverage in queries to PubMed #34

Closed psforscher closed 3 months ago

psforscher commented 5 months ago

As mentioned in issue #31, the over-time graph for economics has some odd patterns. These seem to be caused by PubMed's poor coverage of economics journals. For example, as shown in the table for the economics graph, some years in the graph are based on only a few articles.

Screenshot 2024-04-11 at 17 34 09

This can also be seen in the "Journal count" table given on the methods page; the queries to easyPubMed seem to consistently retrieve a relatively small number of articles per journal.

Screenshot 2024-04-11 at 17 28 03

We should investigate why the queries to PubMed (through easyPubMed) retrieve so few articles.

rempsyc commented 5 months ago

Another possibility is that PubMed returns the journals' articles correctly but the continent is often not identified correctly (partly because of #15: economics journals may have more diverse authors, whose affiliations are harder to map to a continent). This could be investigated systematically by checking each journal directly on the PubMed website. That said, the count table (your second screenshot) already shows that continent identification is not the only problem, since there are few papers even before filtering for continent. The two issues could well be compounding each other.
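One way to make that per-journal check systematic is to build one PubMed query per journal and compare the hit counts. The sketch below only constructs the query strings, reusing the `[Journal]` and `[Date - Publication]` field tags that appear in the pubmedDashboard log later in this thread. `build_pubmed_query()` is a hypothetical helper, not part of pubmedDashboard or easyPubMed.

```r
# Hypothetical helper (not from pubmedDashboard or easyPubMed): build one
# PubMed query string per journal so that each journal can be checked alone.
build_pubmed_query <- function(journal, year_low, year_high) {
  sprintf(
    '("%s" [Journal]) AND ("%d/01/01" [Date - Publication] : "%d/12/31" [Date - Publication])',
    journal, year_low, year_high
  )
}

journals <- c("Review of African Political Economy",
              "Journal of African Economies")

# One query per journal; the year range mirrors the one used in this thread.
queries <- vapply(journals, build_pubmed_query, character(1),
                  year_low = 1980, year_high = 2030)
print(unname(queries))
```

Each string can be pasted into the PubMed search box, or passed to whichever PubMed client is in use, to see how many records a single journal returns.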

rempsyc commented 5 months ago

Possibly useful: https://library.mskcc.org/blog/2015/12/confirming-that-a-journal-is-indexed-in-medline-andor-pubmed/

There are essentially three statuses that a journal can have in PubMed:

  • Every article in the journal is indexed in the Medline database (ie. Index Medicus) and PubMed. [...]
  • Every article published in the journal will be indexed in PubMed but not in Medline. [...]
  • Only select articles from the journal that have been deposited in PubMed Central (PMC) in order to comply with the NIH Public Access Policy will appear in PubMed. The journal is otherwise not officially indexed in Medline nor PubMed. This implies the least potential for visibility as an article published in this journal will have limited viewership [...]

Perhaps with some journals we are dealing with case 3.

psforscher commented 5 months ago

That would make sense. Psychology is closer to medicine than is economics.

For reference, it might be possible to get more coverage through the Crossref API: https://cran.r-project.org/web/packages/rcrossref/rcrossref.pdf

psforscher commented 5 months ago

I can't find country-of-origin in the Crossref API documentation. It does look like there are some good bibliometric tools in the rOpenSci project (but I'm not sure any of them are quite what we need) https://ropensci.org/packages/literature/

psforscher commented 4 months ago

hmm, the bibliometrix package seems like it could be promising for integrating other scholarly databases. it seems to allow access to author country-of-origin; see attached screenshot. not sure if it is an exact match to our purposes, but it could be worth a closer look.

Screenshot 2024-04-18 at 16 10 05

(from https://www.bibliometrix.org/vignettes/Introduction_to_bibliometrix.html)

rempsyc commented 4 months ago

Thanks for the great suggestions from the rOpenSci project! I have now given some thought and testing to most of them. There are many interesting alternative options indeed. Let's do a quick overview.

  1. europepmc: Relies on PubMed data so probably same issue as currently.
  2. tidypmc: Relies on PubMed data so probably same issue as currently.
  3. bibliometrix: definitely looks interesting, but it seems to rely on manually downloading the data from the main databases, so it would be hard to reconcile with our goal of automatically updating the dashboard every week.
  4. refsplitr: designed to parse affiliation addresses, so a very interesting tool relying on Web of Science data, but it seems the data has to be downloaded manually, so not a complete solution.
  5. rcrossref: very good, but does not provide country information, so this has to be deduced from the affiliation information (like currently for easyPubMed).
  6. openalexR: This looks like our best option, as it is the only one I know of that provides both country information and all the relevant papers.

rempsyc commented 4 months ago

Here is a comparison of data for a couple of our low-count journals:

Preparation code
``` r
# Systematic comparison
library(dplyr) # %>%, select(), count()
library(knitr) # kable()

journals <- c("Collabra. Psychology",
              "Review of African Political Economy",
              "Review of International Political Economy",
              "Journal of African Economies")

# pubmedDashboard / easyPubMed
library(pubmedDashboard)

# API_TOKEN_PUBMED is not defined in this snippet; it must exist beforehand
save_process_pubmed_batch(
  journal = journals,
  year_low = 1980,
  year_high = 2030,
  api_key = API_TOKEN_PUBMED,
  data_folder = "test")
#> pubmed_query_string =
#> "Collabra. Psychology" [Journal] OR "Review of African Political Economy" [Journal] OR "Review of International Political Economy" [Journal] OR "Journal of African Economies" [Journal] AND ("1980/01/01" [Date - Publication] : "2030/12/31" [Date - Publication])
#> 1/5 - Downloading PubMed data... [8:17:14 PM]
#> [1] "PubMed data batch 1 / 1 downloaded..."
#> 2/5 - Converting XLM files to dataframe... [8:17:18 PM]
#> 3/5 - Extracting affiliations... [8:17:37 PM]
#> 4/5 - Matching universities to countries... [8:17:37 PM]
#> 5/5 - Identifying countries and continents... [8:17:45 PM]
#> Operation successfully completed. Congratulations! [8:17:53 PM]
#> File saved in test/articles_1980_2030.rds

pubmedDashboard <- read_bind_all_data(data_folder = "test")
#> (0 duplicates removed)

pubmedDashboard <- clean_journals_continents(pubmedDashboard)

# Glimpse variables
names(pubmedDashboard)
#>  [1] "journal"          "year"             "country_code"     "country"
#>  [5] "region"           "continent"        "university"       "university_old"
#>  [9] "department"       "address"          "lastname"         "firstname"
#> [13] "month"            "day"              "jabbrv"           "title"
#> [17] "doi"              "pmid"             "abstract"         "date"
#> [21] "original_journal" "field"            "first_Year"       "last_year"
#> [25] "year_range"

# rcrossref
library(rcrossref)

# The max limit is 1000 records retrieved per call
# So doesn't work well for us since Collabra data is excluded
# Let's choose Collabra alone for comparison
rcrossref <- cr_works(
  flq = c(`query.container-title` = "Collabra. Psychology"),
  limit = 1000)$data

# Glimpse variables
names(rcrossref)
#>  [1] "container.title"        "created"                "deposited"
#>  [4] "published.print"        "published.online"       "doi"
#>  [7] "indexed"                "issn"                   "issue"
#> [10] "issued"                 "member"                 "prefix"
#> [13] "publisher"              "score"                  "source"
#> [16] "reference.count"        "references.count"       "is.referenced.by.count"
#> [19] "title"                  "type"                   "url"
#> [22] "volume"                 "abstract"               "language"
#> [25] "author"                 "link"                   "license"
#> [28] "reference"              "funder"                 "isbn"
#> [31] "page"                   "short.container.title"  "update.policy"
#> [34] "subtitle"               "alternative.id"

# Pull affiliation
rcrossref$author[[1]] %>%
  select(contains("affiliation"))
#> # A tibble: 2 × 2
#>   affiliation1.name affiliation2.name
#>
#> 1 Experimental-Clinical and Health Psychology 1 , Ghent University, Ghent, Belg…
#> 2 Experimental-Clinical and Health Psychology 1 , Ghent University, Ghent, Belg…

# openalexR
library(openalexR)

sources <- oa_fetch(
  entity = "sources",
  display_name.search = journals
)

openalexR <- oa_fetch(
  entity = "works",
  journal = sources$id
)

# Glimpse variables
names(openalexR)
#>  [1] "id"                          "title"
#>  [3] "display_name"                "author"
#>  [5] "ab"                          "publication_date"
#>  [7] "so"                          "so_id"
#>  [9] "host_organization"           "issn_l"
#> [11] "url"                         "pdf_url"
#> [13] "license"                     "version"
#> [15] "first_page"                  "last_page"
#> [17] "volume"                      "issue"
#> [19] "is_oa"                       "is_oa_anywhere"
#> [21] "oa_status"                   "oa_url"
#> [23] "any_repository_has_fulltext" "language"
#> [25] "grants"                      "cited_by_count"
#> [27] "counts_by_year"              "publication_year"
#> [29] "cited_by_api_url"            "ids"
#> [31] "doi"                         "type"
#> [33] "referenced_works"            "related_works"
#> [35] "is_paratext"                 "is_retracted"
#> [37] "concepts"                    "topics"

# Pull affiliation
openalexR$author[[2]]$au_affiliation_raw
#> [1] "Maxwell Graduate School of Citizenship and Public Affairs , Syracuse University ,"

# Pull university
openalexR$author[[2]]$institution_display_name
#> [1] "Syracuse University"

# Pull country
openalexR$author[[2]]$institution_country_code
#> [1] "US"

# Final
engines <- c("pubmedDashboard", "rcrossref", "openalexR")
n_articles <- c(nrow(pubmedDashboard), nrow(rcrossref), nrow(openalexR))
affiliation <- "yes"
city <- "no"
country <- c("no", "no", "yes")
```

``` r
data.frame(engines, n_articles, affiliation, city, country) %>%
  kable()
```

| engines         | n_articles | affiliation | city | country |
|:----------------|-----------:|:------------|:-----|:--------|
| pubmedDashboard |         37 | yes         | no   | no      |
| rcrossref       |       1000 | yes         | no   | no      |
| openalexR       |       5620 | yes         | no   | yes     |

``` r
pubmedDashboard %>%
  count(pubmedDashboard = journal) %>%
  kable()
```

| pubmedDashboard                           |   n |
|:------------------------------------------|----:|
| Collabra. Psychology                      |  16 |
| Journal of African Economies              |   9 |
| Review of African Political Economy       |   8 |
| Review of International Political Economy |   4 |

``` r
# rcrossref seems to have no ability to do an exact match :/
rcrossref %>%
  count(rcrossref = container.title) %>%
  kable()
# because of API limits, I could only ask for one journal here (Collabra)
```

| rcrossref                                        |   n |
|:-------------------------------------------------|----:|
| Collabra                                         |  29 |
| Collabra: Psychology                             | 487 |
| Psychology                                       | 439 |
| Psychology (Psychology Revivals)                 |  18 |
| Psychology.                                      |  24 |
| Supplemental Information 1: Collabra Peer Review |   3 |

``` r
openalexR %>%
  count(openalexR = so) %>%
  kable()
```

| openalexR                                 |    n |
|:------------------------------------------|-----:|
| Collabra. Psychology                      |  487 |
| Journal of African economies              | 1097 |
| Review of African political economy       | 2601 |
| Review of international political economy | 1435 |

``` r
# openalexR ends up giving the same number for Collabra as rcrossref
# This is encouraging!
```
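Since the rcrossref query matches container titles loosely, one workaround is to filter the returned rows for an exact title afterwards. The sketch below uses a toy stand-in for the `container.title` column, rebuilt from the counts in the rcrossref table above; it does not call the Crossref API, and `rcrossref_toy` is a made-up name. Note that Crossref spells the journal "Collabra: Psychology", whereas PubMed and OpenAlex use "Collabra. Psychology".

```r
# Toy stand-in for the `container.title` column of the cr_works() result,
# rebuilt from the counts reported in the rcrossref table above.
container_title <- rep(
  c("Collabra", "Collabra: Psychology", "Psychology",
    "Psychology (Psychology Revivals)", "Psychology.",
    "Supplemental Information 1: Collabra Peer Review"),
  times = c(29, 487, 439, 18, 24, 3)
)
rcrossref_toy <- data.frame(container.title = container_title,
                            stringsAsFactors = FALSE)

# Keep only the rows whose container title matches the target exactly.
target <- "Collabra: Psychology"
exact <- rcrossref_toy[rcrossref_toy$container.title == target, , drop = FALSE]

nrow(rcrossref_toy) # all rows returned by the loose match
nrow(exact)         # rows left after the exact-title filter
```

Filtering after the fact only helps when the relevant records fit within the per-call limit, which is why the comparison above was restricted to a single journal.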

Created on 2024-05-12 with reprex v2.1.0
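To put the gap in perspective, the per-journal counts reported in the reprex above can be turned into a rough "share of OpenAlex records also found through PubMed". This is only a sketch using the numbers from the two tables; the data frames are typed in by hand, and the lower-cased join key is an assumption needed because the two sources capitalise journal names differently.

```r
# Counts copied from the pubmedDashboard and openalexR tables above.
pubmed <- data.frame(
  journal = c("Collabra. Psychology", "Journal of African Economies",
              "Review of African Political Economy",
              "Review of International Political Economy"),
  n_pubmed = c(16, 9, 8, 4),
  stringsAsFactors = FALSE
)
openalex <- data.frame(
  journal = c("Collabra. Psychology", "Journal of African economies",
              "Review of African political economy",
              "Review of international political economy"),
  n_openalex = c(487, 1097, 2601, 1435),
  stringsAsFactors = FALSE
)

# Capitalisation differs between sources, so join on a lower-cased key.
pubmed$key <- tolower(pubmed$journal)
openalex$key <- tolower(openalex$journal)
coverage <- merge(pubmed[c("key", "n_pubmed")],
                  openalex[c("key", "n_openalex")], by = "key")

# Percentage of OpenAlex records that the PubMed query also returned.
coverage$pct_in_pubmed <- round(100 * coverage$n_pubmed / coverage$n_openalex, 1)
print(coverage)
```

For every journal in this comparison the PubMed query returns only a few percent of what OpenAlex lists, which is consistent with the partial-indexing explanation discussed earlier in the thread.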

rempsyc commented 4 months ago

What is OpenAlex?

OpenAlex: The open catalog to the global research system

  • OpenAlex is a free and open catalog of the global research system. It's named after the ancient Library of Alexandria and made by the nonprofit OurResearch.
  • We index over 250M scholarly works from 250k sources, with extra coverage of humanities, non-English languages, and the Global South.

I don't think I will have time to change our whole infrastructure in time for SIPS in June (#33)... But perhaps after that?

This is very good progress, though. Not only does OpenAlex cover many more journals; by providing the country directly, it also saves us a lot of processing time parsing affiliations and should reduce errors and missing values, since their parsing system is very likely more reliable than the one I have tried to develop so far. This is a great move forward. Of course, new issues may arise as I explore the package further and integrate it with the rest of our package. If we go this way, I will probably have to rename the pubmedDashboard package 😅
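As a rough illustration of why a ready-made country code simplifies things: once each record carries a code like the `institution_country_code` value shown in the reprex, the continent becomes a plain lookup instead of an affiliation-parsing problem. The sketch below is an assumption-laden toy; `continent_lookup` is a hand-written, deliberately incomplete table, not the mapping used by the dashboard.

```r
# Illustrative lookup only: a real table would cover every ISO 3166-1
# alpha-2 code. This is NOT the dashboard's actual mapping.
continent_lookup <- c(
  US = "North America", CA = "North America",
  BE = "Europe",        GB = "Europe",
  KE = "Africa",        ZA = "Africa",
  IN = "Asia"
)

# Toy first-author country codes, in the two-letter format shown in the
# reprex above; NA stands for a record with no country information.
country_code <- c("US", "BE", "KE", NA, "ZA", "US")

# Named-vector indexing does the mapping; missing codes stay NA.
continent <- unname(continent_lookup[country_code])
print(table(continent, useNA = "ifany"))
```

Records without a country code remain `NA` rather than being guessed, so missing values stay visible in the counts.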

psforscher commented 4 months ago

wow, this is great! well done 🤩

indeed, it's almost certainly too ambitious to change everything in time for June, and counter to our agreement from our last meeting that you should work at your own pace. however, i think this is a very high priority for after June -- maybe even top priority -- as this seems to solve a lot of database issues (though of course we don't know yet if it will raise new issues itself)

rempsyc commented 4 months ago

Thanks! Yeah, it is very encouraging, a really nice find. So far it seems like all of the old data from PubMed can still be accessed or recovered through OpenAlex. Plus, there is much more info, which opens the door to other visualisations in the future: info about last, middle, and corresponding authors, as well as open-access status (which could be useful for Sakshi's paper? For example, we could filter for open-access articles only and make a separate dashboard).
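A minimal sketch of what that could look like, assuming a works table shaped like the openalexR output in the reprex (an `is_oa` flag plus a list-column `author` whose elements hold `institution_country_code`). The data here are invented toy values and `first_last()` is a hypothetical helper.

```r
# Toy works table mimicking the columns seen in the reprex; values are made up.
works <- data.frame(id = c("W1", "W2", "W3"),
                    is_oa = c(TRUE, FALSE, TRUE),
                    stringsAsFactors = FALSE)
works$author <- list(
  data.frame(institution_country_code = c("US", "KE"), stringsAsFactors = FALSE),
  data.frame(institution_country_code = "GB", stringsAsFactors = FALSE),
  data.frame(institution_country_code = c("ZA", "US", "IN"), stringsAsFactors = FALSE)
)

# Hypothetical helper: country code of the first and of the last author.
first_last <- function(a) {
  codes <- a$institution_country_code
  c(first = codes[1], last = codes[length(codes)])
}

positions <- t(vapply(works$author, first_last,
                      c(first = NA_character_, last = NA_character_)))
works$first_country <- unname(positions[, "first"])
works$last_country <- unname(positions[, "last"])

# Keep open-access works only, e.g. as input for a separate dashboard.
oa_only <- works[works$is_oa, c("id", "first_country", "last_country")]
print(oa_only)
```

For single-author papers the first and last author coincide, so any first-versus-last comparison would need to decide how to treat them.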

rempsyc commented 4 months ago

Massive open index of scholarly papers launches

The [OpenAlex] database, which launched on 3 January [2022], is a replacement for Microsoft Academic Graph (MAG), a free alternative to subscription-based platforms such as Scopus, Dimensions and Web of Science that was discontinued at the end of 2021.

In response to MAG’s closure, non-profit scholarly services firm OurResearch in Vancouver, Canada, created OpenAlex, using part of a US$4.5-million grant from London-based charity Arcadia Fund.

“It’s just pulling lots of databases together in a clever way,” says Euan Adie, founder of Overton, a London-based firm that tracks the research cited in policy documents.

OpenAlex draws its data from MAG’s existing records and from other sources including Wikidata identifiers, ORCID, Crossref and ROR, says Jason Priem, co-founder of OurResearch.

rempsyc commented 3 months ago

(I opened another issue to explore problems with OpenAlex, but it is mostly technical, #46)

rempsyc commented 3 months ago

Now that we have implemented the transition to OpenAlex, we can see from the new data that the field of economics (and general journals) is a lot more representative of the Global South than psychology, though mostly in the sense of being less US-centric (as, I think, you expected once we got access to good data). That is an entirely new area of discussion. You will see that the stats and graphs for all fields have changed accordingly.