Closed psforscher closed 5 months ago
Another possibility is that PubMed fetches the journals correctly but the continent is often misidentified (in part because of #15; e.g., economics journals might have more diverse authors, whose varied affiliations are harder to map to a continent). This could be investigated systematically by checking each journal directly on the PubMed website. That said, the count table (your second screenshot) already shows that continent identification is not the only problem, since there are few papers even before filtering for continent. It could also be that both issues combine to make the main problem worse.
Possibly useful: https://library.mskcc.org/blog/2015/12/confirming-that-a-journal-is-indexed-in-medline-andor-pubmed/
There are essentially three statuses that a journal can have in PubMed:
- Every article in the journal is indexed in the Medline database (ie. Index Medicus) and PubMed. [...]
- Every article published in the journal will be indexed in PubMed but not in Medline. [...]
- Only select articles from the journal that have been deposited in PubMed Central (PMC) in order to comply with the NIH Public Access Policy will appear in PubMed. The journal is otherwise not officially indexed in Medline nor PubMed. This implies the least potential for visibility as an article published in this journal will have limited viewership [...]
Perhaps with some journals we are dealing with case no. 3.
That would make sense. Psychology is closer to medicine than is economics.
For reference, it might be possible to get more coverage through the Crossref API: https://cran.r-project.org/web/packages/rcrossref/rcrossref.pdf
I can't find country-of-origin in the Crossref API documentation. It does look like there are some good bibliometric tools in the rOpenSci project (but I'm not sure any of them are quite what we need) https://ropensci.org/packages/literature/
hmm, the `bibliometrix` package seems like it could be promising for integrating other scholarly databases. it seems to allow access to author country-of-origin; see attached screenshot. not sure if it is an exact match to our purposes, but it could be worth a closer look.
(from https://www.bibliometrix.org/vignettes/Introduction_to_bibliometrix.html)
Thanks for the great suggestions from the rOpenSci project! I have now given some thought and testing to most of them. There are many interesting alternative options indeed. Let's do a quick overview.
- `europepmc`: Relies on PubMed data, so probably the same issue as currently.
- `tidypmc`: Relies on PubMed data, so probably the same issue as currently.
- `bibliometrix`: Definitely looks interesting, but it seems to rely on manually downloading the data from the main databases, so it would be hard to reconcile with our goal of automatically updating the dashboard every week.
- `refsplitr`: Designed to parse affiliation addresses, so a very interesting tool relying on Web of Science data, but it seems the data has to be downloaded manually, so not a complete solution.
- `rcrossref`: Very good, but does not provide country information, so this has to be deduced from the affiliation information (as we currently do for `easyPubMed`).
- `openalexR`: This looks like our best option, as it is the only one I know of that provides country information and all the relevant papers.

Here is a comparison of data for a couple of our low-count journals:
``` r
data.frame(engines, n_articles, affiliation, city, country) %>%
  kable()
```

| engines | n_articles | affiliation | city | country |
|---|---|---|---|---|
| pubmedDashboard | 37 | yes | no | no |
| rcrossref | 1000 | yes | no | no |
| openalexR | 5620 | yes | no | yes |
``` r
pubmedDashboard %>%
  count(pubmedDashboard = journal) %>%
  kable()
```

| pubmedDashboard | n |
|---|---|
| Collabra. Psychology | 16 |
| Journal of African Economies | 9 |
| Review of African Political Economy | 8 |
| Review of International Political Economy | 4 |
``` r
# rcrossref seems to have no ability to do an exact match :/
rcrossref %>%
  count(rcrossref = container.title) %>%
  kable()
# because of API limits, I could only ask for one journal here (Collabra)
```

| rcrossref | n |
|---|---|
| Collabra | 29 |
| Collabra: Psychology | 487 |
| Psychology | 439 |
| Psychology (Psychology Revivals) | 18 |
| Psychology. | 24 |
| Supplemental Information 1: Collabra Peer Review | 3 |
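Since the fuzzy title search returns neighbouring titles such as "Psychology" and "Collabra", one workaround is to filter the results after the fact. Below is a minimal sketch in base R, not the package's actual approach: `results` is a toy stand-in for the `rcrossref` output, and `normalise_title()` is a hypothetical helper introduced only for this illustration.

``` r
# Toy stand-in for rcrossref results; titles are taken from the table above.
results <- data.frame(
  container.title = c("Collabra", "Collabra: Psychology", "Psychology",
                      "Psychology.", "Collabra: Psychology"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case, turn punctuation into spaces,
# squeeze repeated whitespace, and trim the ends.
normalise_title <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Journal name as it appears in our own list (PubMed-style punctuation).
target <- "Collabra. Psychology"

# Keep only rows whose normalised title equals the normalised target.
keep <- normalise_title(results$container.title) == normalise_title(target)
exact <- results[keep, , drop = FALSE]

nrow(exact)
```

Normalising both sides means "Collabra. Psychology" and "Collabra: Psychology" are treated as the same journal, while "Collabra" and "Psychology." are dropped.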
``` r
openalexR %>%
  count(openalexR = so) %>%
  kable()
```

| openalexR | n |
|---|---|
| Collabra. Psychology | 487 |
| Journal of African economies | 1097 |
| Review of African political economy | 2601 |
| Review of international political economy | 1435 |

``` r
# openalexR ends up giving the same number for Collabra as rcrossref
# This is encouraging!
```
Created on 2024-05-12 with reprex v2.1.0
What is OpenAlex?
OpenAlex: The open catalog to the global research system
- OpenAlex is a free and open catalog of the global research system. It's named after the ancient Library of Alexandria and made by the nonprofit OurResearch.
- We index over 250M scholarly works from 250k sources, with extra coverage of humanities, non-English languages, and the Global South.
I don't think I will have time to change our whole infrastructure in time for SIPS in June (#33)... But perhaps after that?
This is very good progress though: not only does OpenAlex cover many more journals, but by providing the country directly it also saves us a lot of processing time parsing the affiliations and reduces the risk of errors and missing values, since their parsing system is surely more reliable than the one I have tried to develop so far. This is a great move forward. Of course, new issues may arise as I explore the package further and integrate it with the rest of our package. If we go this way, I will probably have to rename the `pubmedDashboard` package 😅
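To illustrate why a ready-made country field simplifies things, here is a rough sketch in base R of going from country code to continent with a plain lookup, with no affiliation parsing involved. The data are made up, `country_code` is an assumed column name rather than the confirmed `openalexR` output, and the lookup covers only three codes for illustration.

``` r
# Toy works table; `country_code` is an assumed column holding
# ISO 3166-1 alpha-2 codes (NA when the country is unknown).
works <- data.frame(
  id = c("w1", "w2", "w3", "w4"),
  country_code = c("US", "KE", "DE", NA),
  stringsAsFactors = FALSE
)

# Tiny hand-written lookup, for illustration only.
continent_lookup <- c(US = "North America", KE = "Africa", DE = "Europe")

# Named-vector indexing maps each code to its continent; unknown or
# missing codes come back as NA instead of raising an error.
works$continent <- unname(continent_lookup[works$country_code])

# Count papers per continent, keeping missing values visible.
table(works$continent, useNA = "ifany")
```

Keeping the `NA` rows in the count makes it easy to see how many papers would still need a fallback such as affiliation parsing.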
wow, this is great! well done 🤩
indeed, it's almost certainly too ambitious to change everything in time for June, and counter to our agreement from our last meeting that you should work at your own pace. however, i think this is a very high priority for after June -- maybe even top priority -- as this seems to solve a lot of database issues (though of course we don't know yet if it will raise new issues itself)
Thanks! Yeah, it is very encouraging, a really nice find. So far it seems like all of the old data from PubMed can still be accessed or recovered through OpenAlex. Plus, there is much more info, which opens the door to other visualisations in the future: info about last author, middle authors, and corresponding author, plus information about open access (which could be useful for Sakshi's paper? For example, we could filter for open access only and make a separate dashboard).
Massive open index of scholarly papers launches
The [OpenAlex] database, which launched on 3 January [2022], is a replacement for Microsoft Academic Graph (MAG), a free alternative to subscription-based platforms such as Scopus, Dimensions and Web of Science that was discontinued at the end of 2021.
In response to MAG’s closure, non-profit scholarly services firm OurResearch in Vancouver, Canada, created OpenAlex, using part of a US$4.5-million grant from London-based charity Arcadia Fund.
“It’s just pulling lots of databases together in a clever way,” says Euan Adie, founder of Overton, a London-based firm that tracks the research cited in policy documents.
OpenAlex draws its data from MAG’s existing records and from other sources including Wikidata identifiers, ORCID, Crossref and ROR, says Jason Priem, co-founder of OurResearch.
(I opened another issue to explore problems with OpenAlex, but it is mostly technical, #46)
Now that we have implemented the transition to OpenAlex, we can see from the new data that the field of economics (and general journals) is a lot more representative of the global south than psychology, though mostly in the sense of being less US-centric (as, I think, you expected once we got access to good data). That is an entirely new area of discussion. You will see that the stats and graphs for all fields have changed accordingly.
As mentioned in issue #31, the over-time graph for economics has some odd patterns. These seem to be caused by PubMed's poor coverage of economics journals. For example, as shown in the table for the economics graph, some years in the graph are based on only a few articles.
This can also be seen in the "Journal count" table given on the methods page; the queries to `easyPubMed` seem to consistently retrieve a relatively small number of articles per journal.

We should investigate why the queries to PubMed (through `easyPubMed`) retrieve so few articles.
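As a starting point for that investigation, the sparse years could be flagged explicitly. This is a minimal sketch in base R on made-up data; the `articles` table, its column names, and the `min_articles` cut-off are all assumptions for illustration, not values from the dashboard.

``` r
# Toy data: one row per retrieved article (values are made up).
articles <- data.frame(
  journal = "Journal of African Economies",
  year = c(2019, 2019, 2020, 2021, 2021, 2021, 2021, 2021, 2021)
)

# Hypothetical cut-off below which a year is considered too sparse to plot.
min_articles <- 5

# Count articles per year, then flag the years under the cut-off.
per_year <- as.data.frame(table(year = articles$year), responseName = "n")
per_year$sparse <- per_year$n < min_articles

per_year
```

Years flagged as `sparse` could then be greyed out or dropped from the over-time graph until coverage improves.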