nasa-petal / petal-labeler-data-pipeline


Compare OpenAlex and SemanticScholar to see which one has better data quality. #23

Closed bruffridge closed 2 years ago

dsmith111 commented 2 years ago

Semantic Scholar

OpenAlex

Summary

Both APIs are sufficient for the pipeline. Most of the metadata is more relevant to deciding how to pad the data set with related papers. For the deployed GitHub pipeline, the rate limits of both services are acceptable. The only quality problem we have run into is Semantic Scholar returning the wrong abstract for a paper. I believe OpenAlex has slightly more relevant data, so I recommend it over Semantic Scholar.

Reviewers @hschilling @bruffridge

hschilling commented 2 years ago

Good research work!

Any sense of which service has better coverage of papers we are interested in?

Good point about "Has multiple paper ids", but I thought SS did too:

https://www.semanticscholar.org/product/api

dsmith111 commented 2 years ago

Ah, they weren't showing up because those example requests for the identifiers are using an older API. The newer one does show the external IDs. I'm currently running a script to find the coverage ratio for each of the APIs. I'll have the results commented here when finished.
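For reference, a minimal sketch of how a DOI lookup could be addressed on each service. The URL patterns below are my understanding of the public Semantic Scholar Graph API and the OpenAlex works endpoint, not something taken from the pipeline code; the helper names and the placeholder DOI are hypothetical.

```python
from urllib.parse import quote

# Base URLs as I understand the public APIs (assumption, not from the pipeline).
SS_GRAPH_BASE = "https://api.semanticscholar.org/graph/v1/paper"
OPENALEX_BASE = "https://api.openalex.org/works"


def semantic_scholar_url(doi: str) -> str:
    """Build a Graph API lookup URL that also requests the external IDs."""
    fields = "title,abstract,externalIds"
    return f"{SS_GRAPH_BASE}/DOI:{quote(doi, safe='/')}?fields={fields}"


def openalex_url(doi: str) -> str:
    """Build an OpenAlex works lookup URL keyed by DOI."""
    return f"{OPENALEX_BASE}/doi:{quote(doi, safe='/')}"


if __name__ == "__main__":
    example_doi = "10.1234/example"  # placeholder DOI, not a real paper
    print(semantic_scholar_url(example_doi))
    print(openalex_url(example_doi))
```

Note that, as far as I know, OpenAlex returns the abstract as an inverted index rather than plain text, so a comparison script would need to normalize both responses before checking them.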

dsmith111 commented 2 years ago

image

This seems really low for SS. The script was using the example/older API, and I think the API requests themselves were struggling in some cases (returning JSON responses with an early EOF?). I'm currently re-running it with the newer API; I'll share the hit rate for SS both as an absolute value and adjusted for any errors.

In any case, the issue with the JSON response would add unnecessary complexity to the pipeline for processing that can be avoided with OpenAlex.

dsmith111 commented 2 years ago

SS Actual Hit Ratio = 0.700938232994527
SS Excluding Errors Hit Ratio = 0.7013291634089132
OA Hit Ratio = 0.9667709147771697

I'll have to manually check some of these. Even with the new API and ignoring errors, nothing much changed.
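As a rough sketch of how the two SS figures relate, assuming each request is recorded as one of the hypothetical labels "hit", "miss", or "error": the actual ratio divides hits by all requests, while the error-excluded ratio divides hits by only the requests that returned a usable response. This is an illustration, not the script that produced the numbers above.

```python
from collections import Counter


def hit_ratios(outcomes):
    """Return (actual_ratio, ratio_excluding_errors) for a list of outcome labels.

    Each outcome is assumed to be "hit", "miss", or "error" (hypothetical labels).
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    hits = counts["hit"]
    answered = total - counts["error"]  # requests that returned parseable JSON

    actual = hits / total if total else 0.0
    excluding_errors = hits / answered if answered else 0.0
    return actual, excluding_errors


if __name__ == "__main__":
    sample = ["hit"] * 7 + ["miss"] * 2 + ["error"]
    print(hit_ratios(sample))
```

Because the two SS ratios differ only in the fourth decimal place, errors account for a very small share of the requests; nearly all of the gap to OpenAlex comes from genuine misses.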

dsmith111 commented 2 years ago

missed_oa_dois.txt missed_ss_dois.txt

I have verified the results. The hit ratio is based on the title and abstract: if an API cannot return both the title and the abstract of a document, that DOI is counted as a "miss". About 30% of the DOIs sent to Semantic Scholar came back without an abstract, without a title, or without either. I also cross-checked each DOI that one API missed against the response from the other API that claimed to have it.
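The hit criterion and the cross-check can be sketched roughly as follows. This assumes each API response has already been normalized to a dict with plain-text "title" and "abstract" keys (or None when the request failed); the function names and result keys are hypothetical, not the ones used in the actual script.

```python
def is_hit(record):
    """A record is a hit only if it has both a non-empty title and abstract."""
    if not record:
        return False
    title = (record.get("title") or "").strip()
    abstract = (record.get("abstract") or "").strip()
    return bool(title and abstract)


def cross_check(ss_results, oa_results):
    """Find DOIs that one API missed while the other registered a hit.

    Both arguments map DOI -> normalized record (or None).
    """
    ss_hits = {doi for doi, rec in ss_results.items() if is_hit(rec)}
    oa_hits = {doi for doi, rec in oa_results.items() if is_hit(rec)}
    all_dois = set(ss_results) | set(oa_results)
    return {
        "missed_by_ss_hit_by_oa": sorted((all_dois - ss_hits) & oa_hits),
        "missed_by_oa_hit_by_ss": sorted((all_dois - oa_hits) & ss_hits),
    }
```

The DOIs in each returned list are the ones worth inspecting by hand, since a "hit" only guarantees that both fields are non-empty, not that the abstract is genuine.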

For DOIs that OpenAlex missed but Semantic Scholar registered as a hit, the result was sometimes fine, but other times Semantic Scholar returned something that was not truly an abstract. For example:

"title": "The salvinia paradox: superhydrophobic surfaces with hydrophilic pins for air retention under water.", "abstract": "[*] Prof. W. Barthlott, S. Wiersch, Dr. H. F. Bohn Nees-Institut für Biodiversität der Pflanzen Rheinische Friedrich-Wilhelms-Universität Meckenheimer Allee 170, 53115 Bonn (Germany) E-mail: barthlott@uni-bonn.de Prof. Th. Schimmel, Dr. M. Barczewski, Dr. S. Walheim, A. Weis, A. Kaltenmaier Institute of Applied Physics and Center for Functional Nanostructures (CFN) University of Karlsruhe Karlsruhe Institute of Technology (KIT) 76131 Karlsruhe (Germany) Institute of Nanotechnology and Center for Functional Nanostructures (CFN) Forschungszentrum Karlsruhe Karlsruhe Institute of Technology (KIT) 76021 Karlsruhe (Germany) E-mail: thomas.schimmel@physik.uni-karlsruhe.de Prof. K. Koch Biologie und Nanobiotechnologie Hochschule Rhein-Waal Landwehr 4, 47533 Kleve (Germany)",

Given the 30% of missed DOIs, plus an unknown share of the 70% of hits being false positives, I strongly recommend OpenAlex over Semantic Scholar. I have attached two files listing the DOIs that each API missed, for reference.
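If someone wanted to estimate the false-positive share without checking every record by hand, one option is a crude heuristic that flags "abstracts" resembling the author-affiliation block in the Salvinia example. This is only an assumption-laden sketch: the markers (a leading "[*]", an "E-mail:" label, or any e-mail address) are drawn from that single example, and the function name is hypothetical. It was not part of the verification described in this thread.

```python
import re

# Loose e-mail pattern; good enough for flagging, not for validation.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


def looks_like_affiliation_block(abstract: str) -> bool:
    """Heuristically flag text that resembles an author/affiliation footnote."""
    text = abstract.strip()
    return (
        text.startswith("[*]")
        or "E-mail:" in text
        or bool(EMAIL_RE.search(text))
    )
```

Flagged records would still need a manual look, since a genuine abstract can occasionally contain a contact address.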

hschilling commented 2 years ago

Good job! Looks like OpenAlex it is. Thanks

dsmith111 commented 2 years ago

OpenAlex has been working well with the data pipeline over the past few months. Should be safe to close.