research-software-directory / RSD-as-a-service

This repo contains the new RSD-as-a-service implementation
https://research.software

feat: allow harvesting citations of OpenAlex reference papers #1306

Closed ewan-escience closed 1 month ago

ewan-escience commented 1 month ago

Scrape citations from OpenAlex reference papers

Changes proposed in this pull request

How to test

To do

Migration:

After adding the openalex_id column and before dropping the external_id column, the following (untested) query should be executed:

UPDATE mention SET openalex_id = external_id WHERE external_id ~ '^https://openalex\.org/[WwAaSsIiCcPpFf]\d{3,13}$';
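To sanity-check which values the pattern in the query above accepts, the same expression can be tried outside the database. This is only a rough Python sketch, not part of the PR: the helper name `is_openalex_id` and the sample list are made up for illustration, and `re.ASCII` with `fullmatch` is used on the assumption that this matches how PostgreSQL treats `\d`, `^` and `$` here.

```python
import re

# Same character classes and digit count as the SQL pattern above.
# re.ASCII restricts \d to 0-9; fullmatch replaces the ^...$ anchors.
OPENALEX_ID = re.compile(
    r"https://openalex\.org/[WwAaSsIiCcPpFf]\d{3,13}", re.ASCII
)


def is_openalex_id(value):
    """Return True if value looks like a full OpenAlex ID URL (hypothetical helper)."""
    return value is not None and OPENALEX_ID.fullmatch(value) is not None


samples = [
    "https://openalex.org/W3159002838",  # work ID quoted in this thread
    "https://openalex.org/w3159002838",  # lowercase prefix letter is accepted too
    "https://openalex.org/W12",          # rejected: fewer than 3 digits
    "https://openalex.org/X3159002838",  # rejected: prefix letter not in the set
    "W3159002838",                       # rejected: bare key without the URL part
    None,                                # a NULL external_id never matches
]

for sample in samples:
    print(sample, is_openalex_id(sample))
```

Rows that fail this check would be left with an empty openalex_id by the UPDATE, which is why the counts below compare the pattern match against all non-NULL values.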

The following was tested in production and returned 5629:

SELECT COUNT(*) FROM mention WHERE external_id ~ '^https://openalex\.org/[WwAaSsIiCcPpFf]\d{3,13}$';

The following gave the same result of 5629:

SELECT COUNT(*) FROM mention WHERE external_id IS NOT NULL;

To check for unique entries (ignoring case), run:

SELECT COUNT(DISTINCT(LOWER(external_id))) FROM mention WHERE external_id ~ '^https://openalex\.org/[WwAaSsIiCcPpFf]\d{3,13}$';

which again yielded 5629.

If you do have duplicate entries, you can get them with:

SELECT LOWER(external_id), COUNT(LOWER(external_id)) FROM mention WHERE external_id ~ '^https://openalex\.org/[WwAaSsIiCcPpFf]\d{3,13}$' GROUP BY LOWER(external_id) HAVING COUNT(LOWER(external_id)) > 1;
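The grouping logic of the duplicate query above can also be illustrated outside the database. The following is a minimal Python sketch under stated assumptions: the function name `find_case_insensitive_duplicates` is hypothetical, the sample values are invented (only W3159002838 appears in this thread), and the input is assumed to have already been filtered by the OpenAlex pattern.

```python
from collections import Counter


def find_case_insensitive_duplicates(ids):
    """Group IDs by their lowercase form and keep groups with more than one entry.

    Mirrors GROUP BY LOWER(external_id) HAVING COUNT(...) > 1; None values
    (NULL in SQL) are skipped. Hypothetical helper, not part of the PR.
    """
    counts = Counter(i.lower() for i in ids if i is not None)
    return {key: n for key, n in counts.items() if n > 1}


# Invented sample rows: the first two differ only in the case of the prefix letter.
rows = [
    "https://openalex.org/W3159002838",
    "https://openalex.org/w3159002838",
    "https://openalex.org/W1234567",
    None,
]

duplicates = find_case_insensitive_duplicates(rows)
print(duplicates)
```

An empty result corresponds to the situation reported above, where the distinct count equals the total count.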

Closes #1291

PR Checklist:

sonarcloud[bot] commented 1 month ago

Quality Gate passed for 'rsd-database'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code


sonarcloud[bot] commented 1 month ago

Quality Gate failed for 'scrapers'

Failed conditions
C Reliability Rating on New Code (required ≥ A)


sonarcloud[bot] commented 1 month ago

Quality Gate passed for 'rsd-frontend'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code


jmaassen commented 1 month ago

Works as expected.

One question: given that we can get most of the information for https://openalex.org/W3159002838 after scraping, couldn't we also use this identifier to import the mention in the first place? The "Search for DOI or title" box could add the OpenAlexID?

ewan-escience commented 1 month ago

> One question: given that we can get most of the information for https://openalex.org/W3159002838 after scraping, couldn't we also use this identifier to import the mention in the first place? The "Search for DOI or title" box could add the OpenAlexID?

Yes, that's what I meant by the second TODO in the PR description. 🙂 I will open issues for the TODOs.