Sourcing Abstracts - Githubissues

sneakers-the-rat commented 10 months ago

(Placeholder issue to remind Jonny to write this when they get to their desk)

Crossref is missing a lot of abstracts, so we'll need to source them elsewhere as a background backfilling service.

Potential sources

OpenAlex
Semantic Scholar
???

smierz commented 10 months ago

I could work on fetching abstracts from OpenAlex when the field is empty in Crossref.

sneakers-the-rat commented 10 months ago

Hell yeah. I think we should have OpenAlex as a data source generally, so yes!

We don't have a structure yet for the data sources, but you can get an idea of how they're working so far by taking a look at crossref.py. Basically we are structuring them as functions that can also be run as background tasks, since that's how they'll mostly be used.

In this case we probably want to have a function for general fetching as well as one for backfilling (i.e. just try to fill x fields in the PaperCreate model). I was also going to take a crack at doing an ORCID data source, so that might help us get at some more coherent structure, but for now feel free to hack away :)
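A rough sketch of what the backfilling half could look like, assuming papers are passed around as plain dicts. The name `backfill` and the dict-based shape are hypothetical, not the project's actual `PaperCreate` API; the point is only that existing values are never overwritten.

```python
from __future__ import annotations

from typing import Any, Iterable, Mapping


def _is_empty(value: Any) -> bool:
    """Treat None and empty strings as 'missing'."""
    return value is None or value == ""


def backfill(
    paper: Mapping[str, Any],
    fetched: Mapping[str, Any],
    fields: Iterable[str],
) -> dict[str, Any]:
    """Return a copy of `paper` with empty `fields` filled from `fetched`.

    Hypothetical helper: values already present in `paper` are kept as-is.
    """
    filled = dict(paper)
    for field in fields:
        if _is_empty(filled.get(field)) and not _is_empty(fetched.get(field)):
            filled[field] = fetched[field]
    return filled
```

A general fetch function would return the whole `fetched` mapping, and the background task would call something like `backfill(paper, fetched, ["abstract"])` on records that are missing the field.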

smierz commented 10 months ago

I noticed that the tests are making real API calls every time they run. Any objections to using a recorder like pytest-recording?

sneakers-the-rat commented 10 months ago

No objection - the tests cache unique api calls so they only get made once, and I note there that we could can some responses if we want - https://github.com/sneakers-the-rat/journal-rss/blob/68848189e9102c39ea2f2e476c49cf77d9e7082f/tests/conftest.py#L21

I think it would be good to keep one live API call per (external) endpoint just to detect changes in the API itself, but hell ya, by all means add a recorder and some canned data to unit tests
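To make the record-once idea concrete, here is a hand-rolled sketch using only the standard library. It is not how pytest-recording works internally and not the caching in conftest.py; `replay_or_record`, the cassette file layout, and the injected `fetch` callable are all assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def replay_or_record(cassette: Path, url: str, fetch: Callable[[str], dict]) -> dict:
    """Return the canned response for `url`, recording it via `fetch` on first use.

    Hypothetical helper: `cassette` is a JSON file mapping URLs to responses.
    """
    recorded = json.loads(cassette.read_text()) if cassette.exists() else {}
    if url not in recorded:
        # The only live call for this URL; later runs read the file instead.
        recorded[url] = fetch(url)
        cassette.write_text(json.dumps(recorded, indent=2))
    return recorded[url]
```

The one live test per external endpoint would then simply bypass the cassette and call `fetch` directly.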

smierz commented 10 months ago

started working on querying OpenAlex for abstracts, but then stopped, because docs mention:

"OpenAlex doesn't include plaintext abstracts due to legal constraints."

switched to querying for the journal's homepage_url for now -> https://github.com/sneakers-the-rat/paper-feeds/pull/31

sneakers-the-rat commented 10 months ago

good catch, but also check this out

>>> import requests
>>> work = requests.get('https://api.openalex.org/works/W2741809807').json()
>>> abstract_inverted = work['abstract_inverted_index']

>>> inverted = {}
>>> for word, positions in abstract_inverted.items():
...     for position in positions:
...         inverted[position] = word
...
>>> # note: range(max(...)) stops one short of the highest position, so the final
>>> # word is dropped; range(max(inverted) + 1) would include it
>>> words = [inverted[i] for i in range(max(inverted.keys()))]
>>> abstract = ' '.join(words)

>>> print(abstract)
Despite growing interest in Open Access (OA) to scholarly literature, there is an unmet need for large-scale, up-to-date, and reproducible studies assessing the prevalence and characteristics of OA. We address this need using oaDOI, an open online service that determines OA status for 67 million articles. We use three samples, each of 100,000 articles, to investigate OA in three populations: (1) all journal articles assigned a Crossref DOI, (2) recent journal articles indexed in Web of Science, and (3) articles viewed by users of Unpaywall, an open-source browser extension that lets users find OA articles using oaDOI. We estimate that at least 28% of the scholarly literature is OA (19M in total) and that this proportion is growing, driven particularly by growth in Gold and Hybrid. The most recent year analyzed (2015) also has the highest percentage of OA (45%). Because of this growth, and the fact that readers disproportionately access newer articles, we find that Unpaywall users encounter OA quite frequently: 47% of articles they view are OA. Notably, the most common mechanism for OA is not Gold, Green, or Hybrid OA, but rather an under-discussed category we dub Bronze: articles made free-to-read on the publisher website, without an explicit Open license. We also examine the citation impact of OA articles, corroborating the so-called open-access citation advantage: accounting for age and discipline, OA articles receive 18% more citations than average, an effect driven primarily by Green and Hybrid OA. We encourage further research using the free oaDOI service, as a way to inform OA policy and

so it seems like it is just literally split by ' ' and truncated to 257?
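One thing worth checking before concluding the index is truncated: `range(max(inverted.keys()))` excludes the highest position, so the snippet above always drops the last word. A minimal sketch of the reconstruction as a reusable function (the name `abstract_from_inverted_index` is made up here), iterating over the positions that actually exist instead of a range:

```python
from __future__ import annotations


def abstract_from_inverted_index(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild plaintext from a {word: [positions]} inverted index."""
    by_position = {
        position: word
        for word, positions in inverted_index.items()
        for position in positions
    }
    # Sorting the known positions includes the last word and tolerates gaps.
    return " ".join(by_position[i] for i in sorted(by_position))
```

For example, `abstract_from_inverted_index({"open": [0, 3], "access": [1], "is": [2]})` gives `"open access is open"`.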

smierz commented 10 months ago

yea, I know you could do it (see my notebook), but what I meant was:
OpenAlex does not include plaintext abstracts due to legal constraints, so if we program a server that stores abstracts, the same constraints probably apply to us (or rather, to the person running the software on their server).

maybe a way around it would be to check the license of the work before getting/recreating the abstracts?
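A sketch of what such a license gate might look like, assuming the work JSON carries a license identifier under `primary_location.license` (worth confirming against the OpenAlex docs). The function name and the allow-list of licenses are assumptions for illustration, and which licenses actually permit storing an abstract is a legal question this code does not answer.

```python
# Assumption: an illustrative allow-list, not legal advice.
OPEN_LICENSES = {"cc-by", "cc-by-sa", "cc0", "public-domain"}


def may_store_abstract(work: dict) -> bool:
    """Return True only if the work's primary location has an allow-listed license.

    Hypothetical helper: missing or unknown licenses are treated as 'do not store'.
    """
    location = work.get("primary_location") or {}
    license_id = location.get("license")
    return isinstance(license_id, str) and license_id.lower() in OPEN_LICENSES
```

Defaulting to `False` when the license is missing keeps the backfill conservative: an abstract is only reconstructed and stored when the license is known and on the list.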

sneakers-the-rat / paper-feeds

Sourcing Abstracts #17