petermr / pygetpapers

a Python version of getpapers
Apache License 2.0

Add OpenAlex as repo #44

Open smierz opened 2 years ago

smierz commented 2 years ago

Hello, today I saw a tweet on Twitter (https://twitter.com/aarontay/status/1548674479601004545?t=n7WRkMg2ehfhfLnRlXZ-mA&s=19) and in the comments it was suggested that OpenAlex would be a good addition. While OpenAlex is not a repository itself, it can be queried for works via search terms, and if a work is Open Access the metadata may include a link to the PDF, which could be used for downloading the paper.
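To make the idea concrete, here is a minimal sketch of picking a download link out of a single work's metadata. It assumes the work is a plain dict shaped like an OpenAlex work object, with an `open_access` block (`is_oa`, `oa_url`) and a `best_oa_location` block (`pdf_url`); those field names reflect my understanding of the OpenAlex schema and should be checked against the current docs. The helper name `find_oa_link` and the sample record are made up for illustration.

```python
from typing import Optional


def find_oa_link(work: dict) -> Optional[str]:
    """Return a full-text link for an Open Access work, or None.

    Field names are assumptions about the OpenAlex work schema.
    """
    open_access = work.get("open_access") or {}
    if not open_access.get("is_oa"):
        return None  # not Open Access, nothing to download

    # Prefer a direct PDF link; fall back to the general OA URL,
    # which may be a landing page rather than a PDF.
    best_location = work.get("best_oa_location") or {}
    return best_location.get("pdf_url") or open_access.get("oa_url")


# Hand-written sample record (not real data).
sample_work = {
    "id": "https://openalex.org/W0000000000",
    "open_access": {"is_oa": True, "oa_url": "https://example.org/paper"},
    "best_oa_location": {"pdf_url": "https://example.org/paper.pdf"},
}
print(find_oa_link(sample_work))
```

Works without any usable link simply yield `None`, so a downloader could skip them without special-casing.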

Do you think including OpenAlex would be a good fit?

I would be happy to help/implement the feature.

aarontaycheehsien commented 2 years ago

Yes, I think the fact that OpenAlex isn't a repository of OA content isn't a problem, since pygetpapers supports the Crossref API, which also isn't a repository of full text?

petermr commented 2 years ago

Sandra, that's a great offer!

Ayush has deliberately written pygetpapers to be extensible, so it should be possible to write a module which doesn't cause problems with existing code. We use Python 3.8 (3.10 gives library problems).

We have regular daily meetings for our project (currently "semanticClimate") at 1230 UTC Mon-Fri on Zoom, and you could drop in occasionally to chat, and join our Slack. Or you can do it all asynchronously (GitHub Discussions: https://github.com/petermr/petermr/discussions/10). Or you can work directly with Ayush. You can decide what suits you best.

Ayush is planning the next phase (a layered GUI on top of the commandline)

Peter (I know Heather and Jason at OpenAlex so if you are able to create something useful we can let them know).

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

ayush4921 commented 2 years ago

@smierz Hi! I do think OpenAlex is a great addition to pygetpapers. I played around with OpenAlex and diophila.

It shouldn't be too difficult to add something like diophila to pygetpapers. (habanero was a Crossref wrapper that we integrated into pygetpapers.)

smierz commented 2 years ago

Hi @ayush4921 , I already looked a bit around your code and you're right, shouldn't be too hard.

One thing that is currently missing in diophila though is the "search" parameter. While you can use a filter that searches the work's title, the search parameter looks into the title and the abstract (see the infobox in the works convenience filter section: https://docs.openalex.org/api/get-lists-of-entities/filter-entity-lists#works-convenience-filters). I think this would be the preferred way of searching for a term and then getting the papers, so I will add that to diophila beforehand 🙂
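The difference between the two approaches can be seen at the level of the raw API request. The following is a rough sketch, not diophila's implementation: it only builds the request URLs with the standard library, assuming the OpenAlex `works` endpoint at `https://api.openalex.org/works` and its `filter` and `search` query parameters. The helper name `build_works_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org/works"  # assumed endpoint


def build_works_url(filters=None, search=None):
    """Build a works query URL from a filter dict and/or a search term."""
    params = {}
    if filters:
        # OpenAlex filters are comma-separated "key:value" pairs.
        params["filter"] = ",".join(f"{k}:{v}" for k, v in filters.items())
    if search:
        params["search"] = search
    return f"{BASE_URL}?{urlencode(params)}" if params else BASE_URL


# Filter that only looks at the title.
title_only = build_works_url(filters={"title.search": "invasive plant species"})

# Search parameter (title and abstract), restricted to Open Access works.
title_and_abstract = build_works_url(
    filters={"is_oa": "true"}, search="invasive plant species"
)

print(title_only)
print(title_and_abstract)
```

Keeping `search` separate from `filters` mirrors how the API treats them, so the two can be combined freely.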

petermr commented 2 years ago

Thanks so much for getting back, and delighted that you and Ayush Garg can work together. Copied to Daniel Mietchen, Wikimedian and long-term collaborator.

Search is very important!

Ayush is keen to add a scoring system to search hits, especially with the long download times for many people with poor connections.

One of the interesting challenges with OpenAlex (and Crossref, etc.) is that it's heavy on bibliography but light on content. I think the key possibility is Wikidata. OpenAlex already links into Wikidata through their concepts (about 65K), but these probably mainly reflect "bibliographic" items (institutions, etc.). We want to add content-indexing, and one method is through Wikidata's "main subject" field.

BTW we've worked a lot with Simon Worthington, Open Science TiB, so I've copied him.

(We can take some of this discussion off-list now we've made contact)

Great to work together

P.

ayush4921 commented 2 years ago

Sure @smierz! Do ping me once the search functionality has been added!

smierz commented 2 years ago

Ping! The search parameter is now included in diophila 0.4.0.

Here is a simple script to see how it works (for use in Jupyter notebook)

!pip install diophila

from diophila import OpenAlex
openalex = OpenAlex()

filters = {"is_oa": "true"}
search = '"invasive plant species"'
pages_of_works = openalex.get_list_of_works(filters=filters, search=search, pages=[1])

for page in pages_of_works:        # loop through pages
    print(page['meta']['count'])
    for work in page['results']:   # loop through list of works
        print(work['id'])
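
For a downloader that wants at most N hits, the nested loop above could be wrapped in a small generator. This is only a sketch: it assumes each page is a dict with a `results` list, as in the script, and the helper name `iter_works` and the fake pages are made up so the example runs without network access.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def iter_works(pages: Iterable[dict], limit: Optional[int] = None) -> Iterator[dict]:
    """Yield works from a sequence of result pages, stopping after `limit`."""
    works = (work for page in pages for work in page.get("results", []))
    return islice(works, limit)  # limit=None means no cap


# Stand-in pages shaped like the API responses used in the script.
fake_pages = [
    {"meta": {"count": 3}, "results": [{"id": "W1"}, {"id": "W2"}]},
    {"meta": {"count": 3}, "results": [{"id": "W3"}]},
]

for work in iter_works(fake_pages, limit=2):
    print(work["id"])
```

Because the pages are consumed lazily, stopping at the limit should also avoid requesting further pages when `pages` is itself a generator.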