nestauk / dap_aria_mapping

Mapping technology innovation to support the Advanced Research and Invention Agency (ARIA)
MIT License

Add first iteration of openalex pipeline #21

Closed Jack-Vines closed 1 year ago

Jack-Vines commented 1 year ago

Description

Adds the OpenAlex works pipeline, which takes a set of parameters and queries OpenAlex one year at a time. Also adds a postprocessing pipeline, which de-inverts the abstracts and separates the publications into:

  1. Works
  2. Concepts
  3. Authorships
  4. Citations
  5. Abstracts
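As a rough illustration of the postprocessing step described above, here is a minimal sketch. It assumes each record follows the public OpenAlex work schema (`abstract_inverted_index`, `concepts`, `authorships`, `referenced_works`). The function names, the output column names, and the reading of "Citations" as the works a publication references are assumptions made for illustration, not the code in this PR.

```python
from typing import Any, Dict, List, Optional


def deinvert_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> Optional[str]:
    """Rebuild abstract text from an OpenAlex-style inverted index.

    The index maps each token to the list of positions where it occurs.
    """
    if not inverted_index:
        return None
    token_at_position = {}
    for token, positions in inverted_index.items():
        for position in positions:
            token_at_position[position] = token
    return " ".join(token_at_position[p] for p in sorted(token_at_position))


def split_work(work: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Split one work record into rows for five flat tables (hypothetical layout)."""
    work_id = work["id"]
    return {
        "works": [
            {
                "work_id": work_id,
                "title": work.get("display_name"),
                "publication_year": work.get("publication_year"),
            }
        ],
        "concepts": [
            {"work_id": work_id, "concept_id": c["id"], "score": c.get("score")}
            for c in work.get("concepts") or []
        ],
        "authorships": [
            {"work_id": work_id, "author_id": a["author"]["id"]}
            for a in work.get("authorships") or []
        ],
        # Assumption: "citations" means the works this publication references.
        "citations": [
            {"work_id": work_id, "cited_work_id": ref}
            for ref in work.get("referenced_works") or []
        ],
        "abstracts": [
            {
                "work_id": work_id,
                "abstract": deinvert_abstract(work.get("abstract_inverted_index")),
            }
        ],
    }
```

Keeping the work ID on every row means the five tables can be joined back together later.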

The pipeline also has a production parameter, which defaults to False. When it is False, the pipeline collects only one year of data (for debugging/testing). The PR also adds test files for the functions used in the pipeline.
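The year-by-year collection and the production switch could look roughly like the sketch below. The function names and the choice of the first year as the single test year are hypothetical. The URL uses the public OpenAlex `works` endpoint with its `publication_year` filter and `per-page` parameter. The flow in this PR may build its queries differently.

```python
from typing import List
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def years_to_collect(first_year: int, last_year: int, production: bool = False) -> List[int]:
    """Return every year in the range in production, otherwise only the first year."""
    years = list(range(first_year, last_year + 1))
    # Assumption: outside production, the first year is the one used for testing.
    return years if production else years[:1]


def build_query_url(year: int, per_page: int = 200) -> str:
    """Build an OpenAlex works query restricted to a single publication year."""
    params = {"filter": f"publication_year:{year}", "per-page": per_page}
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


if __name__ == "__main__":
    for year in years_to_collect(2007, 2022, production=False):
        print(build_query_url(year))
```

With the flag left at its default, a debugging run issues queries for one year only, which keeps the run short.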

Small aside: I couldn't get nesta_ds_utils to work in a batch environment, so some saving/loading functions are a bit manual. I'll pick this up outside this PR though.

Closes #7

Instructions for Reviewer

To test the code in this PR, run the following two commands from the project root (each may take up to half an hour):

python dap_aria_mapping/pipeline/data_collection/openalex.py --package-suffixes=.txt --datastore=s3 run

python dap_aria_mapping/pipeline/data_collection/processed_openalex.py --package-suffixes=.txt --datastore=s3 run


emily-bicks commented 1 year ago

Are you by chance using Python < 3.8? I think numpy dropped support for Python 3.7 and below in version 1.22.
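One quick way to test this hypothesis is to check the interpreter version in the environment where the install fails. The helper below is a hypothetical sketch, and the (3, 8) minimum is taken from the comment above rather than read from package metadata.

```python
import sys
from typing import Tuple


def meets_minimum(version: Tuple[int, ...], minimum: Tuple[int, int] = (3, 8)) -> bool:
    """Return True if the (major, minor, ...) version is at least the minimum."""
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    current = tuple(sys.version_info[:3])
    status = "OK" if meets_minimum(current) else "too old for numpy==1.23.4"
    print(f"Python {'.'.join(map(str, current))}: {status}")
```

Running this inside the batch environment, and not only locally, would show whether the two environments use different Python versions.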

On Mon, 5 Dec 2022 at 09:42, Jack Vines wrote:

Jack Vines commented on this pull request.

In requirements.txt (https://github.com/nestauk/dap_aria_mapping/pull/21#discussion_r1039362339):

@@ -1,2 +1,3 @@
-pandas
+pandas==1.5.1
 git+https://github.com/nestauk/nesta_ds_utils.git

I've amended this line so it works, but this isn't the error I was getting. I was getting errors like this:

ERROR: Could not find a version that satisfies the requirement numpy==1.23.4 (from nesta-ds-utils)

