nestauk / dap_aria_mapping

Mapping technology innovation to support The Advanced Research and Innovation Agency (ARIA)
MIT License

Patents Pipeline #22

Closed india-kerle closed 1 year ago

india-kerle commented 1 year ago

Description

This PR contains: 1) an initial analysis of the number of patents in Google BigQuery between 2016 and 2021 where at least one inventor is from the UK; 2) the full, initial patents pipeline, which queries Google BigQuery for patents data, cleans it and saves it to S3.
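For reviewers who want a feel for the kind of query involved, here is a minimal sketch of a query builder. It is not the query used in this PR: the table name (`patents-public-data.patents.publications`), the column names, the use of `filing_date` as the date filter, the `GB` country code and the function name `build_patents_query` are all assumptions about the Google Patents public dataset. The function only builds a string, so calling it does not cost anything.

```python
def build_patents_query(
    start_year: int = 2016, end_year: int = 2021, country: str = "GB"
) -> str:
    """Build (but do not run) a BigQuery SQL string.

    Assumptions, not taken from the PR: dates are stored as YYYYMMDD
    integers, and inventors sit in an array of structs that carry a
    country_code field.
    """
    start = start_year * 10000 + 101  # e.g. 20160101
    end = end_year * 10000 + 1231  # e.g. 20211231
    return f"""
    SELECT
      publication_number,
      title_localized,
      abstract_localized,
      filing_date,
      inventor_harmonized,
      assignee_harmonized
    FROM `patents-public-data.patents.publications`
    WHERE filing_date BETWEEN {start} AND {end}
      AND EXISTS (
        SELECT 1
        FROM UNNEST(inventor_harmonized) AS inv
        WHERE inv.country_code = '{country}'
      )
    """


query = build_patents_query()
```

The `EXISTS ... UNNEST(...)` clause is one way to express "at least one inventor is from the UK" without duplicating a patent once per inventor.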

This PR closes:

#19

#3

#4

Instructions for Reviewer

To pull data from Google BigQuery:

python dap_aria_mapping/pipeline/data_collection/patents.py run

BEWARE!!! If production=true, this will cost money! I accidentally charged myself £5.36 for the full dataset. I've since changed the card to Nesta's business card, but let's not run it in production unless we really have to re-collect the data.

To post process the data:

python dap_aria_mapping/pipeline/data_collection/processed_patents.py run

This reformats all the date columns, extracts English titles and abstracts, and unnests the assignee and inventor columns so that each is a list of strings instead of a list of dicts.

It also creates a lookup table that Sam will need to extract entities.
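The post-processing steps described above can be sketched on a single record as follows. This is only an illustration, not the code in processed_patents.py: the helper names (`format_date`, `pick_english_text`, `unnest_names`), the field names in `raw_patent`, and the assumption that dates arrive as YYYYMMDD integers with 0 meaning "missing" are all hypothetical.

```python
from datetime import datetime
from typing import Optional


def format_date(raw: int) -> Optional[str]:
    """Turn an integer such as 20180315 into '2018-03-15' (0 or None -> None)."""
    if not raw:
        return None
    return datetime.strptime(str(raw), "%Y%m%d").date().isoformat()


def pick_english_text(localized: list) -> Optional[str]:
    """Return the text of the first entry tagged as English, if there is one."""
    for entry in localized:
        if entry.get("language") == "en":
            return entry.get("text")
    return None


def unnest_names(records: list, key: str = "name") -> list:
    """Flatten a list of dicts into a list of strings, skipping empty values."""
    return [record[key] for record in records if record.get(key)]


# Hypothetical raw record, shaped like a nested query result.
raw_patent = {
    "publication_number": "GB-0000000-A",
    "filing_date": 20180315,
    "grant_date": 0,
    "title_localized": [
        {"text": "Titre", "language": "fr"},
        {"text": "Title", "language": "en"},
    ],
    "inventor_harmonized": [
        {"name": "A SMITH", "country_code": "GB"},
        {"name": "B JONES", "country_code": "US"},
    ],
}

clean_patent = {
    "publication_number": raw_patent["publication_number"],
    "filing_date": format_date(raw_patent["filing_date"]),
    "grant_date": format_date(raw_patent["grant_date"]),
    "title": pick_english_text(raw_patent["title_localized"]),
    "inventor": unnest_names(raw_patent["inventor_harmonized"]),
}
```

Returning `None` for a missing date or a missing English text keeps the gaps explicit, so downstream steps can decide how to handle them.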

To test the code in this PR, run:

pytest dap_aria_mapping/pipeline/data_collection/tests/test_patents.py

To make sure that the clean patents dataset:

pytest dap_aria_mapping/pipeline/data_collection/tests/test_processed_patents.py

This tests the unnest_column and extract_english_text processing functions.

Please pay special attention to ...

running python dap_aria_mapping/pipeline/data_collection/patents.py run in production - it will cost money!


Jack-Vines commented 1 year ago

Side note: this is gonna cause merge conflicts with my branch, but once this is merged I'll fix them on my branch, then re-request a review.