nestauk / dap_aria_mapping

Mapping technology innovation to support The Advanced Research and Innovation Agency (ARIA)
MIT License

Collect OpenAlex forward citations #38

Closed georgerichardson closed 1 year ago

georgerichardson commented 1 year ago

Description

This adds:

closes #37

Instructions for Reviewer

You can test the pipeline by running it without the production flag:

    python dap_aria_mapping/pipeline/data_collection/openalex_forward_citations.py --datastore=s3 run

Tests can be run using:

    pytest dap_aria_mapping/pipeline/data_collection/tests/test_openalex_forward_citations.py

Please pay special attention to the flow generally. I'm not sure that having to specify a year is the most logical approach to the collection, and some of the parameters aren't tested in test mode. Most of the logic is contained in the flow, so it can't be easily unit tested. I'm also not sure whether the way I'm saving and 'getting' the results is the most intuitive.

In particular, I think the way I'm handling imports during steps and the batched step is a bit of a clunky mix of approaches. Keen to generally get tips on best practice for that.
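One possible way to ease the unit-testing concern is to move small pieces of logic out of the flow steps into plain functions that the steps then call. The sketch below is illustrative only: `chunk` and `output_name` are hypothetical helpers, and the file-name pattern is an assumption rather than what this PR actually writes.

```python
from typing import Iterable, Iterator, List, Optional


def chunk(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items (e.g. work IDs per batch)."""
    if size < 1:
        raise ValueError("size must be >= 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def output_name(year: Optional[int]) -> str:
    """Hypothetical output file name: one file per year, or one for all years."""
    suffix = "all" if year is None else str(year)
    return f"forward_citations_{suffix}.json"
```

Because neither function depends on the flow object, both can be imported at module level and covered by ordinary pytest tests, leaving the steps themselves as thin wrappers.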

Checklist:

georgerichardson commented 1 year ago

@Jack-Vines I'm not quite sure what I did there, but I seem to have pushed a partially updated file. I've updated and fixed it now. Hopefully this should work fine.

The original version of this flow took a min and max year, but I decided to simplify it and the output file names by only allowing one year or all. In practice, I'm not sure all years should ever be run in one go as it'd hit the OpenAlex limit. Running the same script repeatedly is also tedious though. Ideas for a future enhancement of a script or scheduling system welcome!
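As a rough starting point for the "future enhancement" idea, a small driver could build one flow invocation per year and run them in sequence. This is a sketch under assumptions: the `--year` and `--production` option names are hypothetical (the thread does not state the flow's actual parameter names), and `build_command` and `run_years` are illustrative helpers. Only the script path and `--datastore=s3 run` come from the reviewer instructions.

```python
import subprocess
import sys
from typing import List

FLOW = "dap_aria_mapping/pipeline/data_collection/openalex_forward_citations.py"


def build_command(year: int, production: bool = False) -> List[str]:
    """Build the command for one year. `--year` is an assumed option name."""
    cmd = [sys.executable, FLOW, "--datastore=s3", "run", f"--year={year}"]
    if production:
        # Assumed spelling of the production flag; adjust to the real flow.
        cmd += ["--production", "True"]
    return cmd


def run_years(
    start: int, end: int, production: bool = False, dry_run: bool = True
) -> List[List[str]]:
    """Build (and optionally run) one command per year in [start, end]."""
    commands = [build_command(year, production) for year in range(start, end + 1)]
    if not dry_run:
        for cmd in commands:
            # Stop at the first failing year so a hit limit is noticed early.
            subprocess.run(cmd, check=True)
    return commands
```

By default `run_years` only returns the commands it would run, so the list can be inspected first. Spreading the real runs across days, to stay under the OpenAlex limit, would still need a scheduler on top of this.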