Overview
This closes #16. It incorporates some useful basics for other Kedro pipelines (modular pipelines, partitioned datasets), but also collates the findings from hours spent trying to get dynamic folder-path configuration right. It lacks unit tests, which are left for future work.
Implementation Details
Main Pipeline
The pipeline is designed to collect incoming and outgoing citations for a given work (cited_by vs cites).
Modular pipelines (incoming_pipeline and outgoing_pipeline) are used for flexibility, so the same node logic can be reused for both citation directions.
A downstream_impact_pipeline is included to assess the impact of a work, defined as cited_by citations of all papers that cite the parent paper (intended to be used with the AF paper).
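Conceptually, a modular pipeline is one set of node definitions reused under different namespaces, so each instance reads and writes its own prefixed datasets. A toy, Kedro-free illustration of that renaming (the real mechanism is Kedro's `pipeline(..., namespace=...)` wrapper; the dataset names here are hypothetical, not the PR's actual ones):

```python
# Toy illustration of modular-pipeline namespacing (NOT the Kedro API):
# one template of (input, output) dataset pairs is reused under two
# namespaces, so each instance gets its own prefixed dataset names.
def namespaced(template, namespace):
    """Prefix every dataset name in the template with the namespace."""
    return [(f"{namespace}.{inp}", f"{namespace}.{out}") for inp, out in template]

# Shared template: fetch raw citations, then clean them.
template = [("works", "raw_citations"), ("raw_citations", "citations")]

incoming_pipeline = namespaced(template, "incoming")
outgoing_pipeline = namespaced(template, "outgoing")

print(incoming_pipeline[0])  # → ('incoming.works', 'incoming.raw_citations')
```

In real Kedro the namespace also scopes node names and parameters, which is what lets the incoming and outgoing pipelines coexist in one project without dataset-name collisions.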
Dynamic Path Configuration Experiment
The key challenge addressed is the dynamic creation of subfolders in S3 paths for partitioned datasets based on the work_id.
The intended functionality was to use work_id directly in the DataCatalog configuration for dynamic subfolder creation.
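The intended behavior amounts to simple string interpolation: each partitioned dataset would land under a subfolder keyed by work_id. A minimal sketch (the bucket name and path layout below are made up for illustration, not the project's real catalog paths):

```python
# Hypothetical illustration of the desired dynamic path: a catalog-style
# path template whose {work_id} placeholder is filled at run time.
PATH_TEMPLATE = "s3://example-bucket/openalex/{work_id}/incoming_citations"

def resolve_path(template: str, work_id: str) -> str:
    """Fill the work_id placeholder, yielding one subfolder per work."""
    return template.format(work_id=work_id)

print(resolve_path(PATH_TEMPLATE, "W2741809807"))
# → s3://example-bucket/openalex/W2741809807/incoming_citations
```

The difficulty described in this PR is that the DataCatalog is materialized from configuration before run-time parameters like work_id are available, so this substitution cannot happen naively inside catalog.yml.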
Current Solution
Modified the Kedro settings so that parameters are read in as globals, which are visible to the catalog (i.e. IO-aware). This allows save paths to be templated using parameter values.
This approach successfully creates dynamic subfolders for downstream_incoming_citations.
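With parameters exposed as globals, a catalog entry can interpolate them into its path. A hedged sketch of what such an entry might look like under Kedro's OmegaConfigLoader (the dataset type, bucket, and key names are illustrative, not the PR's actual configuration):

```yaml
# catalog.yml -- illustrative sketch, not the PR's actual entry
downstream_incoming_citations:
  type: partitions.PartitionedDataset
  path: s3://example-bucket/openalex/${globals:work_id}/downstream_incoming
  dataset: pandas.CSVDataset
```

The `${globals:...}` resolver is evaluated when the config is loaded, which is why the work_id has to be smuggled in as a global rather than read from ordinary run-time parameters.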
Failed Solutions
Several methods were explored to achieve dynamic path configuration:
Hooks (after_node_run): suggested by the Kedro community in Slack discussions. However, it did not yield the desired result, so I'm shelving it for now.
Config Resolvers (OmegaConf): attempted, but this did not meet the requirements.
Custom JSON Dataset: creating a dedicated dataset for subfolder management; this also turned out not to be optimal.
The chosen solution, while functional, blurs the line between parameters and globals, which is not ideal from a data engineering perspective.
Future
Investigate more robust methods for dynamic path configuration that maintain a clear separation between parameters and globals.
Explore extending Kedro's functionality or integrating more seamlessly with existing features such as hooks or custom datasets; it's highly unlikely that there is no better way of accessing the data catalog with pipeline parameters.
Some of the behavior is intentional: it is best practice in data engineering to keep pipelines I/O-unaware, which makes for easier versioning and code maintenance. The current solution undermines this.
Sum up
This PR includes the full implementation of the OpenAlex pipeline with a suboptimal solution for dynamic path configuration.