Overview
This closes #16. It incorporates some useful basics for other Kedro pipelines (modular pipelines, partitioned datasets), but also collates the findings from hours spent trying to get dynamic folder-path configuration right. It lacks unit tests, which are left for future work.
Implementation Details
Main Pipeline
The pipeline is designed to collect incoming and outgoing citations for a given work (cited_by vs cites).
Modular pipelines (incoming_pipeline and outgoing_pipeline) are used for flexibility, so the same node logic can be reused for both citation directions.
A downstream_impact_pipeline is included to assess the impact of a work, defined as cited_by citations of all papers that cite the parent paper (intended to be used with the AF paper).
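Conceptually, a modular pipeline is one set of node definitions reused under different namespaces, so each instance reads and writes its own prefixed datasets. A toy, Kedro-free illustration of that renaming (the real mechanism is Kedro's `pipeline(..., namespace=...)` wrapper; the dataset names here are hypothetical, not the PR's actual ones):

```python
# Toy illustration of modular-pipeline namespacing (NOT the Kedro API):
# one template of (input, output) dataset pairs is reused under two
# namespaces, so each instance gets its own prefixed dataset names.
def namespaced(template, namespace):
    """Prefix every dataset name in the template with the namespace."""
    return [(f"{namespace}.{inp}", f"{namespace}.{out}") for inp, out in template]

# Shared template: fetch raw citations, then clean them.
template = [("works", "raw_citations"), ("raw_citations", "citations")]

incoming_pipeline = namespaced(template, "incoming")
outgoing_pipeline = namespaced(template, "outgoing")

print(incoming_pipeline[0])  # → ('incoming.works', 'incoming.raw_citations')
```

In real Kedro the namespace also scopes node names and parameters, which is what lets the incoming and outgoing pipelines coexist in one project without dataset-name collisions.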
Dynamic Path Configuration Experiment
The key challenge addressed is the dynamic creation of subfolders in S3 paths for partitioned datasets based on the work_id.
The intended functionality was to use work_id directly in the DataCatalog configuration for dynamic subfolder creation.
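The intended behavior amounts to simple string interpolation: each partitioned dataset would land under a subfolder keyed by work_id. A minimal sketch (the bucket name and path layout below are made up for illustration, not the project's real catalog paths):

```python
# Hypothetical illustration of the desired dynamic path: a catalog-style
# path template whose {work_id} placeholder is filled at run time.
PATH_TEMPLATE = "s3://example-bucket/openalex/{work_id}/incoming_citations"

def resolve_path(template: str, work_id: str) -> str:
    """Fill the work_id placeholder, yielding one subfolder per work."""
    return template.format(work_id=work_id)

print(resolve_path(PATH_TEMPLATE, "W2741809807"))
# → s3://example-bucket/openalex/W2741809807/incoming_citations
```

The difficulty described in this PR is that the DataCatalog is materialized from configuration before run-time parameters like work_id are available, so this substitution cannot happen naively inside catalog.yml.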
Current Solution
Modified the Kedro settings so that parameters are read in as globals, which are visible to the catalog (i.e. IO-aware). This allows save paths to be templated using parameter values.
This approach successfully creates dynamic subfolders for downstream_incoming_citations.
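With parameters exposed as globals, a catalog entry can interpolate them into its path. A hedged sketch of what such an entry might look like under Kedro's OmegaConfigLoader (the dataset type, bucket, and key names are illustrative, not the PR's actual configuration):

```yaml
# catalog.yml -- illustrative sketch, not the PR's actual entry
downstream_incoming_citations:
  type: partitions.PartitionedDataset
  path: s3://example-bucket/openalex/${globals:work_id}/downstream_incoming
  dataset: pandas.CSVDataset
```

The `${globals:...}` resolver is evaluated when the config is loaded, which is why the work_id has to be smuggled in as a global rather than read from ordinary run-time parameters.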
Failed Solutions
Several methods were explored to achieve dynamic path configuration:
Hooks (after_node_run): suggested by the Kedro community in Slack discussions. However, it did not yield the desired result, so I'm shelving it for now.
Config Resolvers (OmegaConf): attempted, but this did not meet the requirements.
Custom JSON Dataset: creating a dedicated dataset for subfolder management; this also turned out not to be optimal.
The chosen solution, while functional, blurs the line between parameters and globals, which is not ideal from a data engineering perspective.
Future
Investigate more robust methods for dynamic path configuration that maintain a clear separation between parameters and globals.
Explore extending Kedro's functionality or integrating more seamlessly with existing features such as hooks or custom datasets; it's highly unlikely that there is no better way of accessing the data catalog with pipeline parameters.
Some of the behavior is intentional: it is best practice in data engineering to keep pipelines I/O-unaware, which makes for easier versioning and code maintenance. The current solution undermines this.
Sum up
This PR includes the full implementation of the OpenAlex pipeline with a suboptimal solution for dynamic path configuration.