This project aims to enhance the impact and team science metrics of UKRI-funded research publications by linking publication data from the Gateway to Research (GtR) database to OpenAlex. The methodology addresses the challenge of matching publications from self-reported, often incomplete data through a dual approach to reverse lookups when Digital Object Identifiers (DOIs) are absent. The process includes steps to enrich citation data with contextual information and to measure the interdisciplinary nature of research teams.
The core of the project involves linking GtR publication data with OpenAlex entries even in the absence of DOIs. Our approach combines reverse lookups that generate candidate DOI matches from self-reported metadata with systematic merging and similarity-based refinement of those candidates; a hedged sketch of the lookup step follows below. These methods ensure robust additional labelling, improving coverage for subsequent analysis.
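As one illustration of the reverse-lookup idea, the Crossref REST API accepts bibliographic queries and returns candidate works. The helper below is a sketch only: the function name and result shape are ours, not the project's actual code, and real matching would also compare years, journals, and author lists.

```python
import requests

def candidate_dois(title: str, author: str, rows: int = 5) -> list[dict]:
    """Query Crossref for works resembling a self-reported GtR record.

    Illustrative sketch: the helper name and output format are
    assumptions, not the project's actual matching logic.
    """
    resp = requests.get(
        "https://api.crossref.org/works",
        params={
            "query.bibliographic": title,  # fuzzy title/bibliographic query
            "query.author": author,
            "rows": rows,
        },
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return [
        {"doi": item["DOI"], "title": (item.get("title") or [""])[0]}
        for item in items
    ]

# Example: candidate DOIs for a record that has no DOI of its own.
# print(candidate_dois("Deep learning for protein folding", "Smith"))
```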
Using Semantic Scholar’s API, we collect contextual citation information and categorise citations based on intent. We complement this with data from open-access full-text publications tagged by OpenAlex or available through CORE. A classification model will be trained to identify citation intent, enhancing our understanding of the influence of UKRI-funded research.
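For illustration, Semantic Scholar's Graph API exposes `contexts` and `intents` fields on its citations endpoint. The sketch below is a minimal, hedged example of collecting them for a single paper; the helper name is ours, and pagination, retries, and API-key handling are omitted.

```python
import requests

S2_API = "https://api.semanticscholar.org/graph/v1"

def citation_contexts(paper_id: str, limit: int = 100) -> list[dict]:
    """Fetch citing records with their citation contexts and intents.
    Minimal sketch: no pagination, retries, or API-key handling."""
    resp = requests.get(
        f"{S2_API}/paper/{paper_id}/citations",
        params={"fields": "contexts,intents", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

# Example with a DOI-prefixed identifier (placeholder DOI):
# for record in citation_contexts("DOI:10.1234/example"):
#     print(record.get("intents"), record.get("contexts", [])[:1])
```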
We evaluate the interdisciplinary nature of research teams using the methodology of Leydesdorff, Wagner, and Bornmann (2019). This includes measuring the variety of disciplines in a team's publication history, the balance of publishing across those disciplines, and the disparity between them. We use the Leiden CWTS topics taxonomy for discipline classification.
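For reference, the indicator in that paper combines the three components multiplicatively. The following is a hedged reconstruction in our own notation, not necessarily the exact form implemented in the pipelines:

$$
\mathrm{DIV} = \underbrace{\frac{n_c}{N}}_{\text{variety}} \times \underbrace{(1 - \mathrm{Gini})}_{\text{balance}} \times \underbrace{\frac{\sum_{i \neq j} d_{ij}}{n_c\,(n_c - 1)}}_{\text{disparity}}
$$

where $n_c$ is the number of disciplines a team is active in, $N$ the total number of disciplines in the taxonomy, $\mathrm{Gini}$ the Gini coefficient of the team's distribution of publications over disciplines, and $d_{ij}$ the distance between disciplines $i$ and $j$.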
The project delivers enriched GtR-to-OpenAlex publication matches, citation-intent data, and team-level interdisciplinarity metrics, consolidated into final datasets ready for analysis and reporting.
To work with this package, clone it using git and install it as an editable package:

```bash
pip install -e .
```

`requirements.txt` should contain all necessary libraries (install them with `pip install -r requirements.txt`). Note that `scipdf` requires changes to its source code, namely in the `parser.py` file, to enable URL requests without a `.pdf` suffix. `scipdf` also needs both a spaCy language model and a running instance of GROBID; see scipdf's repository for instructions on setting these up.

In addition, environment variables are required to use S3 file repositories. See the Kedro documentation for how to create a `credentials.yml` file.
Execute the desired pipeline using the command `kedro run --pipeline <pipeline_name>`, replacing `<pipeline_name>` with the name of the pipeline you want to run (e.g., `data_collection_gtr`).
The project is organised into several key directories and files, each serving a specific purpose within the Kedro framework:
### `conf/`

The `conf/` directory contains configuration files that define the parameters and settings used throughout the project.
### `src/`

The `src/` directory contains the core codebase of the project, organised into submodules that correspond to different stages of the data processing pipeline:
- `dsit_impact/`: Main package containing the implementation of the data pipelines and utilities.
  - `datasets/`: Modules for handling different types of datasets.
    - `pdf_dataset.py`: Module for processing PDF datasets.
    - `__init__.py`: Initialisation file for the datasets package.
  - `pipelines/`: The data processing pipelines.
    - `data_collection_gtr/`: Pipeline for collecting data from the Gateway to Research (GtR) API.
    - `data_matching_oa/`: Pipeline for matching datasets with OpenAlex.
    - `data_collection_s2/`: Pipeline for collecting data from Semantic Scholar (S2).
    - `data_processing_authors/`: Pipeline for processing author-related data.
    - `data_processing_pdfs/`: Pipeline for processing PDF documents.
    - `data_analysis_team_metrics/`: Pipeline for analysing interdisciplinary team metrics.
    - `data_generation/`: Pipeline for generating the final datasets for analysis.
The project leverages the Kedro framework to create modular and reusable data pipelines. Each pipeline is responsible for a specific aspect of the project, such as data collection, processing, or analysis. Kedro's structure helps in organising the project, making it easy to extend and maintain.
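To give a flavour of how such a pipeline is wired, a Kedro pipeline module typically exposes a `create_pipeline` factory. The node, function, and dataset names below are placeholders rather than the project's actual catalog entries:

```python
from kedro.pipeline import Pipeline, node

def clean_publications(raw_publications):
    """Placeholder processing step: drop records without a title."""
    return raw_publications.dropna(subset=["title"])

def create_pipeline(**kwargs) -> Pipeline:
    # Wire catalog dataset names to processing functions.
    return Pipeline(
        [
            node(
                func=clean_publications,
                inputs="gtr_publications_raw",       # placeholder catalog entry
                outputs="gtr_publications_cleaned",  # placeholder catalog entry
                name="clean_publications_node",
            )
        ]
    )
```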
This section details how the conceptual approach of the dsit-impact project is implemented within the codebase using specific pipelines. Each project narrative is closely linked to one or more code pipelines, which collectively process, analyse, and generate the required data and insights.
This part of the project revolves around linking Gateway to Research (GtR) publication data with OpenAlex entries, especially in cases where Digital Object Identifier (DOI) matching is not feasible. This involves generating potential DOI matches from metadata and systematically merging them to improve data coverage.
Relevant Pipelines:

- `data_collection_gtr`: Collects publication data from the Gateway to Research (GtR) API, handling the extraction of the metadata that is later used for matching with OpenAlex datasets.
- `data_matching_oa`: Matches the GtR data with OpenAlex entries, generating potential matches from metadata and refining them with techniques such as cosine similarity (a sketch of this refinement step follows this list).
- `utils`: Modules such as `oa_cr_merge.py`, `oa_match.py`, `cr.py`, and `oa.py` that perform the matching operations and merge the results.
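A hedged sketch of that refinement step: candidate titles returned by the reverse lookup can be ranked against the GtR title by TF-IDF cosine similarity. scikit-learn is an illustrative choice here, and the threshold is an assumption rather than the project's tuned value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_match(gtr_title: str, candidates: list[str], threshold: float = 0.8):
    """Rank candidate titles by character n-gram cosine similarity to
    the GtR title; return the best one if it clears the threshold."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
    matrix = vec.fit_transform([gtr_title] + candidates)
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    best = scores.argmax()
    return (candidates[best], scores[best]) if scores[best] >= threshold else None

# Example: pick the candidate closest to a self-reported GtR title.
print(best_match(
    "Deep learning for protein folding",
    ["Deep learning approaches to protein folding", "Quantum chemistry basics"],
))
```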
This part of the project aims to provide insights into how UKRI-funded research influences subsequent studies by analysing the citation intent and section context of publications that cite UKRI-linked papers. This involves collecting and processing citation context data from Semantic Scholar and from open-access full-text publications.
Relevant Pipelines:

- `data_collection_s2`: Collects citation context data from Semantic Scholar (S2), gathering information on onward and backward citations and categorising each based on its intent (e.g., background, methodology, results).
- `data_processing_pdfs`: Processes open-access full-text publications to identify where in the text UKRI-linked papers are cited, extracting the text adjacent to each citation for use in training a citation-intent model (see the sketch after this list).
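To give a flavour of the intent-classification step, the sketch below trains a toy baseline on labelled citation contexts. The labels, example sentences, and model choice are illustrative assumptions; the project's actual classifier may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled contexts; in practice these would come from S2 intents
# or manually annotated sentences surrounding citation markers.
contexts = [
    "We follow the method proposed by [CITE] to preprocess the corpus.",
    "Our results are consistent with the findings reported in [CITE].",
    "Prior work [CITE] established the theoretical background of this area.",
]
labels = ["methodology", "results", "background"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(contexts, labels)

print(clf.predict(["We adopt the sampling strategy of [CITE]."]))
```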
This part of the project evaluates the interdisciplinary nature of research teams involved in UKRI-funded projects, implementing a methodology that measures diversity in disciplines through variety, balance, and disparity metrics.
Relevant Pipelines:

- `data_processing_authors`: Processes author-related data, linking authors to their respective disciplines and identifying patterns in their publication history. This data serves as the basis for computing the interdisciplinarity metrics.
- `data_analysis_team_metrics`: Implements the methodology proposed by Leydesdorff, Wagner, and Bornmann (2019) to calculate interdisciplinarity metrics for research teams, bringing together the variety of disciplines, the balance of publishing behaviour, and the disparity between disciplines (see the sketch after this list).
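A hedged sketch of what that calculation can look like, using the DIV decomposition shown earlier. The NumPy implementation and the toy numbers are ours, not the pipeline's actual code.

```python
import numpy as np

def div_index(p: np.ndarray, d: np.ndarray, total_disciplines: int) -> float:
    """Sketch of the DIV indicator (Leydesdorff, Wagner and Bornmann, 2019):
    variety x balance x disparity. `p` holds a team's publication
    proportions over the disciplines it is active in, and `d` the
    pairwise distance matrix between those disciplines."""
    n = len(p)
    variety = n / total_disciplines
    # Balance as 1 - Gini of the discipline proportions.
    cum = np.cumsum(np.sort(p))
    gini = (n + 1 - 2 * np.sum(cum) / cum[-1]) / n
    balance = 1 - gini
    # Disparity: mean pairwise distance between distinct disciplines.
    disparity = d[~np.eye(n, dtype=bool)].mean() if n > 1 else 0.0
    return variety * balance * disparity

# Toy example: three disciplines, equal output, moderate distances.
p = np.array([1 / 3, 1 / 3, 1 / 3])
d = np.array([[0.0, 0.4, 0.8], [0.4, 0.0, 0.6], [0.8, 0.6, 0.0]])
print(div_index(p, d, total_disciplines=10))
```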
After all data has been collected, processed, and analysed, the final datasets are generated for further analysis or reporting. This stage consolidates the outputs of the various pipelines.
Relevant Pipeline:

- `data_generation`: Generates the final datasets by integrating the outputs from the other pipelines, preparing the data for subsequent analysis or reporting.