innovation-growth-lab / dsit-impact

Enhancing the impact and team science metrics of UKRI-funded research publications.

Impact and Team Science Metrics for UKRI-funded Research Publications

Summary

This project aims to enhance the impact and team science metrics of UKRI-funded research publications by linking publication data from the Gateway to Research (GtR) database to OpenAlex. The methodology addresses the challenge of matching publications from self-reported, often incomplete metadata through a dual approach to reverse lookups when Digital Object Identifiers (DOIs) are absent. The process includes steps to enrich citation data with contextual information and to measure the interdisciplinary nature of research teams.

Approach and Methodology

DOI Labelling and Dataset Matching

The core of the project involves linking GtR publication data with OpenAlex entries, even in the absence of DOIs. Our approach includes:

  1. CrossRef API Utilisation: Generating potential DOI matches using publication metadata.
  2. OpenAlex Query Searches: Systematic API searches on OpenAlex using metadata combinations and cosine similarity measures.
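The similarity step in (2) can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function names, the token-level cosine measure, and the 0.8 acceptance threshold are all assumptions made for the example.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Token-level cosine similarity between two publication titles."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = math.sqrt(sum(v * v for v in va.values())) * \
        math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def best_openalex_match(gtr_title, candidates, threshold=0.8):
    """Return the candidate work whose title best matches the GtR title,
    or None if no candidate clears the similarity threshold."""
    if not candidates:
        return None
    score, best = max(
        ((cosine_similarity(gtr_title, c["title"]), c) for c in candidates),
        key=lambda pair: pair[0],
    )
    return best if score >= threshold else None
```

In practice the candidate list would come from an OpenAlex works query on the metadata combination; here it is any list of dicts with a `title` key.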

These methods ensure robust additional labelling, improving coverage for subsequent analysis.

Citation Intent and Section Identification

Using Semantic Scholar’s API, we collect contextual citation information and categorise citations based on intent. We complement this with data from open-access full-text publications tagged by OpenAlex or available through CORE. A classification model will be trained to identify citation intent, enhancing our understanding of the influence of UKRI-funded research.
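The categorisation step can be sketched as a simple aggregation over citation records. The record shape below (with `intents` and `contexts` fields) follows the Semantic Scholar Graph API's citation objects, but the sample data and the `intent_profile` helper are illustrative assumptions, not the project's code.

```python
from collections import Counter

# Hypothetical citation records in the shape returned by the
# Semantic Scholar Graph API citations endpoint (assumed here).
citations = [
    {"intents": ["methodology"], "contexts": ["We apply the method of ..."]},
    {"intents": ["background"], "contexts": ["Prior work has shown ..."]},
    {"intents": ["methodology", "result"], "contexts": ["Following ..., we find ..."]},
]

def intent_profile(records):
    """Count how often each citation intent appears across citing papers."""
    counts = Counter()
    for record in records:
        counts.update(record.get("intents", []))
    return dict(counts)
```

A profile like this gives a first-pass view of whether a paper is cited mainly for its methods, as background, or for its results, before any trained classifier is applied.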

Development and Implementation of Interdisciplinary Metrics

We evaluate the interdisciplinary nature of research teams using the methodology of Leydesdorff, Wagner, and Bornmann (2019), which decomposes diversity into variety, balance, and disparity.

We use the Leiden CWTS topics taxonomy for discipline classification.
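The variety, balance, and disparity components can be sketched as below. This is one common reading of the DIV indicator from Leydesdorff, Wagner, and Bornmann (2019), with balance taken as one minus the Gini coefficient; the share and distance inputs are illustrative assumptions, not the project's actual data structures.

```python
from itertools import combinations

def gini(values):
    """Gini coefficient of a list of non-negative shares."""
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    diff = sum(abs(a - b) for a in values for b in values)
    return diff / (2 * n * total)

def div_indicator(proportions, distance, n_total):
    """Diversity = variety * balance * disparity.

    proportions: dict mapping discipline -> share of the team's output
    distance:    dict mapping frozenset({disc_a, disc_b}) -> distance in [0, 1]
    n_total:     number of disciplines in the full classification
    """
    present = [d for d, p in proportions.items() if p > 0]
    nc = len(present)
    if nc < 2:
        return 0.0  # a single discipline has no interdisciplinarity
    variety = nc / n_total
    balance = 1.0 - gini([proportions[d] for d in present])
    pairs = list(combinations(present, 2))
    disparity = sum(distance[frozenset(p)] for p in pairs) / len(pairs)
    return variety * balance * disparity
```

In the project the disciplines would come from the CWTS topics taxonomy and the distances from inter-topic (dis)similarities; here both are supplied directly.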

Expected Outcomes

The project delivers:

  1. Categorised Dataset: A detailed dataset linking GtR publications to OpenAlex, including predicted DOIs and enriched with citation context and intent metrics.
  2. Scalable and Reusable Code: Python-written, user-friendly code following Open Source principles and Nesta’s guidelines.
  3. Explanatory Documentation: An accessible notebook detailing methodologies, code functionalities, and troubleshooting tips.
  4. Continuous Collaboration: Ongoing work with DSIT to ensure proper code transfer.

How to install

To work with this package, clone the repository with git and install it as an editable package:

pip install -e .

requirements.txt should contain all necessary libraries, but note that scipdf requires changes to its source code (in the parser.py file) to enable URL requests without a ".pdf" suffix. scipdf also requires a spaCy language model and a running instance of grobid. See scipdf's repository for instructions on how to set these up.

In addition, credentials are required to use S3 file repositories. See the kedro documentation for how to create a credentials.yml file.
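A minimal sketch of such a file, following the S3 example in the kedro documentation; the entry name dev_s3 is an assumption and must match whatever name your catalog entries reference:

```yaml
# conf/local/credentials.yml (git-ignored; values are placeholders)
dev_s3:
  client_kwargs:
    aws_access_key_id: <your-access-key-id>
    aws_secret_access_key: <your-secret-access-key>
```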

Getting started

Execute the desired pipeline using the command kedro run --pipeline <pipeline_name>. Replace <pipeline_name> with the name of the pipeline you want to run (e.g., data_collection_gtr).

Project Structure

The project is organised into several key directories and files, each serving a specific purpose within the Kedro framework:

Root Directory

Configuration Directory (conf/)

The conf/ directory contains the configuration files that define the parameters and settings used throughout the project.

Source Code Directory (src/)

The src/ directory contains the core codebase of the project, organised into submodules that correspond to different stages of the data processing pipeline.

Kedro Framework Context

The project leverages the Kedro framework to create modular and reusable data pipelines. Each pipeline is responsible for a specific aspect of the project, such as data collection, processing, or analysis. Kedro's structure helps in organising the project, making it easy to extend and maintain.

Key Concepts in Kedro:

  1. Nodes: Python functions that perform a single processing step.
  2. Pipelines: ordered collections of nodes with declared inputs and outputs.
  3. Data Catalog: declarative dataset configuration kept in conf/.
  4. Parameters: run-time settings passed to pipelines from configuration.

How the Project Ties to the Code Pipelines

This section details how the conceptual approach of the dsit-impact project is implemented within the codebase using specific pipelines. Each project narrative is closely linked to one or more code pipelines, which collectively process, analyse, and generate the required data and insights.

DOI Labelling and Dataset Matching

This section covers linking Gateway to Research (GtR) publication data with OpenAlex entries, especially in cases where Digital Object Identifier (DOI) matching is not feasible. This involves generating potential DOI matches and systematically merging them to improve data coverage.

Relevant Pipelines:

Citation Intent and Section Identification

This section aims to provide insights into how UKRI-funded research influences subsequent studies by analysing citation intent and section context from publications that cite UKRI-linked papers. This involves collecting and processing citation context data from Semantic Scholar and open-access full-text publications.

Relevant Pipelines:

Team Science Metrics

This section evaluates the interdisciplinary nature of research teams involved in UKRI-funded projects by implementing a methodology that measures diversity in disciplines through variety, balance, and disparity metrics.

Relevant Pipelines:

Data Generation

After all data has been collected, processed, and analysed, the final datasets are generated for further analysis or reporting. This stage consolidates the outputs of the various pipelines.

Relevant Pipeline: