innovation-growth-lab / dsit-impact

Enhancing the impact and team science metrics of UKRI-funded research publications.

1d. Matching to OpenAlex - OA Queries #6

Closed · ampudia19 closed 2 months ago

ampudia19 commented 2 months ago

Methodological Readme for Collecting Data from OpenAlex using DOIs and Reverse Lookups with Crossref and OpenAlex

Overview

This readme outlines the methodology for collecting data from OpenAlex using DOIs, performing reverse lookups through Crossref to recover DOIs for unmatched records, and using OpenAlex's own matching capabilities through its works search. The pipeline is implemented in Kedro, and the nodes and utilities provided are designed to handle the preprocessing, fetching, and matching tasks.

Preprocessing Steps

  1. Preprocess DOIs

    • Extract DOIs from the input data and ensure they are in a standard format using regex.
    • Function: preprocess_publication_doi(df: pd.DataFrame) -> pd.DataFrame
    • Key Decision: Only consider valid DOIs that match the pattern 10\..+.
  2. Create DOI Input List

    • Generate a list of unique DOI values from the preprocessed data.
    • Group DOIs for efficient querying if specified.
    • Function: create_list_doi_inputs(df: pd.DataFrame, **kwargs) -> list
    • Key Decision: Use grouping to reduce the number of API calls (OpenAlex accepts up to 50 OR'd filter values per call). We also parallelise requests, which are rate-limited at 10 per second; using 8 workers keeps us safely under that limit. A minimal sketch of both preprocessing steps follows this list.
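
A minimal sketch of these two preprocessing steps, assuming the input frame has a `doi` column. The function names come from this readme; the regex handling and the `grouped`/`group_size` parameters are illustrative assumptions:

```python
import re
from typing import List

import pandas as pd

# Key decision from the readme: only DOIs matching 10\..+ are considered valid.
DOI_PATTERN = re.compile(r"(10\..+)")


def preprocess_publication_doi(df: pd.DataFrame) -> pd.DataFrame:
    """Normalise the `doi` column: lowercase, strip URL prefixes, keep valid DOIs."""
    df = df.copy()
    df["doi"] = (
        df["doi"]
        .str.lower()
        .str.replace(r"^https?://(?:dx\.)?doi\.org/", "", regex=True)
        .str.extract(DOI_PATTERN, expand=False)
    )
    return df


def create_list_doi_inputs(df: pd.DataFrame, grouped: bool = True, group_size: int = 50) -> List[str]:
    """Return unique DOIs, optionally pipe-joined into batches for OR-filter queries."""
    dois = df["doi"].dropna().unique().tolist()
    if not grouped:
        return dois
    # OpenAlex accepts up to 50 values in a single filter, joined with "|" (logical OR).
    return ["|".join(dois[i:i + group_size]) for i in range(0, len(dois), group_size)]
```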

Fetching Data from OpenAlex

  1. Fetch Papers

    • Use OpenAlex API to fetch papers based on the provided DOIs.
    • Parallelise requests to optimise performance.
    • Function: fetch_papers(ids: Union[List[str], List[List[str]]], mailto: str, perpage: int, filter_criteria: Union[str, List[str]], parallel_jobs: int = 8) -> Dict[str, List[Callable]]
    • Key Decision: Limit chunks to 80 IDs per API call, as this balances query complexity against API rate limits.
  2. Concatenate OpenAlex Data

    • Combine the partitioned JSON datasets into a single DataFrame.
    • Function: concatenate_openalex(data: Dict[str, AbstractDataset]) -> pd.DataFrame
    • Key Decision: Use pandas to load and concatenate the partitioned outputs efficiently (a minimal sketch of both fetching steps follows this list).
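
A hedged sketch of the fetch-and-concatenate flow. The real fetch_papers also takes filter_criteria and returns lazy callables for Kedro partitions; here results are fetched eagerly, _fetch_chunk is a hypothetical helper built on OpenAlex's cursor pagination, and the partition key naming is illustrative:

```python
from typing import Dict, List

import pandas as pd
import requests
from joblib import Parallel, delayed


def _fetch_chunk(chunk: str, mailto: str, perpage: int) -> List[dict]:
    """Fetch all works for one pipe-joined DOI chunk, following cursor pagination."""
    results, cursor = [], "*"
    while cursor:
        url = (
            "https://api.openalex.org/works"
            f"?filter=doi:{chunk}&per-page={perpage}&cursor={cursor}&mailto={mailto}"
        )
        data = requests.get(url, timeout=30).json()
        results.extend(data.get("results", []))
        cursor = data.get("meta", {}).get("next_cursor")  # None when pages are exhausted
    return results


def fetch_papers(ids: List[str], mailto: str, perpage: int = 200, parallel_jobs: int = 8) -> Dict[str, List[dict]]:
    """Query chunks in parallel; 8 workers stays under OpenAlex's 10 requests/s limit."""
    outputs = Parallel(n_jobs=parallel_jobs, verbose=10)(
        delayed(_fetch_chunk)(chunk, mailto, perpage) for chunk in ids
    )
    return {f"p{i}": out for i, out in enumerate(outputs)}


def concatenate_openalex(data: Dict[str, List[dict]]) -> pd.DataFrame:
    """Combine the partitioned JSON outputs into a single DataFrame."""
    return pd.concat((pd.DataFrame(records) for records in data.values()), ignore_index=True)
```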

Reverse Lookup using Crossref

Match DOIs with Crossref

Key Functions and Process

  1. Setting Up the Session

    • Establish a session with retry strategies to handle transient errors and rate limits.
    • Function: setup_session()
    • Details:
      import requests
      from requests.adapters import HTTPAdapter
      from urllib3.util.retry import Retry

      retry_strategy = Retry(total=3, status_forcelist=[429, 500, 502, 503, 504], allowed_methods=["HEAD", "GET", "OPTIONS"])
      adapter = HTTPAdapter(max_retries=retry_strategy)
      session = requests.Session()
      session.mount("https://", adapter)
      session.mount("http://", adapter)
      return session
  2. Cleaning HTML Entities

    • Clean and preprocess the input records to ensure accurate matching.
    • Function: clean_html_entities(input_record: Dict[str, Union[str, int, float]]) -> Dict[str, Union[str, int, float]]
    • Details:
      from html import unescape

      return {key: unescape(value.replace("&", "and")) if isinstance(value, str) else value for key, value in input_record.items()}
  3. Formulating the Query

    • Construct a query string from the bibliographic information.
    • Function: get_doi(outcome_id: str, title: str, author: str, journal: str, publication_date: str, mailto: str, session: requests.Session) -> Dict[str, str]
    • Details:
      query = f"{title}, {author}, {journal}, {publication_date}"
      url = f'https://api.crossref.org/works?query.bibliographic="{query}"&mailto={mailto}&rows=5'
  4. Fetching and Processing Results

    • Send the query to the Crossref API and process the returned results.
    • Extract relevant information and calculate fuzzy scores for matching.
    • Function: _process_item(item: Dict[str, Union[str, Dict[str, str]]], title: str, author: str, journal: str, publication_date: str) -> Union[Dict[str, Union[str, int, float]], None]
    • Details:
      # `year` is the publication year parsed from the Crossref item (parsing not shown in this excerpt)
      cr = {
       "title": item["title"][0],
       "author": f"{item['author'][0]['family']}, {item['author'][0]['given']}",
       "journal": item["container-title"][0],
       "year": year,
       "doi": item["DOI"].lower(),
       "score": item["score"],
       "year_diff": abs(year - int(publication_date[:4])),
      }
      fuzzy_scores = [
       fuzz.token_set_ratio(title.lower(), cr["title"].lower()),
       fuzz.token_set_ratio(author.lower(), cr["author"].lower()),
      ]
      if journal:
       fuzzy_scores.append(fuzz.token_set_ratio(journal, cr["journal"]))
      cr["fuzzy_score"] = sum(fuzzy_scores) / len(fuzzy_scores)
      return cr if cr["year_diff"] <= 1 else None
  5. Selecting the Best Match

    • From the processed results, select the best match based on a composite score.
    • Function: _select_best_match(outcome_id: str, matches: List[Dict[str, Union[str, int, float]]]) -> Union[Dict[str, Union[str, int, float]], None]
    • Details:
      best_match = None
      highest_score = 0
      for match in matches:
       cr_score = match["score"]
       cr_fuzzy_score = match["fuzzy_score"]
       composite_score = cr_score + cr_fuzzy_score
       if all([composite_score > highest_score, cr_score > 60, cr_fuzzy_score > 60]):
           highest_score = composite_score
           match["outcome_id"] = outcome_id
           best_match = match
      return best_match
  6. Batch Processing

    • Process records in batches to handle large datasets efficiently.
    • Function: crossref_doi_match(oa_data: pd.DataFrame, gtr_data: pd.DataFrame, mailto: str) -> Generator[Dict[str, pd.DataFrame], None, None]
    • Details:
      unmatched_data = gtr_data[~gtr_data["doi"].isin(oa_data["doi"])]
      inputs = unmatched_data[["outcome_id", "title", "author", "journal_title", "publication_date"]].to_dict(orient="records")
      cleaned_inputs = [clean_html_entities(record) for record in inputs]
      input_batches = [cleaned_inputs[i: i + 250] for i in range(0, len(cleaned_inputs), 250)]
      for i, batch in enumerate(input_batches):
       session = setup_session()
       results = Parallel(n_jobs=4, verbose=10)(delayed(get_doi)(x["outcome_id"], x["title"], x["author"], x["journal_title"], x["publication_date"], mailto, session) for x in batch)
       results = [r for r in results if r]
       df = pd.DataFrame(results)
       if df.empty:
           continue
       df = df.merge(pd.DataFrame(batch), on="outcome_id", how="right", suffixes=("_cr", "_gtr"))
       yield {f"s{i}": df}
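
Because crossref_doi_match is a generator yielding one small dictionary per batch, each batch can presumably be persisted as its own partition (e.g. a Kedro PartitionedDataset entry keyed s0, s1, ...), so a long run checkpoints every 250 records instead of holding all results in memory.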

Reverse Lookup using OpenAlex

Search and Match in OpenAlex

Key Functions and Process

  1. Cleaning HTML Entities for OpenAlex

    • Clean and preprocess input records for OpenAlex searches.
    • Function: clean_html_entities_for_oa(input_record: Dict[str, Union[str, int, float]]) -> Dict[str, Union[str, int, float, Dict[str, str]]]
    • Details:
      return {key: (value if key in ["outcome_id", "author", "publication_date"] else _process_string(value) if isinstance(value, str) else value) for key, value in input_record.items()}
  2. Formulating the Query

    • Construct queries for searching titles and authors in OpenAlex.
    • Function: get_oa_match(outcome_id: str, title: Union[str, List[str]], chapter_title: str, author: str, publication_date: str, config: Dict[str, str], session: requests.Session) -> List[Dict[str, str]]
    • Details:
       display_titles = title if not chapter_title else chapter_title
       if isinstance(display_titles, str):  # guard: a bare string would otherwise be iterated character by character
           display_titles = [display_titles]
      mailto = config["mailto"]
      candidate_outputs = []
      for candidate_title in display_titles:
       query = f"{candidate_title}"
       url = f"https://api.openalex.org/works?filter=title.search:{query}&mailto={mailto}&per-page=25"
        max_retries = 5
        attempts = 0
        while attempts < max_retries:
            attempts += 1
            try:
                response = session.get(url, timeout=20)
                data = response.json()
                results = data.get("results", [])
                candidate_outputs.append(results)
                break
            except KeyError as e:
                logging.warning("Missing key: %s", e)
            except Exception as e:
                logging.warning("Error fetching data: %s", e)
  3. Fuzzy Matching Authors

    • Use fuzzy matching to find authors with similar names.
    • Function: author_fuzzy_match(author: str, candidate_author: List[Dict[str, Union[str, Dict[str, str]]]])
    • Details:
      candidate_author_name = candidate_author["display_name"]
      author = " ".join([word for word in author.split() if len(word) > 1])
      fuzzy_score = fuzz.token_set_ratio(author.lower(), candidate_author_name.lower())
      if fuzzy_score >= 75:
       return candidate_author
      return None
  4. Filtering Candidates by Author and Date

    • Filter the candidates to retain those with matching authors and publication dates within a 2-year difference.
    • Function: get_oa_match(outcome_id: str, title: Union[str, List[str]], chapter_title: str, author: str, publication_date: str, config: Dict[str, str], session: requests.Session) -> List[Dict[str, str]]
    • Details:
      matching_author = []
      for candidate_output in candidate_flat:
       authorships = candidate_output["authorships"]
       for authorship in authorships:
           candidate_author = authorship["author"]
           matched_author = author_fuzzy_match(author, candidate_author)
           if matched_author:
               matching_author.append(candidate_output)
      matching_date = []
      if matching_author and publication_date:
       for candidate_output in matching_author:
           publication_year = candidate_output["publication_year"]
           year_diff = abs(int(publication_date[:4]) - int(publication_year))
           if year_diff <= 2:
               matching_date.append(candidate_output)

Key Utilities and Functions

  1. OpenAlex Utilities (oa.py)

    • Functions for fetching and processing data from OpenAlex.
    • Include _revert_abstract_index, _parse_results, preprocess_ids, _chunk_oa_ids, _works_generator, fetch_papers_for_id, and json_loader (a sketch of _revert_abstract_index follows this list).
  2. Crossref Utilities (cr.py)

    • Functions for reverse lookup and matching using Crossref.
    • Include _process_item, clean_html_entities, _select_best_match, setup_session, and get_doi.
  3. OpenAlex Matching Utilities (oa_match.py)

    • Functions for reverse lookup and matching using OpenAlex.
    • Include _process_string, clean_html_entities_for_oa, get_oa_match, and author_fuzzy_match.
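
As an illustration of these helpers: OpenAlex serves abstracts as an inverted index (each token mapped to its positions) rather than plain text, so _revert_abstract_index presumably flattens that structure back into a string. A minimal sketch under that assumption, not the repository's actual code:

```python
from typing import Dict, List


def _revert_abstract_index(abstract_inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild abstract text from OpenAlex's abstract_inverted_index field."""
    # Invert token -> positions into position -> token, then join in positional order.
    positions = {
        pos: token
        for token, pos_list in abstract_inverted_index.items()
        for pos in pos_list
    }
    return " ".join(token for _, token in sorted(positions.items()))


# Example: {"Deep": [0], "learning": [1], "works": [2]} -> "Deep learning works"
```
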
ampudia19 commented 2 months ago
Current results:

| title_cr_null | Non-Null | Null | Total |
| --- | --- | --- | --- |
| Non-Null | 15199 | 44809 | 60008 |
| Null | 61840 | 197841 | 259681 |
| Total | 77039 | 242650 | 319689 |
Success rates, by publication type:

| type | False | True |
| --- | --- | --- |
| Book | 0.616126 | 0.383874 |
| Book Chapter | 0.697857 | 0.302143 |
| Book edited | 0.754649 | 0.245351 |
| Conference Proceeding | 1 | 0 |
| Conference/Paper/Proceeding/Abstract | 0.763496 | 0.236504 |
| Consultancy Report | 0.89703 | 0.10297 |
| Data Set | 0.571429 | 0.428571 |
| Journal | 1 | 0 |
| Journal Article/Review | 0.477465 | 0.522535 |
| Manual/Guide | 0.933628 | 0.0663717 |
| Monograph | 0.80241 | 0.19759 |
| Other | 0.588269 | 0.411731 |
| Policy briefing/Report | 0.896172 | 0.103828 |
| Preprint | 0.501253 | 0.498747 |
| Report | 0.828767 | 0.171233 |
| Scholarly edition | 0.935252 | 0.0647482 |
| Systematic review | 0.768519 | 0.231481 |
| Technical Report | 0.8958 | 0.1042 |
| Technical Standard | 0.884211 | 0.115789 |
| Thesis | 0.841305 | 0.158695 |
| Working Paper | 0.71762 | 0.28238 |
| journal-issue | 0.333333 | 0.666667 |
| patent | 1 | 0 |
| All | 0.618711 | 0.381289 |
ampudia19 commented 2 months ago
Instances where both datasets match on something:

| title_cr_eq_title_oa | count |
| --- | --- |
| Equal | 13158 |
| Not Equal | 2041 |
Examples of matching:

| | title_gtr_cr | title_cr | title_oa | doi_cr | doi_oa | type | author_gtr | author_cr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 82004 | Nudge Theory and Social Innovation: An Analysis of Citizen and Government Initiatives during Covid-19 outbreak in Malaysia. | Nudge Theory and Social Innovation: An analysis of citizen and government initiatives during Covid-19 outbreak in Malaysia | Nudge Theory and Social Innovation: An analysis of citizen and government initiatives during Covid-19 outbreak in Malaysia | 10.1109/r10-htc49770.2020.9357050 | 10.1109/r10-htc49770.2020.9357050 | Conference/Paper/Proceeding/Abstract | Minoi J | Minoi, Jacey-Lynn |
| 218716 | Null tests of the concordance model in the era of Euclid and the SKA | Null tests of the concordance model in the era of Euclid and the SKA | Null tests of the concordance model in the era of Euclid and the SKA | 10.1016/j.dark.2021.100856 | 10.1016/j.dark.2021.100856 | Journal Article/Review | Bengaly Carlos A. P. | Bengaly, Carlos A.P. |
| 107482 | Quenching star formation with quasar outflows launched by trapped IR radiation | Quenching star formation with quasar outflows launched by trapped IR radiation | Quenching star formation with quasar outflows launched by trapped IR radiation | 10.1093/mnras/sty1514 | 10.1093/mnras/sty1514 | Other | Costa T | Costa, Tiago |
| 73003 | Influence of Twin Boundaries and Sample Dimensions on the Mechanical Behavior of Ag Nanowires | Influence of twin boundaries and sample dimensions on the mechanical behavior of Ag nanowires | Influence of twin boundaries and sample dimensions on the mechanical behavior of Ag nanowires | 10.1016/j.msea.2021.142150 | 10.1016/j.msea.2021.142150 | Journal Article/Review | Zhao H | Zhao, Hu |
| 18849 | Accumulation of Deep Traps at Grain Boundaries in Halide Perovskites | Accumulation of Deep Traps at Grain Boundaries in Halide Perovskites | Accumulation of Deep Traps at Grain Boundaries in Halide Perovskites | 10.1021/acsenergylett.9b00840 | 10.26434/chemrxiv.8058413.v1 | Journal Article/Review | Park J | Park, Ji-Sang |
Examples of different matches:

| | title_gtr_cr | title_cr | title_oa | doi_cr | doi_oa | type | author_gtr | author_cr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 21901 | Integrating the Use of Official Statistics into Mainstream Curricula via Data Visualisation | Integrating the use of official statistics into mainstream curricula via data visualisation | INTEGRATING THE USE OF OFFICIAL STATISTICS INTO MAINSTREAM CURRICULA VIA DATA VISUALISATION | 10.52041/srap.13602 | nan | Conference/Paper/Proceeding/Abstract | Nicholson, J. | Nicholson, James |
| 67252 | Investigation of the suitability of decellularized porcine pericardium in mitral valve reconstruction. | Investigation of the Suitability of Decellularised Porcine Pericardium for Mitral Valve Reconstruction | Investigation of the suitability of decellularized porcine pericardium in mitral valve reconstruction. | 10.5339/qproc.2012.heartvalve.4.39 | nan | Journal Article/Review | Morticelli L | Morticelli, Lucrezia |
| 191593 | Trading at the Speed of Light: How Ultrafast Algorithms Are Transforming Financial Markets | Donald MACKENZIE, Trading at the Speed of Light. How Ultrafast Algorithms Are Transforming Financial Markets, Princeton, Princeton University Press, 2021, 304 p. | Trading at the Speed of Light: How Ultrafast Algorithms Are Transforming Financial Markets | 10.3917/res.234.0237 | nan | Book | MacKenzie Donald | Duterme, Tom |
| 289901 | The Oxford Domed Lateral Implant: Increasing tibial component wall height reduces the risk of medial dislocation of the mobile bearing | The Oxford Domed Lateral Unicompartmental Knee Replacement implant: Increasing wall height reduces the risk of bearing dislocation | The Oxford domed lateral implant: Increasing tibial component wall height reduces the risk of medial dislocation of the mobile bearing | 10.1177/09544119211048558 | nan | Conference/Paper/Proceeding/Abstract | Yang I | Yang, Irene |
| 28714 | LoCuSS: Exploring the selection of faint blue background galaxies for cluster weak-lensing | LoCuSS: exploring the selection of faint blue background galaxies for cluster weak-lensing | LoCuSS: Testing hydrostatic equilibrium in galaxy clusters | 10.1093/mnras/stw2192 | 10.1093/mnrasl/slv175 | Journal Article/Review | Ziparo Felicia | Ziparo, Felicia |