innovation-growth-lab / dsit-impact

Enhancing the impact and team science metrics of UKRI-funded research publications.

1d. Matching to OpenAlex - OA Queries #6

Closed · ampudia19 closed 2 months ago

ampudia19 commented 2 months ago

Methodological Readme for Collecting Data from OpenAlex using DOIs and Reverse Lookups with Crossref and OpenAlex

Overview

This readme outlines the methodology for collecting data from OpenAlex using DOIs, performing reverse lookups through Crossref to recover DOIs for unmatched records, and using OpenAlex's own matching capabilities through its works search. The pipeline is implemented in Kedro, and the nodes and utilities provided are designed to handle the preprocessing, fetching, and matching tasks.

Preprocessing Steps

  1. Preprocess DOIs

    • Extract DOIs from the input data and ensure they are in a standard format using regex.
    • Function: preprocess_publication_doi(df: pd.DataFrame) -> pd.DataFrame
    • Key Decision: Only consider valid DOIs that match the pattern 10\..+.
  2. Create DOI Input List

    • Generate a list of unique DOI values from the preprocessed data.
    • Group DOIs for efficient querying if specified.
    • Function: create_list_doi_inputs(df: pd.DataFrame, **kwargs) -> list
    • Key Decision: Use grouping to reduce the number of API calls (OpenAlex accepts up to 50 OR'd filter values per call). We also parallelise requests, which are rate-limited at 10 per second; using 8 workers keeps us safely under that limit. A minimal sketch of both preprocessing steps follows this list.
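
A minimal sketch of these two preprocessing steps, assuming the input frame has a `doi` column. The function names come from this readme; the regex handling and the `grouped`/`group_size` parameters are illustrative assumptions:

```python
import re
from typing import List

import pandas as pd

# Key decision from the readme: only DOIs matching 10\..+ are considered valid.
DOI_PATTERN = re.compile(r"(10\..+)")


def preprocess_publication_doi(df: pd.DataFrame) -> pd.DataFrame:
    """Normalise the `doi` column: lowercase, strip URL prefixes, keep valid DOIs."""
    df = df.copy()
    df["doi"] = (
        df["doi"]
        .str.lower()
        .str.replace(r"^https?://(?:dx\.)?doi\.org/", "", regex=True)
        .str.extract(DOI_PATTERN, expand=False)
    )
    return df


def create_list_doi_inputs(df: pd.DataFrame, grouped: bool = True, group_size: int = 50) -> List[str]:
    """Return unique DOIs, optionally pipe-joined into batches for OR-filter queries."""
    dois = df["doi"].dropna().unique().tolist()
    if not grouped:
        return dois
    # OpenAlex accepts up to 50 values in a single filter, joined with "|" (logical OR).
    return ["|".join(dois[i:i + group_size]) for i in range(0, len(dois), group_size)]
```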

Fetching Data from OpenAlex

  1. Fetch Papers

    • Use OpenAlex API to fetch papers based on the provided DOIs.
    • Parallelise requests to optimise performance.
    • Function: fetch_papers(ids: Union[List[str], List[List[str]]], mailto: str, perpage: int, filter_criteria: Union[str, List[str]], parallel_jobs: int = 8) -> Dict[str, List[Callable]]
    • Key Decision: Limit chunks to 80 IDs per API call, as this balances query complexity against API rate limits.
  2. Concatenate OpenAlex Data

    • Combine the partitioned JSON datasets into a single DataFrame.
    • Function: concatenate_openalex(data: Dict[str, AbstractDataset]) -> pd.DataFrame
    • Key Decision: Use pandas to load and concatenate the partitioned outputs efficiently (a minimal sketch of both fetching steps follows this list).
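
A hedged sketch of the fetch-and-concatenate flow. The real fetch_papers also takes filter_criteria and returns lazy callables for Kedro partitions; here results are fetched eagerly, _fetch_chunk is a hypothetical helper built on OpenAlex's cursor pagination, and the partition key naming is illustrative:

```python
from typing import Dict, List

import pandas as pd
import requests
from joblib import Parallel, delayed


def _fetch_chunk(chunk: str, mailto: str, perpage: int) -> List[dict]:
    """Fetch all works for one pipe-joined DOI chunk, following cursor pagination."""
    results, cursor = [], "*"
    while cursor:
        url = (
            "https://api.openalex.org/works"
            f"?filter=doi:{chunk}&per-page={perpage}&cursor={cursor}&mailto={mailto}"
        )
        data = requests.get(url, timeout=30).json()
        results.extend(data.get("results", []))
        cursor = data.get("meta", {}).get("next_cursor")  # None when pages are exhausted
    return results


def fetch_papers(ids: List[str], mailto: str, perpage: int = 200, parallel_jobs: int = 8) -> Dict[str, List[dict]]:
    """Query chunks in parallel; 8 workers stays under OpenAlex's 10 requests/s limit."""
    outputs = Parallel(n_jobs=parallel_jobs, verbose=10)(
        delayed(_fetch_chunk)(chunk, mailto, perpage) for chunk in ids
    )
    return {f"p{i}": out for i, out in enumerate(outputs)}


def concatenate_openalex(data: Dict[str, List[dict]]) -> pd.DataFrame:
    """Combine the partitioned JSON outputs into a single DataFrame."""
    return pd.concat((pd.DataFrame(records) for records in data.values()), ignore_index=True)
```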

Reverse Lookup using Crossref

Match DOIs with Crossref

Key Functions and Process

  1. Setting Up the Session

    • Establish a session with retry strategies to handle transient errors and rate limits.
    • Function: setup_session()
    • Details:
      import requests
      from requests.adapters import HTTPAdapter
      from urllib3.util.retry import Retry

      retry_strategy = Retry(total=3, status_forcelist=[429, 500, 502, 503, 504], allowed_methods=["HEAD", "GET", "OPTIONS"])
      adapter = HTTPAdapter(max_retries=retry_strategy)
      session = requests.Session()
      session.mount("https://", adapter)
      session.mount("http://", adapter)
      return session
  2. Cleaning HTML Entities

    • Clean and preprocess the input records to ensure accurate matching.
    • Function: clean_html_entities(input_record: Dict[str, Union[str, int, float]]) -> Dict[str, Union[str, int, float]]
    • Details:
      from html import unescape

      return {key: unescape(value.replace("&", "and")) if isinstance(value, str) else value for key, value in input_record.items()}
  3. Formulating the Query

    • Construct a query string from the bibliographic information.
    • Function: get_doi(outcome_id: str, title: str, author: str, journal: str, publication_date: str, mailto: str, session: requests.Session) -> Dict[str, str]
    • Details:
      query = f"{title}, {author}, {journal}, {publication_date}"
      url = f'https://api.crossref.org/works?query.bibliographic="{query}"&mailto={mailto}&rows=5'
  4. Fetching and Processing Results

    • Send the query to the Crossref API and process the returned results.
    • Extract relevant information and calculate fuzzy scores for matching.
    • Function: _process_item(item: Dict[str, Union[str, Dict[str, str]]], title: str, author: str, journal: str, publication_date: str) -> Union[Dict[str, Union[str, int, float]], None]
    • Details:
      # `year` is the publication year parsed from the Crossref item (parsing not shown in this excerpt)
      cr = {
       "title": item["title"][0],
       "author": f"{item['author'][0]['family']}, {item['author'][0]['given']}",
       "journal": item["container-title"][0],
       "year": year,
       "doi": item["DOI"].lower(),
       "score": item["score"],
       "year_diff": abs(year - int(publication_date[:4])),
      }
      fuzzy_scores = [
       fuzz.token_set_ratio(title.lower(), cr["title"].lower()),
       fuzz.token_set_ratio(author.lower(), cr["author"].lower()),
      ]
      if journal:
       fuzzy_scores.append(fuzz.token_set_ratio(journal, cr["journal"]))
      cr["fuzzy_score"] = sum(fuzzy_scores) / len(fuzzy_scores)
      return cr if cr["year_diff"] <= 1 else None
  5. Selecting the Best Match

    • From the processed results, select the best match based on a composite score.
    • Function: _select_best_match(outcome_id: str, matches: List[Dict[str, Union[str, int, float]]]) -> Union[Dict[str, Union[str, int, float]], None]
    • Details:
      best_match = None
      highest_score = 0
      for match in matches:
       cr_score = match["score"]
       cr_fuzzy_score = match["fuzzy_score"]
       composite_score = cr_score + cr_fuzzy_score
       if all([composite_score > highest_score, cr_score > 60, cr_fuzzy_score > 60]):
           highest_score = composite_score
           match["outcome_id"] = outcome_id
           best_match = match
      return best_match
  6. Batch Processing

    • Process records in batches to handle large datasets efficiently.
    • Function: crossref_doi_match(oa_data: pd.DataFrame, gtr_data: pd.DataFrame, mailto: str) -> Generator[Dict[str, pd.DataFrame], None, None]
    • Details:
      unmatched_data = gtr_data[~gtr_data["doi"].isin(oa_data["doi"])]
      inputs = unmatched_data[["outcome_id", "title", "author", "journal_title", "publication_date"]].to_dict(orient="records")
      cleaned_inputs = [clean_html_entities(record) for record in inputs]
      input_batches = [cleaned_inputs[i: i + 250] for i in range(0, len(cleaned_inputs), 250)]
      for i, batch in enumerate(input_batches):
       session = setup_session()
       results = Parallel(n_jobs=4, verbose=10)(delayed(get_doi)(x["outcome_id"], x["title"], x["author"], x["journal_title"], x["publication_date"], mailto, session) for x in batch)
       results = [r for r in results if r]
       df = pd.DataFrame(results)
       if df.empty:
           continue
       df = df.merge(pd.DataFrame(batch), on="outcome_id", how="right", suffixes=("_cr", "_gtr"))
       yield {f"s{i}": df}
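
Because crossref_doi_match is a generator yielding one small dictionary per batch, each batch can presumably be persisted as its own partition (e.g. a Kedro PartitionedDataset entry keyed s0, s1, ...), so a long run checkpoints every 250 records instead of holding all results in memory.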

Reverse Lookup using OpenAlex

Search and Match in OpenAlex

Key Functions and Process

  1. Cleaning HTML Entities for OpenAlex

    • Clean and preprocess input records for OpenAlex searches.
    • Function: clean_html_entities_for_oa(input_record: Dict[str, Union[str, int, float]]) -> Dict[str, Union[str, int, float, Dict[str, str]]]
    • Details:
      return {key: (value if key in ["outcome_id", "author", "publication_date"] else _process_string(value) if isinstance(value, str) else value) for key, value in input_record.items()}
  2. Formulating the Query

    • Construct queries for searching titles and authors in OpenAlex.
    • Function: get_oa_match(outcome_id: str, title: Union[str, List[str]], chapter_title: str, author: str, publication_date: str, config: Dict[str, str], session: requests.Session) -> List[Dict[str, str]]
    • Details:
       display_titles = title if not chapter_title else chapter_title
       if isinstance(display_titles, str):  # guard: a bare string would otherwise be iterated character by character
           display_titles = [display_titles]
      mailto = config["mailto"]
      candidate_outputs = []
      for candidate_title in display_titles:
       query = f"{candidate_title}"
       url = f"https://api.openalex.org/works?filter=title.search:{query}&mailto={mailto}&per-page=25"
        max_retries = 5
        attempts = 0
        while attempts < max_retries:
            attempts += 1
            try:
                response = session.get(url, timeout=20)
                data = response.json()
                results = data.get("results", [])
                candidate_outputs.append(results)
                break
            except KeyError as e:
                logging.warning("Missing key: %s", e)
            except Exception as e:
                logging.warning("Error fetching data: %s", e)
  3. Fuzzy Matching Authors

    • Use fuzzy matching to find authors with similar names.
    • Function: author_fuzzy_match(author: str, candidate_author: List[Dict[str, Union[str, Dict[str, str]]]])
    • Details:
      candidate_author_name = candidate_author["display_name"]
      author = " ".join([word for word in author.split() if len(word) > 1])
      fuzzy_score = fuzz.token_set_ratio(author.lower(), candidate_author_name.lower())
      if fuzzy_score >= 75:
       return candidate_author
      return None
  4. Filtering Candidates by Author and Date

    • Filter the candidates to retain those with matching authors and publication dates within a 2-year difference.
    • Function: get_oa_match(outcome_id: str, title: Union[str, List[str]], chapter_title: str, author: str, publication_date: str, config: Dict[str, str], session: requests.Session) -> List[Dict[str, str]]
    • Details:
      matching_author = []
      for candidate_output in candidate_flat:
       authorships = candidate_output["authorships"]
       for authorship in authorships:
           candidate_author = authorship["author"]
           matched_author = author_fuzzy_match(author, candidate_author)
           if matched_author:
               matching_author.append(candidate_output)
      matching_date = []
      if matching_author and publication_date:
       for candidate_output in matching_author:
           publication_year = candidate_output["publication_year"]
           year_diff = abs(int(publication_date[:4]) - int(publication_year))
           if year_diff <= 2:
               matching_date.append(candidate_output)

Key Utilities and Functions

  1. OpenAlex Utilities (oa.py)

    • Functions for fetching and processing data from OpenAlex.
    • Include _revert_abstract_index, _parse_results, preprocess_ids, _chunk_oa_ids, _works_generator, fetch_papers_for_id, and json_loader (a sketch of _revert_abstract_index follows this list).
  2. Crossref Utilities (cr.py)

    • Functions for reverse lookup and matching using Crossref.
    • Include _process_item, clean_html_entities, _select_best_match, setup_session, and get_doi.
  3. OpenAlex Matching Utilities (oa_match.py)

    • Functions for reverse lookup and matching using OpenAlex.
    • Include _process_string, clean_html_entities_for_oa, get_oa_match, and author_fuzzy_match.
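
As an illustration of these helpers: OpenAlex serves abstracts as an inverted index (each token mapped to its positions) rather than plain text, so _revert_abstract_index presumably flattens that structure back into a string. A minimal sketch under that assumption, not the repository's actual code:

```python
from typing import Dict, List


def _revert_abstract_index(abstract_inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild abstract text from OpenAlex's abstract_inverted_index field."""
    # Invert token -> positions into position -> token, then join in positional order.
    positions = {
        pos: token
        for token, pos_list in abstract_inverted_index.items()
        for pos in pos_list
    }
    return " ".join(token for _, token in sorted(positions.items()))


# Example: {"Deep": [0], "learning": [1], "works": [2]} -> "Deep learning works"
```
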
ampudia19 commented 2 months ago
Current results:

| title_cr_null | Non-Null | Null | Total |
| --- | --- | --- | --- |
| Non-Null | 15199 | 44809 | 60008 |
| Null | 61840 | 197841 | 259681 |
| Total | 77039 | 242650 | 319689 |
Success rates, by publication type:

| type | False | True |
| --- | --- | --- |
| Book | 0.616126 | 0.383874 |
| Book Chapter | 0.697857 | 0.302143 |
| Book edited | 0.754649 | 0.245351 |
| Conference Proceeding | 1 | 0 |
| Conference/Paper/Proceeding/Abstract | 0.763496 | 0.236504 |
| Consultancy Report | 0.89703 | 0.10297 |
| Data Set | 0.571429 | 0.428571 |
| Journal | 1 | 0 |
| Journal Article/Review | 0.477465 | 0.522535 |
| Manual/Guide | 0.933628 | 0.0663717 |
| Monograph | 0.80241 | 0.19759 |
| Other | 0.588269 | 0.411731 |
| Policy briefing/Report | 0.896172 | 0.103828 |
| Preprint | 0.501253 | 0.498747 |
| Report | 0.828767 | 0.171233 |
| Scholarly edition | 0.935252 | 0.0647482 |
| Systematic review | 0.768519 | 0.231481 |
| Technical Report | 0.8958 | 0.1042 |
| Technical Standard | 0.884211 | 0.115789 |
| Thesis | 0.841305 | 0.158695 |
| Working Paper | 0.71762 | 0.28238 |
| journal-issue | 0.333333 | 0.666667 |
| patent | 1 | 0 |
| All | 0.618711 | 0.381289 |
ampudia19 commented 2 months ago
Instances where both datasets match on something:

| title_cr_eq_title_oa | count |
| --- | --- |
| Equal | 13158 |
| Not Equal | 2041 |
Examples of matching:

| | title_gtr_cr | title_cr | title_oa | doi_cr | doi_oa | type | author_gtr | author_cr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 82004 | Nudge Theory and Social Innovation: An Analysis of Citizen and Government Initiatives during Covid-19 outbreak in Malaysia. | Nudge Theory and Social Innovation: An analysis of citizen and government initiatives during Covid-19 outbreak in Malaysia | Nudge Theory and Social Innovation: An analysis of citizen and government initiatives during Covid-19 outbreak in Malaysia | 10.1109/r10-htc49770.2020.9357050 | 10.1109/r10-htc49770.2020.9357050 | Conference/Paper/Proceeding/Abstract | Minoi J | Minoi, Jacey-Lynn |
| 218716 | Null tests of the concordance model in the era of Euclid and the SKA | Null tests of the concordance model in the era of Euclid and the SKA | Null tests of the concordance model in the era of Euclid and the SKA | 10.1016/j.dark.2021.100856 | 10.1016/j.dark.2021.100856 | Journal Article/Review | Bengaly Carlos A. P. | Bengaly, Carlos A.P. |
| 107482 | Quenching star formation with quasar outflows launched by trapped IR radiation | Quenching star formation with quasar outflows launched by trapped IR radiation | Quenching star formation with quasar outflows launched by trapped IR radiation | 10.1093/mnras/sty1514 | 10.1093/mnras/sty1514 | Other | Costa T | Costa, Tiago |
| 73003 | Influence of Twin Boundaries and Sample Dimensions on the Mechanical Behavior of Ag Nanowires | Influence of twin boundaries and sample dimensions on the mechanical behavior of Ag nanowires | Influence of twin boundaries and sample dimensions on the mechanical behavior of Ag nanowires | 10.1016/j.msea.2021.142150 | 10.1016/j.msea.2021.142150 | Journal Article/Review | Zhao H | Zhao, Hu |
| 18849 | Accumulation of Deep Traps at Grain Boundaries in Halide Perovskites | Accumulation of Deep Traps at Grain Boundaries in Halide Perovskites | Accumulation of Deep Traps at Grain Boundaries in Halide Perovskites | 10.1021/acsenergylett.9b00840 | 10.26434/chemrxiv.8058413.v1 | Journal Article/Review | Park J | Park, Ji-Sang |
Examples of different matches:

| | title_gtr_cr | title_cr | title_oa | doi_cr | doi_oa | type | author_gtr | author_cr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 21901 | Integrating the Use of Official Statistics into Mainstream Curricula via Data Visualisation | Integrating the use of official statistics into mainstream curricula via data visualisation | INTEGRATING THE USE OF OFFICIAL STATISTICS INTO MAINSTREAM CURRICULA VIA DATA VISUALISATION | 10.52041/srap.13602 | nan | Conference/Paper/Proceeding/Abstract | Nicholson, J. | Nicholson, James |
| 67252 | Investigation of the suitability of decellularized porcine pericardium in mitral valve reconstruction. | Investigation of the Suitability of Decellularised Porcine Pericardium for Mitral Valve Reconstruction | Investigation of the suitability of decellularized porcine pericardium in mitral valve reconstruction. | 10.5339/qproc.2012.heartvalve.4.39 | nan | Journal Article/Review | Morticelli L | Morticelli, Lucrezia |
| 191593 | Trading at the Speed of Light: How Ultrafast Algorithms Are Transforming Financial Markets | Donald MACKENZIE, Trading at the Speed of Light. How Ultrafast Algorithms Are Transforming Financial Markets, Princeton, Princeton University Press, 2021, 304 p. | Trading at the Speed of Light: How Ultrafast Algorithms Are Transforming Financial Markets | 10.3917/res.234.0237 | nan | Book | MacKenzie Donald | Duterme, Tom |
| 289901 | The Oxford Domed Lateral Implant: Increasing tibial component wall height reduces the risk of medial dislocation of the mobile bearing | The Oxford Domed Lateral Unicompartmental Knee Replacement implant: Increasing wall height reduces the risk of bearing dislocation | The Oxford domed lateral implant: Increasing tibial component wall height reduces the risk of medial dislocation of the mobile bearing | 10.1177/09544119211048558 | nan | Conference/Paper/Proceeding/Abstract | Yang I | Yang, Irene |
| 28714 | LoCuSS: Exploring the selection of faint blue background galaxies for cluster weak-lensing | LoCuSS: exploring the selection of faint blue background galaxies for cluster weak-lensing | LoCuSS: Testing hydrostatic equilibrium in galaxy clusters | 10.1093/mnras/stw2192 | 10.1093/mnrasl/slv175 | Journal Article/Review | Ziparo Felicia | Ziparo, Felicia |