ror-community / ror-roadmap

Central information about what is happening at ROR and how to contribute feedback

[BUG] Exact matching failing for some records in affiliation endpoint #238

Open adambuttrick opened 7 months ago

adambuttrick commented 7 months ago

Version v1, v2

Describe the bug In the affiliation endpoint, when the primary name of an organization is used as input, some names fail to return an exact match. This does not appear to be specific to records in a single release, as the examples below (all from v1.45) demonstrate.

To Reproduce Steps to reproduce the behavior:

Example 1:

  1. Query the affiliation endpoint using the display name for https://ror.org/02xhx4j26 - Centre for Marine Socioecology in v1 and v2: https://api.ror.org/v1/organizations?affiliation=Centre%20for%20Marine%20Socioecology https://api.ror.org/v2/organizations?affiliation=Centre%20for%20Marine%20Socioecology
  2. 0 results are returned in v1 and v2

Example 2:

  1. Query the affiliation endpoint using the display name for https://ror.org/00arpt780 - Institute for Marine and Antarctic Studies in v1 and v2: https://api.ror.org/v1/organizations?affiliation=Institute%20for%20Marine%20and%20Antarctic%20Studies https://api.ror.org/v2/organizations?affiliation=Institute%20for%20Marine%20and%20Antarctic%20Studies

  2. 1 result is returned in v1 and 2 results in v2, with no chosen=True in either.

Example 3:

  1. Query the affiliation endpoint using the display name for https://ror.org/026xeq875 - Tasmanian Behavioural Lab in v1 and v2: https://api.ror.org/v1/organizations?affiliation=Tasmanian%20Behavioural%20Lab https://api.ror.org/v2/organizations?affiliation=Tasmanian%20Behavioural%20Lab

  2. 1 result is returned in both v1 and v2, each with chosen=True for the ROR ID

Expected behavior Exact matches should be returned for all records when the ROR display name, or any other name on the record, is used as input.
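
The checks in the examples above could be scripted roughly as follows. This is a minimal sketch: `affiliation_url` and `chosen_ids` are hypothetical helper names, the response shape (an `items` list whose entries carry `chosen` and `organization.id`) is an assumption about the affiliation response, and `sample` is a hand-written stand-in for the Example 3 result, not captured API output.

```python
from urllib.parse import quote

API_ROOT = "https://api.ror.org"


def affiliation_url(name, version="v2"):
    """Build an affiliation query URL in the same form as the examples above."""
    return f"{API_ROOT}/{version}/organizations?affiliation={quote(name)}"


def chosen_ids(response):
    """Return the ROR IDs of all items flagged chosen in a parsed JSON response.

    Assumes the response is a dict with an "items" list (an assumption).
    """
    return [
        item["organization"]["id"]
        for item in response.get("items", [])
        if item.get("chosen")
    ]


# Hand-written sample mirroring what Example 3 reports (one chosen result).
sample = {
    "items": [
        {"chosen": True, "organization": {"id": "https://ror.org/026xeq875"}}
    ]
}

print(affiliation_url("Tasmanian Behavioural Lab", "v1"))
print(chosen_ids(sample))
```

An empty list from `chosen_ids` would correspond to the failing cases in Examples 1 and 2.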

adambuttrick commented 7 months ago

@dtkaczyk took a look at this, and it appears to be the result of our logic that maps US state names to the US as a country. Since the similarity between “marine” and “Maine” is above the matching score threshold (https://github.com/ror-community/ror-api/blob/94bad807f188f22c80a3234fcbf7b48e52f01818/rorapi/common/matching.py#L111), the strategy recognizes this affiliation as being from the US. Since the country for the correct records doesn’t match (all examples are from Australia), they're rejected. We should perhaps adjust the state matching logic to be exact rather than fuzzy to account for this.
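
To illustrate how close the two words are, here is a small sketch that uses Python's standard-library `difflib` as a stand-in for the fuzzy matcher (an assumption; it is not the library the API itself calls, and `similarity` is a hypothetical helper). It scores each token of the Example 1 affiliation against "maine" on a 0-100 scale.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Ratio-style similarity on a 0-100 scale (difflib stand-in, not the API's matcher)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Score every token of the Example 1 affiliation against the state name.
for token in "centre for marine socioecology".split():
    print(token, similarity("maine", token))
```

"maine" and "marine" share 5 characters in order ("ma" and "ine") out of 11 in total, so the ratio is 2 * 5 / 11, or about 91, which would clear a cutoff of 90.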

adambuttrick commented 7 months ago

Adding word-boundary matchers around the name value appears to fix this, e.g.:

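import re

import unidecode

# fuzz (the fuzzy-matching module) and COUNTRIES (an iterable of (code, name)
# pairs) are assumed to be available at module level, as in matching.py.


def get_country_codes(string):
    """Extract the country codes from the string,
    if the country names are mentioned."""

    string = unidecode.unidecode(string).strip()
    lower = re.sub(r"\s+", " ", string.lower())
    lower_alpha = re.sub(r"\s+", " ", re.sub("[^a-z]", " ", string.lower()))
    alpha = re.sub(r"\s+", " ", re.sub("[^a-zA-Z]", " ", string))
    codes = []
    for code, name in COUNTRIES:
        if re.search("[^a-z]", name):
            # Names containing non-letter characters: fuzzy partial match
            score = fuzz.partial_ratio(name, lower)
        elif len(name) == 2:
            # Two-letter names: compare against each token, preserving case;
            # default=0 avoids an error when the input has no alphabetic tokens
            score = max((fuzz.ratio(name.upper(), t) for t in alpha.split()), default=0)
        else:
            # Add word boundary matchers around the name
            regex = r'\b' + re.escape(name) + r'\b'
            score = 100 if re.search(regex, lower_alpha) else 0
        if score >= 90:
            codes.append(code.upper())
    return list(set(codes))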
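
The effect of the word-boundary branch can be checked in isolation. The sketch below uses only the standard library; `mentions_name` is a hypothetical helper that repeats the regex from the branch above, and the second input string is made up for illustration.

```python
import re


def mentions_name(name, text):
    """True only if `name` appears in `text` as a whole word (or whole phrase)."""
    pattern = r"\b" + re.escape(name) + r"\b"
    return re.search(pattern, text) is not None


# "marine" no longer triggers the "maine" state name.
print(mentions_name("maine", "centre for marine socioecology"))  # False
# A genuine mention of the state still matches.
print(mentions_name("maine", "university of maine orono"))  # True
```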

lizkrznarich commented 4 months ago

@adambuttrick I think improving country extraction is a good idea generally, but for this case, should we even care about country extraction for the exact match type? We do not have to include that piece for the exact match type, and it might be better not to include any additional magic beyond matching the input string to names on the ROR record.

adambuttrick commented 4 months ago

This would be fine if exact match precedes any form of country identification or substring parsing, meaning that the full affiliation string input has an exact match. I think we would still run into the issue of certain terms like "Maine" being parsed as a US state (and thus US as the country) and that conflicting with/creating fuzziness with other countries represented in the input when doing substring parsing, but correct me if I'm mistaken here.
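
A rough sketch of that ordering, under stated assumptions: `records` is a hypothetical in-memory mapping of ROR ID to the names on the record, `normalize`, `exact_match`, and `match` are illustrative names rather than functions in ror-api, and `fuzzy_fallback` stands in for whatever substring and country-aware logic runs afterwards.

```python
import re


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return re.sub(r"\s+", " ", name.strip().lower())


def exact_match(affiliation, records):
    """Return ROR IDs whose names equal the full input string; no country logic involved."""
    target = normalize(affiliation)
    return [
        ror_id
        for ror_id, names in records.items()
        if any(normalize(n) == target for n in names)
    ]


def match(affiliation, records, fuzzy_fallback):
    """Try the full-string exact match first; fall back to other strategies only on a miss."""
    hits = exact_match(affiliation, records)
    if hits:
        return hits
    return fuzzy_fallback(affiliation, records)


# Hypothetical sample data using the record from Example 1.
records = {"https://ror.org/02xhx4j26": ["Centre for Marine Socioecology"]}
print(match("Centre for Marine Socioecology", records, lambda a, r: []))
```

With this ordering, the full-string case never reaches country extraction, while longer inputs that need substring parsing would still go through the fallback and could still hit the "Maine" problem described above.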