ror-community / ror-roadmap

Central information about what is happening at ROR and how to contribute feedback

[BUG] Inconsistency between matching types/scores in V1 and V2 affiliation matching #243

Open adambuttrick opened 6 months ago

adambuttrick commented 6 months ago

Version v2

Describe the bug Searching for the same string in the V1 and V2 affiliation endpoints can return different match types and scores, resulting in successful matches in V1 and failed matches in V2.

To Reproduce Steps to reproduce the behavior:

  1. Search for "University of Southern California, USA" in V2 of the affiliation endpoint - https://api.ror.org/v2/organizations?affiliation=University+of+Southern+California%2C+USA
  2. https://ror.org/03taz7m60 - University of Southern California is the first result, where matching_type="PHRASE" and score=1. No other results score the same or higher.
  3. Chosen is false
  4. Search for "University of Southern California, USA" in V1 of the affiliation endpoint - https://api.ror.org/v1/organizations?affiliation=University+of+Southern+California%2C+USA
  5. https://ror.org/03taz7m60 - University of Southern California is the second result, where matching_type="COMMON TERMS" and score=0.94. One result scores higher (0.95 with matching_type="COMMON TERMS").
  6. Chosen is true
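
The steps above can be run side by side with a short script. This is a minimal sketch, not part of the ROR codebase: the helper names (`affiliation_url`, `summarize`, `compare`) are made up for illustration, and it assumes the affiliation response is a JSON object with an `items` list whose entries carry `organization.id`, `matching_type`, `score` and `chosen`.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://api.ror.org"


def affiliation_url(version, affiliation):
    """Build the affiliation-matching URL for 'v1' or 'v2'."""
    query = urllib.parse.urlencode({"affiliation": affiliation})
    return f"{BASE}/{version}/organizations?{query}"


def summarize(response, limit=3):
    """Reduce a response to (ror_id, matching_type, score, chosen) tuples.

    Assumes the response shape described in the lead-in.
    """
    return [
        (
            item["organization"]["id"],
            item["matching_type"],
            item["score"],
            item["chosen"],
        )
        for item in response.get("items", [])[:limit]
    ]


def compare(affiliation, limit=3):
    """Fetch the same affiliation string from both versions (network call)."""
    results = {}
    for version in ("v1", "v2"):
        url = affiliation_url(version, affiliation)
        with urllib.request.urlopen(url, timeout=30) as resp:
            results[version] = summarize(json.load(resp), limit)
    return results
```

Calling `compare("University of Southern California, USA")` should return the top results for each version, which makes differences in matching type, score and `chosen` easy to spot.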

Expected behavior V2 affiliation should return the same matches as V1 where the inputs are the same, as the matching logic is shared. Unclear whether scoring and match types should be the same relative to indexing.

lizkrznarich commented 4 months ago

Dev note: It's possible that this is due to differences in the content of v1 vs v2 records and is therefore not really a bug. Needs investigation.

ashwinisukale commented 2 months ago

We index the same data differently for the two versions: there is a V1 template and a V2 template.

The following factors might affect the returned results and scoring:

Field Differences:

In v1, fields like name.norm, aliases.norm, and labels.label.norm are used, whereas in v2, the field names.value.norm is used. If the same data is indexed in both versions but accessed differently in queries, this can result in different matches and scores.

Scoring Mechanism:

The function get_score calculates similarity scores differently for v1 and v2. Specifically, it accesses different fields (candidate.country.country_code for v1 and candidate.locations[0].geonames_details.country_code for v2). Even if the same data is indexed, variations in scoring logic or the fields being checked could lead to divergent results.

Country Matching:

The way countries are grouped or mapped to regions may change between v1 and v2, leading to different matching results when handling country-related data.

v1: Focuses on broad geographic or historical groupings. The US might be grouped with only North American countries (e.g., Canada and Mexico).

v2: Focuses on specific economic or trade-related groupings. The US might be treated differently, perhaps being grouped with countries based on economic regions (e.g., G7 nations).

ashwinisukale commented 2 months ago

@lizkrznarich Could you please check whether my findings make sense for the current issue?

lizkrznarich commented 2 months ago

@ashwinisukale Thanks! I'm not sure this gets at the fundamental differences between v1 and v2 results in ES queries because the data in v2 names and v1 name, aliases and labels is the same for every record (we crosswalk between the 2 and update both versions on each release). Similarly, the value in candidate.country.country_code is the same as the value in candidate.locations[0].geonames_details.country_code for every record.

What's different in v2 is that there are fewer fields and data is not repeated across different fields as much as in v1, which I suspect may result in a slightly different set of possible matches, matching types and scores even though the values of the relevant fields are the same. I'm wondering whether a COMMON TERMS match type is much more likely in v1 because name variants appear in 4 fields rather than 1. For example, "California" will be found in name and labels in https://api.ror.org/v1/organizations/03taz7m60 but only in names in https://api.ror.org/v2/organizations/03taz7m60 (we do not include labels from relationships fields in queries).
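
To make the field fan-out concrete, here is a rough sketch of how one v2 names list could spread across several v1 fields. It is not the actual crosswalk code: the function names are hypothetical, the mapping rules (ror_display to name, label to labels, alias to aliases, acronym to acronyms) are an assumption, and the sample record is illustrative rather than copied from the API.

```python
def v1_fields_from_v2_names(names):
    """Spread v2 'names' entries across v1-style fields (assumed mapping)."""
    v1 = {"name": [], "aliases": [], "labels": [], "acronyms": []}
    for entry in names:
        types = entry.get("types", [])
        value = entry["value"]
        if "ror_display" in types:
            v1["name"].append(value)
        elif "label" in types:
            v1["labels"].append(value)
        if "alias" in types:
            v1["aliases"].append(value)
        if "acronym" in types:
            v1["acronyms"].append(value)
    return v1


def fields_containing(term, fields):
    """Return the sorted field names in which the term occurs."""
    needle = term.lower()
    return sorted(
        field
        for field, values in fields.items()
        if any(needle in value.lower() for value in values)
    )


# Illustrative v2-style names list (not the real record contents).
v2_names = [
    {"value": "University of Southern California", "types": ["ror_display", "label"]},
    {"value": "Universidad del Sur de California", "types": ["label"]},
    {"value": "USC", "types": ["acronym"]},
]

v1_view = v1_fields_from_v2_names(v2_names)
v2_view = {"names": [entry["value"] for entry in v2_names]}

print(fields_containing("California", v1_view))  # ['labels', 'name']
print(fields_containing("California", v2_view))  # ['names']
```

Under these assumptions the same term is searchable in two v1 fields but only one v2 field, which is the kind of duplication that could plausibly shift matching types and scores.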

Assuming the above could be the case, I think differences in scores and matching types are acceptable. What I think needs more investigation is why chosen = false for the v2 result in the example https://api.ror.org/v2/organizations?affiliation=University+of+Southern+California%2C+USA when it had a score of 1 and there were no other closely matched results.

adambuttrick commented 2 months ago

I took a look at this again and I think this is partially a result of the logic in the get_output function. If I'm parsing correctly relative to the v2 results and matching logic:

  1. For the substring "University of Southern California", there are two matches:

    a. A PHRASE match with score 1.0 (https://ror.org/03taz7m60 - correct match)
    b. A COMMON TERMS match with score 0.95 (https://ror.org/058zz0t50 - has the alias "California Southern University", i.e. all the same terms as the input, minus "of", in a different order)

  2. get_output tries to select the best match from these two:

    • PHRASE match (score 1.0, https://ror.org/03taz7m60) is initially set as the best match.
    • COMMON TERMS match (score 0.95, https://ror.org/058zz0t50) is compared but doesn't replace it because:
      • Lower score (0.95 < 1.0)
      • Even if scores were equal, PHRASE has higher priority in type_map (4 > 3)

  3. BUT the chosen flag is only set to True if the match has score 1.0 and the match type is EXACT

  4. This condition is not met for the PHRASE match because it's the wrong type (PHRASE vs. EXACT), so despite being the best match and having a score of 1.0, it's not marked chosen=True

So, I think this at least explains why the high-scoring (1.0) result has chosen=false. The get_output function lacks a mechanism to automatically mark high-scoring non-EXACT matches as chosen. Whether we should add this logic would need to be regression tested against the Marple affiliation test datasets.
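
The selection behaviour described above can be sketched roughly as follows. This is a simplified model based only on the description in this comment, not the real get_output code: `Match`, `pick_best` and `mark_chosen` are hypothetical names, the priorities 4 (PHRASE) and 3 (COMMON TERMS) come from the steps above, and the value 5 for EXACT is an assumption.

```python
from dataclasses import dataclass

# PHRASE=4 and COMMON TERMS=3 are taken from the discussion; EXACT=5 is assumed.
TYPE_PRIORITY = {"EXACT": 5, "PHRASE": 4, "COMMON TERMS": 3}


@dataclass
class Match:
    ror_id: str
    matching_type: str
    score: float
    chosen: bool = False


def pick_best(matches):
    """Highest score wins; ties are broken by matching-type priority."""
    return max(
        matches,
        key=lambda m: (m.score, TYPE_PRIORITY.get(m.matching_type, 0)),
    )


def mark_chosen(matches):
    """Select the best match, but flag it chosen only for EXACT with score 1.0."""
    if not matches:
        return None
    best = pick_best(matches)
    best.chosen = best.score == 1.0 and best.matching_type == "EXACT"
    return best


candidates = [
    Match("https://ror.org/03taz7m60", "PHRASE", 1.0),
    Match("https://ror.org/058zz0t50", "COMMON TERMS", 0.95),
]
best = mark_chosen(candidates)
print(best.ror_id, best.chosen)  # https://ror.org/03taz7m60 False
```

In this model the PHRASE match wins the comparison but never satisfies the chosen condition, which mirrors the v2 result in the bug report.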

What's not clear is what, in the indexing or otherwise, is causing the record for the correct result (https://ror.org/03taz7m60) to be a COMMON TERMS match in v1 vs. a PHRASE match in v2. I'm uncertain what impact this would have on the final chosen match though, because it's not clear what the scores would be in v2 with a different match type.

amandafrench commented 3 weeks ago

Draft of proposal to add country subdivision fields to v2.1 of the ROR schema: https://bit.ly/ror-schema-2-1-proposal