Open adambuttrick opened 6 months ago
Dev note: It's possible that this is due to difference in the content of v1 vs v2 records and is therefore not really a bug. Needs investigation.
We index the same data differently for each version, using a V1 template and a V2 template.
The following factors might affect the returned results and scoring:
Field Differences:
In v1, fields like name.norm, aliases.norm, and labels.label.norm are used, whereas in v2, the field names.value.norm is used. If the same data is indexed in both versions but accessed differently in queries, this can result in different matches and scores.
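To make the field difference concrete, here is a minimal sketch of how a name query might target different fields per version. The field names come from the comment above; the helper names (name_fields, build_name_query) and the bool/should query shape are illustrative assumptions, not the actual ROR API query builder.

```python
# Hypothetical helpers for illustration; not taken from the ROR API codebase.
V1_NAME_FIELDS = ["name.norm", "aliases.norm", "labels.label.norm"]
V2_NAME_FIELDS = ["names.value.norm"]


def name_fields(version):
    """Return the normalized name fields queried for a schema version."""
    if version == "v1":
        return list(V1_NAME_FIELDS)
    if version == "v2":
        return list(V2_NAME_FIELDS)
    raise ValueError(f"unknown version: {version!r}")


def build_name_query(text, version):
    """Build an Elasticsearch-style query body as a plain dict.

    One match clause is emitted per name field, so v1 produces three
    clauses and v2 produces one for the same input text.
    """
    clauses = [{"match": {field: text}} for field in name_fields(version)]
    return {"query": {"bool": {"should": clauses}}}
```

With the same input, the v1 body carries three should clauses and the v2 body carries one, which is one way the same indexed values could still yield different candidate sets and scores.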
Scoring Mechanism:
The function get_score calculates similarity scores differently for v1 and v2. Specifically, it accesses different fields (candidate.country.country_code for v1 and candidate.locations[0].geonames_details.country_code for v2). Even if the same data is indexed, variations in scoring logic or the fields being checked could lead to divergent results.
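The two access paths can be sketched as a single version-aware lookup. This is only an illustration: it assumes candidates are plain dicts shaped like the field paths quoted above, and get_country_code is a hypothetical helper rather than part of get_score.

```python
def get_country_code(candidate, version):
    """Read the country code from a candidate record (assumed to be a dict).

    v1 path: candidate["country"]["country_code"]
    v2 path: candidate["locations"][0]["geonames_details"]["country_code"]
    """
    if version == "v1":
        return candidate["country"]["country_code"]
    if version == "v2":
        return candidate["locations"][0]["geonames_details"]["country_code"]
    raise ValueError(f"unknown version: {version!r}")


# Made-up sample records that mirror the two schema shapes.
v1_record = {"country": {"country_code": "US"}}
v2_record = {"locations": [{"geonames_details": {"country_code": "US"}}]}
```

Comparing the output of both paths for the same ROR ID is a quick way to check whether the country lookup itself contributes to a score difference.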
Country Matching:
The way countries are grouped or mapped to regions may change between v1 and v2, leading to different matching results when handling country-related data.
v1: Focuses on broad geographic or historical groupings. The US might be grouped with only North American countries (e.g., Canada and Mexico).
v2: Focuses on specific economic or trade-related groupings. The US might be treated differently, perhaps being grouped with countries based on economic regions (e.g., G7 nations).
@lizkrznarich Could you please verify whether my findings make sense for the current issue?
@ashwinisukale Thanks! I'm not sure this gets at the fundamental differences between v1 and v2 results in ES queries because the data in v2 names and v1 name, aliases and labels is the same for every record (we crosswalk between the 2 and update both versions on each release). Similarly, the value in candidate.country.country_code is the same as the value in candidate.locations[0].geonames_details.country_code for every record.
What's different in v2 is that there are fewer fields and data is not repeated across different fields as much as in v1, which I suspect may result in a slightly different set of possible matches, matching types and scores even though the values of the relevant fields are the same. I'm wondering whether a COMMON TERMS match type is much more likely in v1 because name variants appear in 4 fields rather than 1. For example, "California" will be found in name and labels in https://api.ror.org/v1/organizations/03taz7m60 but only in names in https://api.ror.org/v2/organizations/03taz7m60 (we do not include labels from relationships fields in queries).
Assuming the above could be the case, I think differences in scores and matching types are acceptable. What I think needs more investigation is why chosen = false for the v2 result in the example https://api.ror.org/v2/organizations?affiliation=University+of+Southern+California%2C+USA when it had a score of 1 and no other closely matched results.
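For side-by-side reproduction, the two request URLs can be built from the same affiliation string. This is a small sketch: affiliation_url is a hypothetical helper, and the v1 URL is assumed to follow the same pattern as the v2 example above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.ror.org"


def affiliation_url(affiliation, version):
    """Build an affiliation-matching URL for the given API version."""
    query = urlencode({"affiliation": affiliation})
    return f"{BASE_URL}/{version}/organizations?{query}"


affiliation = "University of Southern California, USA"
v1_url = affiliation_url(affiliation, "v1")
v2_url = affiliation_url(affiliation, "v2")
```

Fetching both URLs and comparing the matching type, score and chosen values of each returned item would show where the two versions diverge for a given input.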
I took a look at this again and I think this is partially a result of the logic in the get_output function. If I'm parsing correctly relative to the v2 results and matching logic:
For the substring "University of Southern California", there are two matches:
a. A PHRASE match with score 1.0 (https://ror.org/03taz7m60 - Correct Match)
b. A COMMON TERMS match with score 0.95 (https://ror.org/058zz0t50 - Has the alias California Southern University, i.e. all the same terms, minus "of", as the input in a different order)
get_output tries to select the best match from these two:
The PHRASE match ranks higher in type_map (4 > 3), BUT the chosen flag is only set to True if the match has score 1.0 and the match type is EXACT.
This condition is not met for the PHRASE match because it's the wrong type (PHRASE vs. EXACT), so despite being the best match and having a score of 1.0, it's not marked chosen=True.
So, I think this at least explains why the high-scoring (1.0) result has chosen=false. The get_output function lacks a mechanism to automatically mark high-scoring non-EXACT matches as chosen. Whether we should add this logic would need to be regression tested against the Marple affiliation test datasets.
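The behavior described above can be sketched in a few lines. This is a simplified illustration, not the actual get_output code: the Match class and pick_best are hypothetical, the PHRASE and COMMON TERMS ranks (4 and 3) come from the comment above, and the EXACT rank of 5 and the rank-then-score ordering are assumptions.

```python
from dataclasses import dataclass, replace

# PHRASE=4 and COMMON TERMS=3 are quoted above; EXACT=5 is an assumption.
TYPE_MAP = {"EXACT": 5, "PHRASE": 4, "COMMON TERMS": 3}


@dataclass(frozen=True)
class Match:
    ror_id: str
    matching_type: str
    score: float
    chosen: bool = False


def pick_best(matches):
    """Pick the best match by type rank, then score (ordering is assumed).

    chosen is set only when the best match has score 1.0 AND type EXACT,
    mirroring the condition described in this comment.
    """
    best = max(matches, key=lambda m: (TYPE_MAP[m.matching_type], m.score))
    is_chosen = best.score == 1.0 and best.matching_type == "EXACT"
    return replace(best, chosen=is_chosen)


candidates = [
    Match("https://ror.org/03taz7m60", "PHRASE", 1.0),
    Match("https://ror.org/058zz0t50", "COMMON TERMS", 0.95),
]
best = pick_best(candidates)
```

Here best is the PHRASE match for https://ror.org/03taz7m60, yet best.chosen stays False because the type is not EXACT. Relaxing the condition (for example, accepting any match type at score 1.0) would flip the flag, which is the change that would need regression testing.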
What's not clear is what in the indexing or otherwise is causing the record for the correct result (https://ror.org/03taz7m60) to be a COMMON TERMS match in v1 vs. a PHRASE match in v2. I'm uncertain what impact this would have on the final chosen match though, because it's not clear what the scores would be in v2 with the different match type.
Draft of proposal to add country subdivision fields to v2.1 of the ROR schema: https://bit.ly/ror-schema-2-1-proposal
Version v2
Describe the bug Searching for the same string in the V1 and V2 affiliation endpoints can return different match types and scores, resulting in successful matches in V1 and failed matches in V2.
To Reproduce Steps to reproduce the behavior:
Expected behavior V2 affiliation should return the same matches as V1 where the inputs are the same, as the matching logic is shared. Unclear whether scoring and match types should be the same relative to indexing.