openaire / iis

Information Inference Service of the OpenAIRE system
Apache License 2.0

Deficient affiliation matched to organization has high match strength #1245

przemyslawjacewicz closed this issue 3 years ago

przemyslawjacewicz commented 3 years ago

This task is a subtask of parent task #1129.

A deficient affiliation extracted by CERMINE for the article with id 50|dedup_wf_001::6e9c8dd0ff2c33c49a207b43a3bdcae6 is matched to organizations with high match strength.

The article has 3 affiliations extracted by CERMINE:

[
  {
    "organization": "Department of Medical Oncology, The s econd a ffiliated h ospital of Zhejiang University s chool of Medicine",
    "countryname": null,
    "countrycode": null,
    "address": "h angzhou, Zhejiang Province, People's republic of china",
    "rawtext": "Department of Medical Oncology, The s econd a ffiliated h ospital of Zhejiang University s chool of Medicine, h angzhou, Zhejiang Province, People's republic of china"
  },
  {
    "organization": "cancer institute, Key laboratory of cancer Prevention and intervention, chinese n ational Ministry of education",
    "countryname": null,
    "countrycode": null,
    "address": "h angzhou, Zhejiang Province, People's republic of china",
    "rawtext": "cancer institute, Key laboratory of cancer Prevention and intervention, chinese n ational Ministry of education, h angzhou, Zhejiang Province, People's republic of china"
  },
  {
    "organization": "nlr",
    "countryname": null,
    "countrycode": null,
    "address": "Mlr, WBc, neutrophil, monocyte",
    "rawtext": "nlr, Mlr, WBc, neutrophil, monocyte"
  }
]

Affiliation number 3 is clearly deficient: the organization field holds only an abbreviation, and the other fields hold data unrelated to any affiliation metadata. Yet the affiliation matching module matches this document to organizations with high match strength, as can be seen in this table showing a join of affiliation matching results with organizations:

+--------------------+--------------------+-------------+--------------------+--------------------+---------+------------------+-----------+--------------------+
|          documentId|      organizationId|matchStrength|                  id|                name|shortName|       countryName|countryCode|          websiteUrl|
+--------------------+--------------------+-------------+--------------------+--------------------+---------+------------------+-----------+--------------------+
|50|dedup_wf_001::...|20|dedup_wf_001::...|        0.962|20|dedup_wf_001::...|STICHTING NATIONA...|      NLR|       Netherlands|         NL|   http://www.nlr.nl|
|50|dedup_wf_001::...|20|opendoar____::...|        0.962|20|opendoar____::...|Netherlands Aeros...|      NLR|       Netherlands|         NL|  http://www.nlr.nl/|
|50|dedup_wf_001::...|20|dedup_wf_001::...|        0.962|20|dedup_wf_001::...|National Library ...|      NLR|Russian Federation|         RU|http://www.nlr.ru...|
|50|dedup_wf_001::...|20|dedup_wf_001::...|   0.99623805|20|dedup_wf_001::...|    Cancer Institute|      WIA|             India|         IN|http://cancerinst...|
|50|dedup_wf_001::...|20|dedup_wf_001::...|        0.962|20|dedup_wf_001::...|Norsk Landbruksrå...|      NLR|            Norway|         NO|  http://www.nlr.no/|
|50|dedup_wf_001::...|20|dedup_wf_001::...|        0.962|20|dedup_wf_001::...|National Aerospac...|      NLR|       Netherlands|         NL|  http://www.nlr.nl/|
|50|dedup_wf_001::...|20|dedup_wf_001::...|        0.962|20|dedup_wf_001::...|North Little Rock...|      NLR|     United States|         US|  http://nlr.ar.gov/|
|50|dedup_wf_001::...|20|dedup_wf_001::...|        0.962|20|dedup_wf_001::...|Netherlands Lepro...|      NLR|       Netherlands|         NL|http://leprosyrel...|
+--------------------+--------------------+-------------+--------------------+--------------------+---------+------------------+-----------+--------------------+
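The notion of a "deficient" affiliation used above can be sketched as a simple check. This is only an illustration, not the logic of the affiliation matching module: the function name `is_deficient` and the `max_abbrev_len` threshold are hypothetical, and the rule (a short single-token organization with no country information) is an assumption based on the example.

```python
def is_deficient(affiliation, max_abbrev_len=5):
    """Heuristic sketch: flag an affiliation whose organization is empty,
    or is a short single token with no country information.

    The threshold and the rule itself are assumptions for illustration.
    """
    # Field names differ in case between the examples (countryname vs countryName).
    fields = {key.lower(): value for key, value in affiliation.items()}
    org = (fields.get("organization") or "").strip()
    has_country = bool(fields.get("countryname") or fields.get("countrycode"))
    looks_like_abbrev = 0 < len(org) <= max_abbrev_len and " " not in org
    return not org or (looks_like_abbrev and not has_country)
```

Under these assumptions, the third affiliation above (organization "nlr", no country name or code) is flagged, while the first two are not.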

We should check why a deficient affiliation such as the one above has a high match strength when matched to an organization. To do that, we should check the training set for similar cases, that is, matches between affiliations and organizations where only the abbreviations match, and see how this influences the match strength assigned to the match.
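A rough sketch of such a check over the training set could look like the following. The record layout is an assumption: `affOrganization` is a hypothetical key for the affiliation's organization field, while `name`, `shortName` and `matchStrength` follow the column names in the table above. The training set's real format may differ.

```python
from statistics import mean


def is_abbreviation_only_match(aff_org, org_name, org_short_name):
    """True when the affiliation text equals the organization's short name
    (ignoring case) but not its full name."""
    aff = (aff_org or "").strip().casefold()
    short = (org_short_name or "").strip().casefold()
    full = (org_name or "").strip().casefold()
    return bool(aff) and aff == short and aff != full


def summarize_abbreviation_matches(matches):
    """Count abbreviation-only matches and average their match strength."""
    strengths = [
        m["matchStrength"]
        for m in matches
        if is_abbreviation_only_match(m["affOrganization"], m["name"], m["shortName"])
    ]
    return {
        "count": len(strengths),
        "mean_strength": mean(strengths) if strengths else None,
    }
```

Comparing this summary with the same statistics for all other matches would show whether abbreviation-only matches are systematically assigned high strength.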

A possible outcome of this task is a change in the match strength assigned to matches between such affiliations and organizations. This must, however, be done with caution so as not to remove any valid matches. It is also possible that this task will not result in any changes, because any change would remove a high number of valid matches. The result could then be an implementation change, which is task #1244.

The Zeppelin note with data for this task is https://iis-cdh5-test-gw.ocean.icm.edu.pl/zeppelin/#/notebook/2G2QUYZRF .

przemyslawjacewicz commented 3 years ago

Adding another example of high match strength, this time for a generic affiliation. The article with id 50|dedup_wf_001::52311b1d779a5c4224997fbe0ceb3d52 has 5 affiliations:

{
  "affiliations": [
    {
      "organization": "Athena Institute, Faculty of Science, VU University Amsterdam",
      "countryName": "Netherlands",
      "countryCode": "NL",
      "address": "Amsterdam",
      "rawText": "Athena Institute, Faculty of Science, VU University Amsterdam, Amsterdam, Netherlands"
    },
    {
      "organization": "Facultad Ciencias de la Salud, Universidad Metropolitana",
      "countryName": "Colombia",
      "countryCode": "CO",
      "address": "Barranquilla",
      "rawText": "Facultad Ciencias de la Salud, Universidad Metropolitana, Barranquilla, Colombia"
    },
    {
      "organization": "America de Sur, DAHW Deutsche Lepra- und Tuberkulosehilfe",
      "countryName": "Colombia",
      "countryCode": "CO",
      "address": "Bogota",
      "rawText": "America de Sur, DAHW Deutsche Lepra- und Tuberkulosehilfe, Bogota, Colombia"
    },
    {
      "organization": "NLR",
      "countryName": "Netherlands",
      "countryCode": "NL",
      "address": "Amsterdam",
      "rawText": "NLR, Amsterdam, Netherlands"
    },
    {
      "organization": "Sciensano",
      "countryName": "BELGIUM",
      "countryCode": "BE",
      "rawText": "Sciensano, BELGIUM"
    }
  ]
}

The NLR organization is generic, yet it is matched to 4 organizations with the NLR short name with the highest match strength:

+--------------------+--------------------+-------------+--------------------+--------------------+--------------------+-----------+-----------+--------------------+
|          documentId|      organizationId|matchStrength|                  id|                name|           shortName|countryName|countryCode|          websiteUrl|
+--------------------+--------------------+-------------+--------------------+--------------------+--------------------+-----------+-----------+--------------------+
|50|dedup_wf_001::...|20|dedup_wf_001::...|          1.0|20|dedup_wf_001::...|STICHTING NATIONA...|                 NLR|Netherlands|         NL|   http://www.nlr.nl|
|50|dedup_wf_001::...|20|opendoar____::...|          1.0|20|opendoar____::...|Netherlands Aeros...|                 NLR|Netherlands|         NL|  http://www.nlr.nl/|
|50|dedup_wf_001::...|20|dedup_wf_001::...|          1.0|20|dedup_wf_001::...|National Aerospac...|                 NLR|Netherlands|         NL|  http://www.nlr.nl/|
|50|dedup_wf_001::...|20|grid________::...|    0.9999871|20|grid________::...|Universidad Metro...|Universidad Metro...|   Colombia|         CO|                    |
|50|dedup_wf_001::...|20|dedup_wf_001::...|          1.0|20|dedup_wf_001::...|Netherlands Lepro...|                 NLR|Netherlands|         NL|http://leprosyrel...|
+--------------------+--------------------+-------------+--------------------+--------------------+--------------------+-----------+-----------+--------------------+

przemyslawjacewicz commented 3 years ago

For reference: the initial case is going to be added to the quality test set of affiliation matching. This will lower the match strength of such matches. The second case is different because the matched affiliation is not deficient: its country name and code are not missing, so the match is justified. The affiliation matching procedure has no way to differentiate between the matches because the affiliation is so generic. Looking at the whole affiliation list, we can say that the proper match is with Netherlands Leprosy Relief, because affiliation number 3 mentions 'Deutsche Lepra- und Tuberkulosehilfe'. But the affiliation matching procedure cannot use this information. The only solution is probably to blacklist this match.
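Blacklisting could be sketched as a post-filter over the match results. This is a minimal illustration only: the function name `drop_blacklisted` and the idea of keying the blacklist by (document id, organization id) pairs are assumptions, and the keys `documentId` and `organizationId` follow the column names in the tables above.

```python
def drop_blacklisted(matches, blacklist):
    """Return the matches whose (documentId, organizationId) pair
    is not present in the blacklist set."""
    return [
        match
        for match in matches
        if (match["documentId"], match["organizationId"]) not in blacklist
    ]
```

Keying by the pair removes only the specific unwanted matches and leaves other matches for the same document or the same organization untouched.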