wcmc-its / ReCiter

ReCiter: an enterprise open source author disambiguation system for academic institutions
Apache License 2.0

Create nameScoringStrategy #214

Closed paulalbert1 closed 5 years ago

paulalbert1 commented 6 years ago

Background

The goal of this scoring strategy is to produce a reliable score for how closely any of the names in the Identity table match the targetAuthor's name as indexed in the article.

Sample data

PubMed

<Author ValidYN="Y">
<LastName>Chen</LastName>
<ForeName>Kang</ForeName>
<Initials>K</Initials>
</Author>

Scopus

<author seq="1">
<author-url>https://api.elsevier.com/content/author/author_id/8938650800</author-url>
<authid>8938650800</authid>
<authname>Smith C.</authname>
<surname>Smith</surname>
<given-name>Catherine C.</given-name>
<initials>C.C.</initials>
</author>

Intended output

The goal is to be able to return something in the feature-generator that looks like this...

    "authorNameEvidence": {
        "institutionalAuthorName": {
            "firstName": "Curtis",
            "firstInitial": "C",
            "middleName": "Del",
            "middleInitial": "D",
            "lastName": "Cole"
        },
        "articleAuthorName": {
            "firstName": "Curtis",
            "lastName": "Del Cole"
        },
        "nameMatchFirstType": "full-exact",
        "nameMatchFirstScore": 2,
        "nameMatchMiddleType": "inferredInitials-exact",
        "nameMatchMiddleScore": 1,
        "nameMatchLastType": "full-exact",
        "nameMatchLastScore": 2,
        "nameMatchModifier": "combinedMiddleNameLastName",
        "nameMatchModifierScore": 1
    },

The scoring lookup table for this and other features needs to be stored in a single location such as application.properties. Use your judgment about formatting. Here's one option. Note that each row has a variable, a string value, and a numeric value.

nameMatchFirstType   full-exact  2
nameMatchFirstType   inferredInitials-exact  1
nameMatchFirstType   full-fuzzy  0
nameMatchFirstType   noMatch  -1
nameMatchFirstType   full-conflictingAllButInitials  -2
nameMatchFirstType   full-conflictingEntirely  -3
nameMatchFirstType   nullTargetAuthor-MatchNotAttempted  -3
nameMatchLastType    full-exact  2
nameMatchLastType    full-fuzzy  1
nameMatchLastType    full-conflictingEntirely  -3
nameMatchLastType    nullTargetAuthor-MatchNotAttempted  -3
nameMatchMiddleType  full-exact  2
nameMatchMiddleType  exact-singleInitial  1.5
nameMatchMiddleType  inferredInitials-exact  1
nameMatchMiddleType  noMatch  0
nameMatchMiddleType  full-fuzzy  0
nameMatchMiddleType  full-conflictingEntirely  -2
nameMatchMiddleType  nullTargetAuthor-MatchNotAttempted  -2
nameMatchMiddleType  identityNull-MatchNotAttempted  0
nameMatchModifier    incorrectOrder  -1
nameMatchModifier    articleSubstringOfIdentity-lastName  -1
nameMatchModifier    articleSubstringOfIdentity-firstMiddleName  -1
nameMatchModifier    identitySubstringOfArticle-lastName  -2
nameMatchModifier    identitySubstringOfArticle-firstName  -1
nameMatchModifier    identitySubstringOfArticle-middleName  -1
nameMatchModifier    identitySubstringOfArticle-firstMiddleName  1
nameMatchModifier    combinedMiddleNameLastName  1
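
As one possible way to keep this table in application.properties, each row could become a key of the form variable.matchType with the score as its value. The sketch below assumes that key format; the class name NameScoreTable and the method score are hypothetical and are not part of ReCiter. It only uses java.util.Properties from the standard library.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: resolves a score from "<variable>.<matchType>=<score>" entries.
final class NameScoreTable {
    private final Properties props = new Properties();

    // propertiesText holds the same lines that would live in application.properties.
    NameScoreTable(String propertiesText) {
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the configured score; fails loudly if the pair was never configured.
    double score(String variable, String matchType) {
        String value = props.getProperty(variable + "." + matchType);
        if (value == null) {
            throw new IllegalArgumentException("No score for " + variable + "." + matchType);
        }
        // Scores are parsed as doubles because exact-singleInitial is 1.5.
        return Double.parseDouble(value.trim());
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "nameMatchFirstType.full-exact=2",
            "nameMatchMiddleType.exact-singleInitial=1.5",
            "nameMatchLastType.full-conflictingEntirely=-3");
        NameScoreTable table = new NameScoreTable(text);
        System.out.println(table.score("nameMatchMiddleType", "exact-singleInitial")); // prints 1.5
    }
}
```

Failing on an unknown pair, rather than defaulting to 0, is a design choice in this sketch: a misspelled match type would otherwise silently score as neutral.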

institutionalAuthorName

institutionalAuthorName is the set of possible names as recorded in the Identity table. These are stored in primaryName and alternateNames in the Identity table.

articleAuthorName

articleAuthorName is the name as recorded in the publication metadata.

Pseudocode

A. Decide whether to use Scopus

  1. Is use.scopus.articles=true?

    • if no, we're using the PubMed fields for name, forename and givenName. Now, go to 5.
    • if yes, go to 2
  2. Does number of authors in Scopus equal number of authors in PubMed?

    • if no, we're using the PubMed fields for name, forename and givenName. Now, go to 5.
    • if yes, go to 3
  3. Match target author (nth) in PubMed to target author (nth) in Scopus.

  4. Is the length of given-name in Scopus greater than the length of forename in PubMed?

    • if no, we're using the PubMed fields for name, forename and givenName. Now, go to 5.
    • if yes, we're using the Scopus fields for name, surname and given-name.
  5. Using author data from PubMed and Scopus according to above logic, create two fields for all authors: firstName and lastName. Let's call these article.firstName and article.lastName
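
The decision in steps 1 through 4 can be sketched roughly as follows. This is an illustration only: the types SourceName and ArticleNameSelector are hypothetical, the useScopus flag stands in for use.scopus.articles, and the sketch assumes both author lists are in indexed order and that first names are non-null.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical selector for the article name of the target author (section A).
final class ArticleNameSelector {
    // targetIndex is the target author's position (the "nth" author) in PubMed.
    static SourceName select(boolean useScopus, List<SourceName> pubmed,
                             List<SourceName> scopus, int targetIndex) {
        SourceName fromPubMed = pubmed.get(targetIndex);
        // Steps 1 and 2: fall back to PubMed if Scopus is disabled or the author counts differ.
        if (!useScopus || scopus == null || scopus.size() != pubmed.size()) {
            return fromPubMed;
        }
        // Step 3: pair the nth PubMed author with the nth Scopus author.
        SourceName fromScopus = scopus.get(targetIndex);
        // Step 4: prefer Scopus only when its given name is strictly longer.
        boolean scopusIsLonger = fromScopus.firstName.length() > fromPubMed.firstName.length();
        return scopusIsLonger ? fromScopus : fromPubMed;
    }

    public static void main(String[] args) {
        // Illustrative data only.
        List<SourceName> pubmed = Arrays.asList(new SourceName("C", "Smith"));
        List<SourceName> scopus = Arrays.asList(new SourceName("Catherine C.", "Smith"));
        System.out.println(select(true, pubmed, scopus, 0).firstName); // prints Catherine C.
    }
}

// Hypothetical value type: one author's name as indexed by a single source.
final class SourceName {
    final String firstName;
    final String lastName;

    SourceName(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }
}
```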

B. Score the targetAuthor

How many cases where targetAuthor=TRUE were selected?

C. Preprocess all names

Retrieve article.firstName and all distinct cases of identity.firstName and identity.middleName where targetAuthor=TRUE.

Preprocess identity.firstName, identity.middleName, and article.firstName

Retrieve article.lastName where targetAuthor=TRUE and all distinct cases of identity.lastName for our target author from identity. Preprocess identity.lastName and article.lastName.

D. Score the last name

Attempt full exact match where identity.lastName = article.lastName.

Combine identity.middleName and identity.lastName into mergedName, then attempt a match against article.lastName.

Attempt partial match where "%" + identity.lastName + "%" = article.lastName

Attempt match where the length of identity.lastName is >= 4 characters and levenshteinDistance between identity.lastName and article.lastName is <= 1.
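
The four attempts above can be read as an ordered cascade in which the first rule that fires wins. The sketch below is one hedged interpretation: the class LastNameMatcher is hypothetical, the returned labels are ad hoc rule names rather than the nameMatchLastType values from the lookup table, and the preprocessing (lowercasing and dropping non-letters) is an assumption, since section C does not spell it out.

```java
import java.util.Locale;

// Hypothetical cascade for section D; labels are illustrative, not ReCiter's match types.
final class LastNameMatcher {
    // Assumed preprocessing: lowercase and keep letters only.
    static String normalize(String s) {
        return s == null ? "" : s.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]", "");
    }

    // Standard dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Returns the label of the first rule that matches, in the order listed in section D.
    static String match(String identityMiddle, String identityLast, String articleLast) {
        String middle = normalize(identityMiddle);
        String last = normalize(identityLast);
        String article = normalize(articleLast);
        if (last.isEmpty() || article.isEmpty()) return "notAttempted";
        if (last.equals(article)) return "exact";
        if (!middle.isEmpty() && (middle + last).equals(article)) return "mergedMiddleAndLast";
        if (article.contains(last)) return "identitySubstringOfArticle";
        if (last.length() >= 4 && levenshtein(last, article) <= 1) return "fuzzy";
        return "noMatch";
    }

    public static void main(String[] args) {
        System.out.println(match("Del", "Cole", "Del Cole")); // prints mergedMiddleAndLast
    }
}
```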

E. Determine if identity.middleName is available to match against

Identities with no middle name can be divided into two groups:

This logic will help us figure out which case is happening.

  1. Is identity.middleName null in all name variants?

    • If yes, go to F
    • If no, go to 2
  2. Let's decide if we can ignore at least one of the name variants. To do so, they have to be very similar. Are the last names of any two name variants identical?

    • If yes, go to 3
    • If no, go to F
  3. Is one first name variant a substring of another (e.g., Jon vs. Jonathan)?

    • If yes, we can opt to ignore the name variant that does not have a middle name; return to 1 to repeat this process with remaining names
    • If no, name variant needs to be considered separately. Name variants without a middle name should be sent to F. Name variants with a middle name should be sent to G. May the best scoring name variant win!

F. Score the first name in cases where identity.middleName is null

Overview:

Attempt match where identity.firstName = article.firstName

Attempt match where identity.firstName is a left-anchored substring of article.firstName

Attempt match where article.firstName is a left-anchored substring of identity.firstName

Attempt match where first three characters of identity.firstName = first three characters of article.firstName

Attempt match where identity.firstName is greater than 4 characters and Levenshtein distance between identity.firstName and article.firstName is 1.

Attempt match where first character of identity.firstName = first character of article.firstName

Else output the following:

G. Score the first and middle name

Context:

Preprocessing: ignore/discard name variants in which it's pretty clear that one name variant has a middle name that is an abbreviation of another.

Attempt match where identity.firstName + identity.middleName = article.firstName

Attempt match where identity.firstName + "%" + identity.middleName = article.firstName

Attempt match where identity.firstName + identity.middleInitial = article.firstName

Attempt match where identity.firstName + "%" + identity.middleInitial = article.firstName

Attempt match where identity.firstInitial + identity.middleInitial = article.firstName or where identity.firstInitial + " " + identity.middleInitial = article.firstName.

Attempt match where identity.firstInitial + identity.middleName = article.firstName

Attempt match where identity.firstName + identity.middleName + "%" = article.firstName

Attempt match where identity.firstName + identity.middleInitial + "%" = article.firstName

Attempt match where identity.firstName = article.firstName

Attempt match where identity.middleInitial + identity.firstInitial = article.firstName

If there's more than one capital letter in identity.firstName or identity.middleName, attempt match where any capitals in identity.firstName + any capital letters in identity.middleName = article.firstName

If there's more than one capital letter in identity.firstName, attempt match where any capitals in identity.firstName = article.firstName

If there's more than one capital letter in identity.firstName, attempt match where any capitals in identity.firstName + identity.middleName = article.firstName

Attempt match where identity.firstName + "%" = article.firstName

Attempt match where "%" + identity.firstName = article.firstName

Attempt match where identity.middleName = article.firstName

Attempt match where identity.middleName + "%" = article.firstName

Attempt match where "%" + identity.middleName = article.firstName

Attempt match where levenshteinDistance between identity.firstName + identity.middleName and article.firstName is <=2.

Attempt match where the length of identity.firstName is >= 4 characters and levenshteinDistance between identity.firstName and article.firstName is <= 1.

Attempt match where first three characters of identity.firstName = first three characters of article.firstName.

Attempt match where identity.firstInitial + "%" + identity.middleName = article.firstName

Attempt match where identity.middleName + identity.firstInitial = article.firstName

Attempt match where article.firstName is only one character and first character of identity.firstName = article.firstName.

Attempt match where first character of identity.firstName = first character of article.firstName.

Else, we have no match of any kind.
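
Many of the rules in this section use "%" as a wildcard, in the style of SQL LIKE. A minimal sketch of how such rules could be evaluated in order is shown below. It assumes that "%" means any run of characters (including none), that names are already preprocessed and non-empty, and it covers only a handful of the rules listed above. The class FirstMiddleMatcher and the rule labels are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical evaluator for a subset of the section G rules.
final class FirstMiddleMatcher {
    // Treats "%" as a wildcard; every other character is matched literally.
    static boolean like(String pattern, String value) {
        StringBuilder regex = new StringBuilder();
        boolean firstPiece = true;
        for (String literal : pattern.split("%", -1)) {
            if (!firstPiece) regex.append(".*");
            regex.append(Pattern.quote(literal));
            firstPiece = false;
        }
        return Pattern.matches(regex.toString(), value);
    }

    // Returns the label of the first rule that matches article.firstName, or "noMatch".
    static String firstMatchingRule(String first, String middle, String articleFirst) {
        String firstInitial = first.substring(0, 1);
        String middleInitial = middle.substring(0, 1);
        // LinkedHashMap preserves insertion order, so rules are tried top to bottom.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("firstName+middleName", first + middle);
        rules.put("firstName+%+middleName", first + "%" + middle);
        rules.put("firstName+middleInitial", first + middleInitial);
        rules.put("firstName+%+middleInitial", first + "%" + middleInitial);
        rules.put("firstInitial+middleInitial", firstInitial + middleInitial);
        rules.put("firstName", first);
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (like(rule.getValue(), articleFirst)) return rule.getKey();
        }
        return "noMatch";
    }

    public static void main(String[] args) {
        System.out.println(firstMatchingRule("curtis", "del", "curtis d")); // prints firstName+%+middleInitial
    }
}
```

Because the first matching rule wins, the order of the rules effectively encodes their priority; reordering them changes which match type is reported for ambiguous names.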

H. Middle name score modification

If nameMatchMiddleType = full-exact and the matching middle name is one character, override nameMatchMiddleType to exact-singleInitial.
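
A minimal sketch of this override, assuming the match type and the matched middle name are both available as strings at this point; the class MiddleNameOverride and the method adjust are hypothetical names.

```java
// Hypothetical post-processing step for section H.
final class MiddleNameOverride {
    // Downgrades a full-exact middle name match to exact-singleInitial when the
    // matched middle name is a single character; all other types pass through unchanged.
    static String adjust(String nameMatchMiddleType, String matchedMiddleName) {
        boolean singleCharacter = matchedMiddleName != null && matchedMiddleName.length() == 1;
        if ("full-exact".equals(nameMatchMiddleType) && singleCharacter) {
            return "exact-singleInitial";
        }
        return nameMatchMiddleType;
    }

    public static void main(String[] args) {
        System.out.println(adjust("full-exact", "D"));   // prints exact-singleInitial
        System.out.println(adjust("full-exact", "Del")); // prints full-exact
    }
}
```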

paulalbert1 commented 6 years ago

This will override issues #111 and #132, and possibly #127.

paulalbert1 commented 6 years ago

There are a couple of opportunities for refinement, but this seems to work as intended.

paulalbert1 commented 6 years ago

@sarbajitdutta - A bug for ses9022 and 16614246:

  1. We should be returning this:
    nameMatchModifier: identitySubstringOfArticle-lastName
  2. Instead of "lastName": "Somersankarakaya", we should return the name exactly as recorded in the Identity table: "lastName": "Somersan-Karakaya"

      "evidence": {
        "acceptedRejectedEvidence": null,
        "authorNameEvidence": {
          "institutionalAuthorName": {
            "firstName": "Selin",
            "firstInitial": "S",
            "middleName": null,
            "middleInitial": null,
            "lastName": "Somersankarakaya"
          },
          "articleAuthorName": {
            "firstName": "Selin",
            "firstInitial": "S",
            "middleName": null,
            "middleInitial": null,
            "lastName": "Somersan"
          },
          "nameScoreTotal": -1,
          "nameMatchFirstType": "full-exact",
          "nameMatchFirstScore": 2,
          "nameMatchMiddleType": "identityNull-MatchNotAttempted",
          "nameMatchMiddleScore": 0,
          "nameMatchLastType": "full-conflictingEntirely",
          "nameMatchLastScore": -3,
          "nameMatchModifier": null,
          "nameMatchModifierScore": 0

paulalbert1 commented 5 years ago

Remaining work will be addressed in #289.