Closed paulalbert1 closed 5 years ago
This will override issues #111 and #132, and possibly #127.
There are a couple of opportunities for refinement, but this seems to work as intended.
@sarbajitdutta - A bug for ses9022 and 16614246:
nameMatchModifier: identitySubstringOfArticle-lastName
Instead of "lastName": "Somersankarakaya", we should return the name exactly as recorded in the Identity table: "lastName": "Somersan-Karakaya"
"evidence": {
  "acceptedRejectedEvidence": null,
  "authorNameEvidence": {
    "institutionalAuthorName": {
      "firstName": "Selin",
      "firstInitial": "S",
      "middleName": null,
      "middleInitial": null,
      "lastName": "Somersankarakaya"
    },
    "articleAuthorName": {
      "firstName": "Selin",
      "firstInitial": "S",
      "middleName": null,
      "middleInitial": null,
      "lastName": "Somersan"
    },
    "nameScoreTotal": -1,
    "nameMatchFirstType": "full-exact",
    "nameMatchFirstScore": 2,
    "nameMatchMiddleType": "identityNull-MatchNotAttempted",
    "nameMatchMiddleScore": 0,
    "nameMatchLastType": "full-conflictingEntirely",
    "nameMatchLastScore": -3,
    "nameMatchModifier": null,
    "nameMatchModifierScore": 0
Remaining work will be addressed in #289.
Background
The goal of this scoring strategy is to have a reliable score for how closely any of the names in the Identity table match the targetAuthor's name as indexed in the article.
Sample data
PubMed
Scopus
Intended output
The goal is to be able to return something in the feature-generator that looks like this...
The scoring lookup table for this and other features needs to be stored in a single location such as application.properties. Use your judgment about formatting. Here's one option. Note that each entry has a variable, a string value, and an integer value.
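To make the "variable, string value, integer value" shape concrete, here is a rough sketch in Python rather than in a properties file. The three entries are the only ones taken from this issue (they come from the evidence sample for ses9022); the dictionary layout and the names `NAME_SCORES`, `lookup_score`, and `name_score_total` are assumptions for illustration, not the actual configuration.

```python
# Hypothetical layout: (variable, string value) -> integer score.
# Only these three pairs appear in this issue's evidence sample;
# a real table would list every match type for every name part.
NAME_SCORES = {
    ("nameMatchFirstType", "full-exact"): 2,
    ("nameMatchMiddleType", "identityNull-MatchNotAttempted"): 0,
    ("nameMatchLastType", "full-conflictingEntirely"): -3,
}


def lookup_score(variable: str, value: str) -> int:
    """Return the configured score, failing loudly on an unknown pair."""
    try:
        return NAME_SCORES[(variable, value)]
    except KeyError:
        raise KeyError(f"no score configured for {variable}={value}") from None


def name_score_total(match_types: dict) -> int:
    """Sum the scores for the supplied {variable: string value} pairs."""
    return sum(lookup_score(var, val) for var, val in match_types.items())
```

Keeping every score in one table means a score can be tuned without touching the matching logic.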
institutionalAuthorName
institutionalAuthorName is the set of possible names as recorded in the Identity table. These are stored in primaryName and alternateNames in the Identity table.
articleAuthorName
articleAuthorName is the name as recorded in the publication metadata.
Pseudocode
A. Decide whether to use Scopus
1. Is use.scopus.articles=true?
   If no, use forename and givenName. Now, go to 5.
2. Does number of authors in Scopus equal number of authors in PubMed?
   If no, use forename and givenName. Now, go to 5.
3. Match target author (nth) in PubMed to target author (nth) in Scopus.
4. Is length of given-name in Scopus greater than forename in PubMed?
   If no, use forename and givenName. Now, go to 5.
   If yes, use surname and given-name.
5. Using author data from PubMed and Scopus according to the above logic, create two fields for all authors: firstName and lastName. Let's call these article.firstName and article.lastName.
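A minimal Python sketch of the source-selection logic in this section. The `Author` class and the function name `choose_article_authors` are hypothetical. The sketch also assumes the length comparison is applied to each author position in turn, which this issue does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Author:
    # Hypothetical container for article.firstName / article.lastName.
    first_name: str
    last_name: str


def choose_article_authors(
    pubmed: List[Author],
    scopus: Optional[List[Author]],
    use_scopus: bool,
) -> List[Author]:
    """Return one firstName/lastName pair per author, in article order."""
    # Fall back to PubMed when Scopus is disabled, missing,
    # or lists a different number of authors.
    if not use_scopus or scopus is None or len(scopus) != len(pubmed):
        return list(pubmed)

    chosen = []
    # Pair the nth PubMed author with the nth Scopus author.
    for pm, sc in zip(pubmed, scopus):
        # Prefer Scopus only when its given name is strictly longer.
        chosen.append(sc if len(sc.first_name) > len(pm.first_name) else pm)
    return chosen
```

For example, a PubMed author recorded as "S Somersan" and a Scopus author recorded as "Selin Somersan" would yield the Scopus version, because the Scopus given name is longer.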
B. Score the targetAuthor
How many cases where targetAuthor=TRUE were selected?
C. Preprocess all names
Retrieve article.firstName and all distinct cases of identity.firstName and identity.middleName where targetAuthor=TRUE.
Preprocess identity.firstName, identity.middleName, and article.firstName
Retrieve article.lastName where targetAuthor=TRUE and all distinct cases of identity.lastName for our target author from identity. Preprocess identity.lastName and article.lastName.
D. Score the last name
Attempt full exact match where identity.lastName = article.lastName.
Combine identity.middleName and identity.lastName into mergedName. Now attempt match against article.lastName.
Attempt partial match where "%" + identity.lastName + "%" = article.lastName
Attempt match where length of identity.lastName is >= 4 characters and levenshteinDistance between identity.lastName and article.lastName is <=1.
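The four attempts above could be sketched in Python as follows. This assumes the names have already been preprocessed, that the attempts run in the order listed, and that the first success wins. Only "full-exact" and "full-conflictingEntirely" are labels that appear in this issue; the other three labels and both function names are placeholders.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def last_name_match_type(identity_last, identity_middle, article_last):
    """Try each attempt in order and return the label of the first success."""
    # Full exact match.
    if identity_last == article_last:
        return "full-exact"
    # mergedName = identity.middleName + identity.lastName.
    if identity_middle and identity_middle + identity_last == article_last:
        return "placeholder-mergedName"
    # "%" + identity.lastName + "%": identity name occurs inside article name.
    if identity_last and identity_last in article_last:
        return "placeholder-identitySubstringOfArticle"
    # Near miss: at least 4 characters and at most one edit apart.
    if len(identity_last) >= 4 and levenshtein(identity_last, article_last) <= 1:
        return "placeholder-levenshtein"
    # Assumed fallback, matching the label seen in the evidence sample.
    return "full-conflictingEntirely"
```

With the ses9022 data, `last_name_match_type("somersankarakaya", None, "somersan")` falls through to "full-conflictingEntirely", because the article name is a substring of the identity name rather than the other way around.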
E. Determine if identity.middleName is available to match against
Identities with no middle name can be divided into two groups:
This logic will help us figure out which case is happening.
Is identity.middleName null in all name variants?
Let's decide if we can ignore at least one of the name variants. To do so, they have to be very similar. Are the last names of any two name variants identical?
Is one first name variant a substring of another (e.g., Jon vs. Jonathan)?
F. Score the first name in cases where identity.middleName is null
Overview:
Attempt match where identity.firstName = article.firstName
Attempt match where identity.firstName is a left-anchored substring of article.firstName
Attempt match where article.firstName is a left-anchored substring of identity.firstName
Attempt match where first three characters of identity.firstName = first three characters of article.firstName
Attempt match where identity.firstName is greater than 4 characters and Levenshtein distance between identity.firstName and article.firstName is 1.
Attempt match where first character of identity.firstName = first character of article.firstName
Else output the following:
G. Score the first and middle name
Context:
Preprocessing: discard a name variant when its middle name is clearly an abbreviation of the middle name in another variant.
Attempt match where identity.firstName + identity.middleName = article.firstName
Attempt match where identity.firstName + "%" + identity.middleName = article.firstName
Attempt match where identity.firstName + identity.middleInitial = article.firstName
Attempt match where identity.firstName + "%" + identity.middleInitial = article.firstName
Attempt match where identity.firstInitial + identity.middleInitial = article.firstName or where identity.firstInitial + " " + identity.middleInitial = article.firstName.
Attempt match where identity.firstInitial + identity.middleName = article.firstName
Attempt match where identity.firstName + identity.middleName + "%" = article.firstName
Attempt match where identity.firstName + identity.middleInitial + "%" = article.firstName
Attempt match where identity.firstName = article.firstName
Attempt match where identity.middleInitial + identity.firstInitial = article.firstName
If there's more than one capital letter in identity.firstName or identity.middleName, attempt match where any capitals in identity.firstName + any capital letters in identity.middleName = article.firstName
If there's more than one capital letter in identity.firstName, attempt match where any capitals in identity.firstName = article.firstName
If there's more than one capital letter in identity.firstName, attempt match where any capitals in identity.firstName + identity.middleName = article.firstName
Attempt match where identity.firstName + "%" = article.firstName
Attempt match where "%" + identity.firstName = article.firstName
Attempt match where identity.middleName = article.firstName
Attempt match where identity.middleName + "%" = article.firstName
Attempt match where "%" + identity.middleName = article.firstName
Attempt match where levenshteinDistance between identity.firstName + identity.middleName and article.firstName is <=2.
Attempt match where length of identity.firstName is >= 4 characters and levenshteinDistance between identity.firstName and article.firstName is <=1.
Attempt match where first three characters of identity.firstName = first three characters of article.firstName.
Attempt match where identity.firstInitial + "%" + identity.middleName = article.firstName
Attempt match where identity.middleName + identity.firstInitial = article.firstName
Attempt match where article.firstName is only one character and identity.firstName = first character of article.firstName.
Attempt match where first character of identity.firstName = first character of article.firstName.
Else, we have no match of any kind.
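The list above is an ordered cascade: try each rule in turn and stop at the first one that matches. Here is a rough Python sketch covering only a handful of the rules, kept in the same relative order. The rule labels, the `like` helper (which treats "%" as "any run of characters"), and the function names are all assumptions made for illustration.

```python
import re


def like(pattern: str, text: str) -> bool:
    """SQL-LIKE style match where '%' stands for any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


# (label, rule) pairs; f = identity.firstName, m = identity.middleName,
# a = article.firstName. Labels are placeholders, not real match types.
RULES = [
    ("firstName+middleName", lambda f, m, a: f + m == a),
    ("firstName+%+middleName", lambda f, m, a: like(f + "%" + m, a)),
    ("firstName+middleInitial", lambda f, m, a: f + m[:1] == a),
    ("firstName+%+middleInitial", lambda f, m, a: like(f + "%" + m[:1], a)),
    ("firstInitial+middleInitial",
     lambda f, m, a: a in (f[:1] + m[:1], f[:1] + " " + m[:1])),
    ("firstName", lambda f, m, a: f == a),
    ("firstName+%", lambda f, m, a: like(f + "%", a)),
    ("firstInitial", lambda f, m, a: f[:1] == a[:1]),
]


def first_middle_match_type(first: str, middle: str, article_first: str) -> str:
    """Return the label of the first rule that matches, else 'no-match'."""
    # This section only applies when all three names are present.
    if not (first and middle and article_first):
        return "no-match"
    for label, rule in RULES:
        if rule(first, middle, article_first):
            return label
    return "no-match"
```

Expressing the rules as data keeps the ordering visible in one place, which matters here because an earlier, stricter rule should always win over a later, looser one.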
H. Middle name score modification
If middleNameMatchType = full-exact and the matching middle name is one character, override that score to nameMatchMiddleType=exact-singleInitial.
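As a small Python sketch, with the function name `apply_middle_name_modifier` being hypothetical, the override might look like this:

```python
def apply_middle_name_modifier(match_type: str, matched_middle: str) -> str:
    """Downgrade a full-exact middle name match that is only a single initial."""
    if match_type == "full-exact" and len(matched_middle) == 1:
        return "exact-singleInitial"
    # Every other match type passes through unchanged.
    return match_type
```

The point of the override is that an exact match on a single initial is weaker evidence than an exact match on a full middle name, so it should be labeled differently.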