Closed anaelizabethenriquez closed 2 years ago
@anaelizabethenriquez I apologize in advance for how long this is. Let me know if I can clarify anything.
Also, adding @nmg110 to this conversation.
I was wrong about how many duplicate groups the auto merger I'm working on resolves. It actually resolves only about 180 duplicate groups (as opposed to the 800 I mentioned during standup). I had not taken into account the duplicate groups where the publications have empty DOIs, and I was merging them. They are now being skipped.
What I have now loops through the duplicate groups and merges, one by one, any duplicate publications that fit the criteria below. If a duplicate group has only one publication left after the merge, the duplicate group is deleted. Before the auto merger merges two publications, it checks that they pass the following criteria (note that I am only setting these criteria using attributes that are shown on the duplicate publication merging screen, since they seem to be the most important):
Both DOIs must be present and they must match exactly.
One title is present, or both titles are present and they case insensitively match when stripped of all non-alphanumeric characters and spaces.
The same criteria for title is used for the secondary title.
One journal is present, or both journals are present and the journal titles match exactly.
Publisher: ignoring. Since the publisher is generally defined through the journal, I felt like checking the journals was enough.
Published date: also ignoring. There were a lot of potential merges where the published dates were several months apart, so I decided to ignore it. Also, when duplicate groups are created, part of the criteria is that the published dates are within a year of one another.
Status: also ignoring. The statuses are going to be either 'Published' or 'In Press', so it shouldn't affect the decision to merge.
One volume is present, or both volumes are present and they match exactly.
One issue is present, or both issues are present and they match exactly.
One edition is present, or both editions are present and they match exactly.
One page range is present, or both page ranges are present and the page numbers before the '-' match exactly. I was seeing these three formats for page ranges:
1.) 723 2.) 723-725 3.) 723-+
So I decided to just compare the first number (in the above cases the '723').
One ISSN is present, or both ISSNs are present and they match exactly when the '-' is stripped from them.
Either both publication types are exactly the same, or they are both journal articles (have the string 'Journal Article' in the publication type).
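For readers who want to see the matching rules above in one place, here is a minimal sketch in Python. It is for illustration only and is not the code from the actual auto merger. The dictionary keys (`doi`, `title`, `pages`, and so on) and all function names are hypothetical, and treating "both values blank" as compatible is an assumption, since the list above only covers "one present" and "both present".

```python
def is_blank(value):
    # None and whitespace-only strings both count as "not present".
    return value is None or str(value).strip() == ""

def title_key(title):
    # Case-insensitive comparison with all non-alphanumeric characters removed.
    return "".join(ch for ch in title.casefold() if ch.isalnum())

def first_page(pages):
    # '723', '723-725' and '723-+' all reduce to '723'.
    return pages.split("-", 1)[0].strip()

def issn_key(issn):
    return issn.replace("-", "")

def compatible(a, b, key=lambda v: v):
    # Passes when at least one side is blank (assumption: also when both are),
    # or when both sides are equal after applying `key`.
    if is_blank(a) or is_blank(b):
        return True
    return key(a) == key(b)

def dois_match(a, b):
    # DOIs are the one field where both values are required.
    return not is_blank(a) and not is_blank(b) and a == b

def types_match(a, b):
    if a == b:
        return True
    if a is None or b is None:
        return False
    return "Journal Article" in a and "Journal Article" in b

def can_merge(a, b):
    exact_fields = ("journal_title", "volume", "issue", "edition")
    return (
        dois_match(a.get("doi"), b.get("doi"))
        and compatible(a.get("title"), b.get("title"), key=title_key)
        and compatible(a.get("secondary_title"), b.get("secondary_title"), key=title_key)
        and all(compatible(a.get(f), b.get(f)) for f in exact_fields)
        and compatible(a.get("pages"), b.get("pages"), key=first_page)
        and compatible(a.get("issn"), b.get("issn"), key=issn_key)
        and types_match(a.get("publication_type"), b.get("publication_type"))
    )
```

Publisher, published date, and status are deliberately absent from `can_merge`, matching the "ignoring" notes above.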
If all of the above criteria pass, then the publications can merge. If the criteria are not met, the publications are not merged. The following is how the two records will be merged by attribute:
If titles are exactly the same or only one is specified, then that will be the chosen title.
Otherwise, the longer title will be chosen.
If secondary titles are exactly the same or only one is specified, then that will be the chosen secondary title.
Otherwise, the longer secondary title will be chosen.
Journal must be the same or be specified in only one of the publications, so that journal will be chosen.
Publisher: dependent on journal.
If the dates are exactly the same or only one is specified, then that will be the chosen date.
Otherwise, the most recent date will be chosen.
If both are 'In Press' then it will remain 'In Press'. If both are 'Published' then it will remain 'Published'. If one is 'In Press' and the other is 'Published', then 'Published' will be chosen.
Volume must be the same or be specified in only one of the publications, so that volume will be chosen.
Issue must be the same or be specified in only one of the publications, so that issue will be chosen.
Edition must be the same or be specified in only one of the publications, so this edition will be chosen.
If the pages are exactly the same or only one is specified, then that will be the chosen pages.
Otherwise, the longer page range will be chosen. (so '723-725' would be chosen over '723-+' which would be chosen over '723')
If the ISSNs are exactly the same or only one is specified, then that will be the chosen ISSN.
Otherwise, the longer ISSN will be chosen. (so '1234-1234' would be chosen over '12341234')
If the publication types are exactly the same or only one is specified, then that will be the chosen publication type.
Otherwise, the longer publication type will be chosen. (so 'Academic Journal Article' would be chosen over 'Journal Article')
It just picks one of the ISBNs if both publications have an ISBN, or picks the only one if only one is defined.
The longer abstract is chosen.
If one or both are true, then true is chosen. If both are false, then false is chosen.
It just picks one of the values if both publications have a 'total scopus citations', or picks the only one if only one is defined.
It just picks one of the URLs if both publications have a URL, or picks the only one if only one is defined.
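Most of the merge rules above reduce to a few small helpers. The sketch below is illustrative Python, not the actual implementation; the function names are hypothetical, and returning the first argument on a length tie is an assumption, since the rules above do not say what happens then.

```python
def pick_longer(a, b):
    # "Same, or only one specified, or otherwise the longer one" in one rule:
    # a blank value has length 0, so the other side wins automatically.
    a, b = a or "", b or ""
    return a if len(a) >= len(b) else b

def merge_status(a, b):
    # 'Published' wins over 'In Press'; two equal statuses stay as they are.
    return "Published" if "Published" in (a, b) else "In Press"

def merge_flag(a, b):
    # True if either record has the flag set.
    return bool(a) or bool(b)

def pick_any(a, b):
    # "Just picks one": here the first non-blank value (an arbitrary choice).
    return a or b
```

With these, `pick_longer` covers titles, secondary titles, pages, ISSNs, publication types, and abstracts, while `pick_any` covers ISBN, total Scopus citations, and URL.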
Authorships, open access locations, and waivers will all be merged/transferred to the target publication the same way as before.
There is a bit of a filtering process to attach the preferred contributor names to the target publication. All of the contributor names for both publications are pulled together and grouped by the unique combination of the first letter of their first name and their entire last name. Then, those groups are filtered down to the records with the most metadata. So if two contributor names have the same name, but one has the role and position defined and the other does not, then the one with those attributes defined will be chosen as the preferred contributor name. The other will be deleted. Finally, if two names are grouped as the same, and they both have the same amount of metadata, then the fuller name will be chosen. For example 'John Smith' would be chosen over 'J Smith'.
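The contributor-name filtering described above could be sketched roughly as follows. This is hypothetical Python for illustration; the field names (`first_name`, `last_name`, `role`, `position`) and the choice to count only `role` and `position` as metadata are assumptions.

```python
from collections import defaultdict

def group_key(contributor):
    # First letter of the first name plus the entire last name.
    return (contributor["first_name"][:1].lower(), contributor["last_name"].lower())

def metadata_count(contributor):
    # Number of optional attributes that are actually filled in.
    return sum(1 for field in ("role", "position")
               if contributor.get(field) not in (None, ""))

def name_length(contributor):
    return len(contributor["first_name"]) + len(contributor["last_name"])

def preferred_contributors(contributors):
    groups = defaultdict(list)
    for contributor in contributors:
        groups[group_key(contributor)].append(contributor)
    # Within each group: most metadata first, then the fuller name as tiebreaker.
    return [max(group, key=lambda c: (metadata_count(c), name_length(c)))
            for group in groups.values()]
```

Note that with this key, 'John Smith' and 'Jane Smith' land in the same group, so only one of them would survive.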
Thanks, @ajkiessl ! Here are a few thoughts on the matching part from me. Feedback on merging coming soon.
Title and Secondary Title
It's common (I think with Pure data) for the subtitle after a colon to be put in the Secondary Title field. With AI data, this subtitle is more commonly in the Title field, after a semicolon. Rather than requiring Title to match Title and Secondary Title to match Secondary Title, could you concatenate the fields, strip non-alphanumeric characters and spaces, and then check for the case insensitive match? I think that would help match a lot more things.
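A small sketch of that concatenate-then-compare idea, in illustrative Python (the function name is hypothetical):

```python
def combined_title_key(title, secondary_title):
    # Join both fields, then drop case and every non-alphanumeric character,
    # so it no longer matters which field the subtitle was stored in.
    combined = f"{title or ''}{secondary_title or ''}"
    return "".join(ch for ch in combined.casefold() if ch.isalnum())

pure_style = combined_title_key("Math", "An Interesting Subject")
ai_style = combined_title_key("Math; An Interesting Subject", None)
print(pure_style == ai_style)
```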
Journal title
One journal is present, or both journals are present and the journal titles match exactly.
Ampersands ("&") are a common problem with journals. If it's not too tricky, it would be nice if "Math & Computers" would be treated as a match for "Math and Computers."
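One simple way to handle this is to normalize the ampersand before the otherwise exact comparison. A sketch in Python (the function name is hypothetical, and collapsing repeated whitespace is a side effect of this approach rather than something requested above):

```python
def journal_key(journal_title):
    # Spell out '&' as 'and', then collapse runs of whitespace so that
    # 'Math & Computers' and 'Math&Computers' normalize the same way.
    spelled_out = journal_title.replace("&", " and ")
    return " ".join(spelled_out.split())
```

The comparison stays case-sensitive here, in keeping with the "match exactly" rule for journal titles.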
Matching rules for other fields
I like all of these.
Title and Secondary Title
Just taking the longest entry in each of these fields is not ideal. Suppose the actual title is "Math: An Interesting Subject." Pure will have this split across the two fields and Activity Insight won't. So we'd end up with Title = "Math: An Interesting Subject" and Secondary Title = "An Interesting Subject." And then I think that would show up in some places as "Math: An Interesting Subject: An Interesting Subject." Right now, I'm not sure what the solution looks like here.
Date of publication
Let's take the earliest date of publication, rather than the most recent. It's common to have a date of first online publication and a date of the journal issue, which can be much later. We generally care about first online publication (especially for setting embargoes in ScholarSphere).
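In code, this is a one-line change from "latest" to "earliest". A minimal Python sketch (the function name is hypothetical):

```python
from datetime import date

def merge_published_on(a, b):
    # Keep the earliest known date; a missing date never wins over a real one.
    present = [d for d in (a, b) if d is not None]
    return min(present) if present else None
```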
ISSN
Journals can have various ISSNs (electronic, print, etc.), and they are all 8 (or 9, with hyphen) digits. You'll need some kind of additional tiebreaker here. But I bet you've planned for that already.
All other Merging items
Everything else looks good to me here.
Let's talk about this one on Monday. Ideally, it would be nice if Jackie Smith and John Smith could be coauthors on a paper without one of them getting dropped from it. Most of the time this is unlikely to occur, but a few disciplines (cough cough physics) can have really big groups of coauthors (sometimes more than 1,000). And some last names are very common.
Merging titles is going to be hard because of the way Pure has the data, as Ana states above. When we manually merge and we see the Pure record as Title: "Math", Second Title: "An Interesting Subject", our rule of thumb is to manually edit it to "Math: An Interesting Subject." When the Pure data comes in, prior to merging with other records, is there a way we could magically change the title to Title: Second Title if there is data in the second title field? If we could do that first, that would help tremendously!
@anaelizabethenriquez I should be able to do all of this.
Regarding the merging of titles and secondary titles: I should be able to add some logic to detect when a secondary title is appended to the main title in one record and not in the other record. Given @nmg110's comment, I can merge the records to store the title as "Math: An Interesting Subject." instead of Title: "Math", Secondary Title: "An Interesting Subject". We can edit the Pure importer to automatically combine the secondary and main titles, but I'll have to add that to another ticket and work on that at a later time.
Regarding ISSNs: I do see some ISSNs stored like this in RMD: "1234-1234 (Print) 1234-4321 (Linking)". When matching, I can strip all non-numeric characters except for "X", and check if one record's ISSN is included in the other (and vice versa). When merging, would "1234-1234 (Print) 1234-4321 (Linking)" be preferable to "1234-1234"?
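As a rough illustration of the title logic described above, the sketch below builds the full "Title: Secondary Title" form for each record and compares the two. It is hypothetical Python, not the real change; records are represented as `(title, secondary_title)` tuples, and preferring the record that already stores the whole title in one field is an assumption.

```python
def title_key(text):
    return "".join(ch for ch in text.casefold() if ch.isalnum())

def full_title(title, secondary_title):
    # 'Math' + 'An Interesting Subject' becomes 'Math: An Interesting Subject'.
    if secondary_title:
        return f"{title}: {secondary_title}"
    return title

def merge_titles(a, b):
    """a and b are (title, secondary_title) tuples.

    Returns the merged (title, secondary_title), or None when the two
    records do not describe the same full title.
    """
    full_a, full_b = full_title(*a), full_title(*b)
    if title_key(full_a) != title_key(full_b):
        return None
    # Prefer the spelling from the record that already has a single combined field.
    chosen = full_b if a[1] and not b[1] else full_a
    return (chosen, None)
```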
When merging ISSN, I think it's preferable to end up with a single value that's a valid ISSN (i.e., 1234-1234) rather than a field that contains extra info but isn't machine-readable. One day we may want to look up info about the journal using the ISSN.
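One possible way to get from a free-text field to a single machine-readable value is to extract every ISSN-shaped token and keep one of them. The Python below is a sketch under assumptions: the function names are hypothetical, "valid" here means only the NNNN-NNNC shape (the check digit is not verified), and the tiebreaker (prefer an ISSN found in both records, otherwise the first one found) is a placeholder since no tiebreaker was agreed on in this thread.

```python
import re

# Four digits, optional hyphen, three digits, then a digit or 'X'.
ISSN_PATTERN = re.compile(r"\b(\d{4})-?(\d{3}[\dXx])\b")

def extract_issns(raw):
    # Returns every ISSN found in the text, normalized to 'NNNN-NNNC'.
    return [f"{m.group(1)}-{m.group(2).upper()}"
            for m in ISSN_PATTERN.finditer(raw or "")]

def issns_match(a, b):
    # "Both present" branch only: the records share at least one ISSN.
    return bool(set(extract_issns(a)) & set(extract_issns(b)))

def merge_issn(a, b):
    found_a, found_b = extract_issns(a), extract_issns(b)
    shared = [issn for issn in found_a if issn in found_b]
    candidates = shared or found_a or found_b
    return candidates[0] if candidates else None
```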
@anaelizabethenriquez I have a PR ready to be merged for this. I apologize; I was not feeling well on and off last week and did not get as far on this as I would have liked. Right now it clears out ~200 duplicate publication groups from the ~5250 that we currently have. It resolves more duplicates than that, though. A lot of duplicate groups have more than two publications where one does not fit the criteria to merge with this new feature. So some duplicates may get resolved, while the entire duplicate group may not resolve.
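To make the "some duplicates resolve but the group does not" case concrete, here is a hedged sketch of a per-group loop in Python. The function names are hypothetical, and `can_merge` and `merge` are passed in as stand-ins for the real matching and merging logic, which is not shown in this thread.

```python
def auto_merge_group(publications, can_merge, merge):
    """Merge every mergeable pair in one duplicate group.

    Returns the publications that remain in the group afterwards.
    """
    remaining = list(publications)
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(remaining)):
            for j in range(i + 1, len(remaining)):
                if can_merge(remaining[i], remaining[j]):
                    # Fold publication j into publication i, then rescan.
                    remaining[i] = merge(remaining[i], remaining[j])
                    del remaining[j]
                    merged_any = True
                    break
            if merged_any:
                break
    return remaining

def group_resolved(remaining):
    # A group with one publication left (or none) can be deleted.
    return len(remaining) <= 1
```

For a group of three where two share a DOI and the third has none, this leaves two publications behind, so the group itself stays open even though one duplicate was resolved.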
I just wanted to mention a few other things that I changed since the last discussion on this.
I could not come up with a reliable way to separate out names that have the same first letter of their first name and the same last name (but are actually different people) by analyzing the names further. I did discover, however, that the position data we have appears to be pretty reliable, so I am now finding unique contributor names on the first letter of the first name, the whole last name, and the position. This takes care of instances where two different contributors may have names like John Smith vs Jane Smith, assuming their position data is correct. My analysis of the position data may be naive, but I checked a lot of records before and after the merge, and it seemed to be solid and consistent data.
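In terms of a grouping key, the change amounts to adding the position as a third component. A tiny Python sketch (field names are hypothetical):

```python
def contributor_key(contributor):
    # First initial + whole last name + position in the author list.
    return (
        contributor["first_name"][:1].lower(),
        contributor["last_name"].lower(),
        contributor.get("position"),
    )

john = {"first_name": "John", "last_name": "Smith", "position": 1}
jane = {"first_name": "Jane", "last_name": "Smith", "position": 7}
j = {"first_name": "J", "last_name": "Smith", "position": 1}
print(contributor_key(john) == contributor_key(jane))
print(contributor_key(john) == contributor_key(j))
```

Here John and Jane get different keys because their positions differ, while 'John Smith' and 'J Smith' at the same position are still treated as the same contributor.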
We have three related fields for publications: `journal`, `journal_title`, and `publisher_name`. `journal` links to an actual journal record in the RMD that stores metadata for that journal. The journal record also contains a link to a publisher record. The `journal_title` and `publisher_name` fields just contain strings. I decided that when merging this data, if there is a journal record attached to one of the merging records, to give this preference and set `journal_title` and `publisher_name` to blank values even if one of the merging publications has these values set.
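A sketch of that preference rule in Python, for illustration only. Records are plain dictionaries here, using `None` as the blank value is an assumption, and the fallback when neither record has a linked journal (take whichever string is set) is also an assumption, since it is not spelled out above.

```python
def merge_journal_fields(a, b):
    journal = a.get("journal")
    if journal is None:
        journal = b.get("journal")
    if journal is not None:
        # A linked journal record wins; the free-text fields are blanked out.
        return {"journal": journal, "journal_title": None, "publisher_name": None}
    # Assumed fallback: no linked journal on either side, keep the strings.
    return {
        "journal": None,
        "journal_title": a.get("journal_title") or b.get("journal_title"),
        "publisher_name": a.get("publisher_name") or b.get("publisher_name"),
    }
```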
When matching publication type, if one of the publication types is 'Other' and the other is not 'Other', then these publications can merge. When merging, the publication type that is not 'Other' is given preference.
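Combined with the earlier publication type rules from this thread, the 'Other' handling might look like the following Python sketch (function names are hypothetical, and both types are assumed to be present):

```python
def types_can_merge(a, b):
    if a == b:
        return True
    if "Journal Article" in a and "Journal Article" in b:
        return True
    # The types differ at this point, so at most one of them is 'Other'.
    return "Other" in (a, b)

def merged_type(a, b):
    # 'Other' always loses to a more specific type.
    if a == "Other":
        return b
    if b == "Other":
        return a
    # Otherwise fall back to the earlier "longer type wins" rule.
    return a if len(a) >= len(b) else b
```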
@ajkiessl This all looks good to me. Thank you for working on this!
We've discussed merging publications that have the same DOI, without requiring manual review. I thought this was in place already, because @ajkiessl ran a task this morning to automatically deduplicate some publications. However, I'm seeing the following duplicate groups where the publications have matching DOIs (I won't merge these for now):
Seems like there are still a lot of these.