psu-libraries / researcher-metadata

Penn State University's faculty and research metadata repository
https://metadata.libraries.psu.edu/
MIT License

Merge publications based on DOI #374

Closed anaelizabethenriquez closed 2 years ago

anaelizabethenriquez commented 2 years ago

We've discussed merging publications that have the same DOI, without requiring manual review. I thought this was in place already, because @ajkiessl ran a task this morning to automatically deduplicate some publications. However, I'm seeing the following duplicate groups where the publications have matching DOIs (I won't merge these for now):

Seems like there are still a lot of these.

ajkiessl commented 2 years ago

@anaelizabethenriquez I apologize in advance for how long this is. Let me know if I can clarify anything.

Also, adding @nmg110 to this conversation.

I was wrong about how many duplicate groups the auto merger I'm working on resolves. It actually only resolves about 180 duplicate groups (as opposed to the 800 I mentioned during standup). I had not accounted for the duplicate groups where the publications have empty DOIs, so those were being merged as well. They are now being skipped.

What I have now will loop through the duplicate groups and merge, one by one, any duplicate publications that fit the criteria below. If the duplicate group only has one publication left after the merge, it will delete the duplicate group. When the auto merger goes to merge two publications, it first checks whether the two publications pass the following criteria (note that I am only setting these criteria using attributes that are shown on the duplicate publication merging screen, since they seem to be the most important):

Doi

Both dois must be present and they must match exactly.

Title

One title is present, or both titles are present and they case insensitively match when stripped of all non-alphanumeric characters and spaces.
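The title rule above can be sketched in Python as follows. This is only an illustration, not the project's code: the function names are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and treating two missing titles as a match is also an assumption.

```python
import re


def normalize_title(title):
    """Lowercase the title and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def titles_match(a, b):
    """True if at most one title is present, or both normalize to the same string."""
    if not a or not b:
        # Assumption: a missing title on either side never blocks a merge.
        return True
    return normalize_title(a) == normalize_title(b)
```

Under these assumptions, "Math: An Interesting Subject" and "math an interesting subject." count as a match, while "Math" and "Maths" do not.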

Secondary Title

The same criteria used for the title are used for the secondary title.

Journal

One journal is present, or both journals are present and the journal titles match exactly.

Publisher

Ignoring. Since the publisher is generally defined through the journal, I felt like checking the journals was enough.

Date of publication

Also ignoring. There were a lot of potential merges where the published dates were several months apart, so I decided to ignore it. Also, when duplicate groups are created, part of the criteria is that the published dates are within a year from one another.

Status

Also ignoring. The statuses are going to be either 'Published' or 'In Press', so it shouldn't affect the decision to merge.

Volume

One volume is present, or both volumes are present and they match exactly.

Issue

One issue is present, or both issues are present and they match exactly.

Edition

One edition is present, or both editions are present and they match exactly.

Pages (page range)

One page range is present, or both page ranges are present and the page numbers before the '-' match exactly. I was seeing these three formats for page ranges:

1.) 723

2.) 723-725

3.) 723-+

So I decided to just compare the first number (in the above cases the '723').
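A minimal Python sketch of the page-range comparison described above. The function names are hypothetical, and it assumes the first page is simply whatever precedes the first '-'.

```python
def first_page(pages):
    """Return the text before the first '-' in a page range such as '723-725'."""
    return pages.split("-", 1)[0].strip()


def pages_match(a, b):
    """True if at most one page range is present, or both start on the same page."""
    if not a or not b:
        # Assumption: a missing page range on either side never blocks a merge.
        return True
    return first_page(a) == first_page(b)
```

With this, all three formats listed above ('723', '723-725', '723-+') compare as equal to one another.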

ISSN

One ISSN is present, or both ISSNs are present and they match exactly when the '-' is stripped from them.

Publication type

Either both publication types are exactly the same, or they are both journal articles (have the string 'Journal Article' in the publication type).

Merging

If all of the above criteria pass, then the publications can merge. If the criteria are not met, the publications are not merged. The following is how the two records will be merged by attribute:

Title

If titles are exactly the same or only one is specified, then that will be the chosen title.

Otherwise, the longer title will be chosen.

Secondary Title

If secondary titles are exactly the same or only one is specified, then that will be the chosen secondary title.

Otherwise, the longer secondary title will be chosen.

Journal

Journal must be the same or be specified in only one of the publications, so that journal will be chosen.

Publisher

Dependent on journal.

Date of publication

If the dates are exactly the same or only one is specified, then that will be the chosen date.

Otherwise, the most recent date will be chosen.

Status

If both are 'In Press' then it will remain 'In Press'. If both are 'Published' then it will remain 'Published'. If one is 'In Press' and the other is 'Published', then 'Published' will be chosen.

Volume

Volume must be the same or be specified in only one of the publications, so that volume will be chosen.

Issue

Issue must be the same or be specified in only one of the publications, so that issue will be chosen.

Edition

Edition must be the same or be specified in only one of the publications, so this edition will be chosen.

Pages (page range)

If the pages are exactly the same or only one is specified, then that will be the chosen pages.

Otherwise, the longer page range will be chosen. (so '723-725' would be chosen over '723-+' which would be chosen over '723')

ISSN

If the issns are exactly the same or only one is specified, then that will be the chosen issn.

Otherwise, the longer issn will be chosen. (so '1234-1234' would be chosen over '12341234')

Publication type

If the publication types are exactly the same or only one is specified, then that will be the chosen publication type.

Otherwise, the longer publication type will be chosen. (so 'Academic Journal Article' would be chosen over 'Journal Article')
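The "only value present, otherwise the longer one" rule used for pages, ISSN, and publication type above could be captured in one small helper. This Python sketch is illustrative: the name `prefer_longer` is hypothetical, and returning the first argument on a length tie is an assumption the thread does not settle.

```python
def prefer_longer(a, b):
    """Return the only value present, or the longer of the two values."""
    if not a:
        return b
    if not b:
        return a
    # Assumption: on a tie in length, keep the first value.
    return a if len(a) >= len(b) else b
```

For example, `prefer_longer('723-+', '723-725')` gives `'723-725'`, and `prefer_longer('12341234', '1234-1234')` gives `'1234-1234'`.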

ISBN

It just picks one of the isbns if both publications have an isbn, or picks the only one if only one is defined.

Abstract

The longer abstract is chosen.

Author et al

If one or both are true, then true is chosen. If both are false, then false is chosen.

Total scopus citations

It just picks one of the values if both publications have a 'total scopus citations', or picks the only one if only one is defined.

Url

It just picks one of the urls if both publications have a url, or picks the only one if only one is defined.

Other stuff

Authorships, open access locations, and waivers will all be merged/transferred to the target publication the same way as before.

Contributor Names

There is a bit of a filtering process to attach the preferred contributor names to the target publication. All of the contributor names for both publications are pulled together and grouped by the unique combination of the first letter of their first name and their entire last name. Then, those groups are filtered down to the records with the most metadata. So if two contributor names have the same name, but one has the role and position defined and the other does not, then the one with those attributes defined will be chosen as the preferred contributor name. The other will be deleted. Finally, if two names are grouped as the same, and they both have the same amount of metadata, then the fuller name will be chosen. For example 'John Smith' would be chosen over 'J Smith'.
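The contributor-name filtering described above can be sketched in Python like this. The `ContributorName` class and its fields are hypothetical stand-ins for the real records, grouping is assumed to be case-insensitive, and "amount of metadata" is assumed to mean how many of role and position are set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContributorName:
    first_name: str
    last_name: str
    role: Optional[str] = None
    position: Optional[int] = None


def group_key(c):
    """First letter of the first name plus the entire last name."""
    return (c.first_name[:1].lower(), c.last_name.lower())


def metadata_count(c):
    """How many of the optional attributes are defined."""
    return sum(value is not None for value in (c.role, c.position))


def preferred_contributors(names):
    """Keep one record per group: most metadata first, then the fuller name."""
    groups = {}
    for c in names:
        groups.setdefault(group_key(c), []).append(c)
    return [
        max(group, key=lambda c: (metadata_count(c), len(c.first_name) + len(c.last_name)))
        for group in groups.values()
    ]
```

With equal metadata, 'John Smith' is kept over 'J Smith'. Note that under this grouping 'Jackie Smith' and 'John Smith' also fall into the same group, so only one of them would be kept.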

anaelizabethenriquez commented 2 years ago

Thanks, @ajkiessl ! Here are a few thoughts on the matching part from me. Feedback on merging coming soon.

Matching

Title and Secondary Title

It's common (I think with Pure data) for the subtitle after a colon to be put in the Secondary Title field. With AI data, this subtitle is more commonly in the Title field, after a semicolon. Rather than requiring Title to match Title and Secondary Title to match Secondary Title, could you concatenate the fields, strip non-alphanumeric characters and spaces, and then check for the case insensitive match? I think that would help match a lot more things.
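A rough Python sketch of the concatenate-then-compare idea. The names are hypothetical, each publication is represented as a `(title, secondary_title)` tuple purely for illustration, and only ASCII letters and digits are kept.

```python
import re


def combined_title_key(title, secondary_title):
    """Join title and secondary title, then keep only lowercase letters and digits."""
    combined = f"{title or ''}{secondary_title or ''}"
    return re.sub(r"[^a-z0-9]", "", combined.lower())


def combined_titles_match(pub_a, pub_b):
    """Compare two (title, secondary_title) tuples on their combined keys."""
    return combined_title_key(*pub_a) == combined_title_key(*pub_b)
```

Here a Pure-style record `("Math", "An Interesting Subject")` matches an Activity Insight-style record `("Math; An Interesting Subject", None)`, because the separator and the field split both disappear in the key.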

Journal title

One journal is present, or both journals are present and the journal titles match exactly.

Ampersands ("&") are a common problem with journals. If it's not too tricky, it would be nice if "Math & Computers" would be treated as a match for "Math and Computers."
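One way to handle the ampersand case, sketched in Python. The function names are hypothetical, and this version changes nothing else about the comparison: it stays case-sensitive and only rewrites "&" and collapses whitespace.

```python
def normalize_journal_title(title):
    """Replace '&' with 'and' and collapse runs of whitespace."""
    return " ".join(title.replace("&", " and ").split())


def journal_titles_match(a, b):
    """True if at most one journal title is present, or both normalize identically."""
    if not a or not b:
        # Assumption: a missing journal title never blocks a merge.
        return True
    return normalize_journal_title(a) == normalize_journal_title(b)
```

With this, "Math & Computers" and "Math and Computers" compare as equal.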

Matching rules for other fields

I like all of these.

anaelizabethenriquez commented 2 years ago

Merging

Title and Secondary Title

Just taking the longest entry in each of these fields is not ideal. Suppose the actual title is "Math: An Interesting Subject." Pure will have this split across the two fields and Activity Insight won't. So we'd end up with Title = "Math: An Interesting Subject" and Secondary Title = "An Interesting Subject." And then I think that would show up in some places as "Math: An Interesting Subject: An Interesting Subject." Right now, I'm not sure what the solution looks like here.

Date of publication

Let's take the earliest date of publication, rather than the most recent. It's common to have a date of first online publication and a date of the journal issue, which can be much later. We generally care about first online publication (especially for setting embargoes in ScholarSphere).

ISSN

Journals can have various ISSNs (electronic, print, etc.), and they are all 8 characters long (or 9 with the hyphen), so picking the longer one won't distinguish between them. You'll need some kind of additional tiebreaker here. But I bet you've planned for that already.

All other Merging items

Everything else looks good to me here.

Contributor Names

Let's talk about this one on Monday. Ideally, it would be nice if Jackie Smith and John Smith could be coauthors on a paper without one of them getting dropped from it. Most of the time this is unlikely to occur, but a few disciplines (cough cough physics) can have really big groups of coauthors (sometimes more than 1,000). And some last names are very common.

nmg110 commented 2 years ago

Merging titles is going to be hard because of the way Pure has the data, as Ana states above. When we manually merge and see the Pure record as Title: "Math", Second Title: "An Interesting Subject", our rule of thumb is to manually edit it to "Math: An Interesting Subject". When the Pure data comes in, prior to merging with other records, is there a way we could magically change the title to "Title: Second Title" if there is data in the second title field? If we could do that first, that would help tremendously!

ajkiessl commented 2 years ago

@anaelizabethenriquez I should be able to do all of this.

Regarding the merging of titles and secondary titles: I should be able to add some logic to detect when a secondary title is appended to the main title in one record but not in the other. Given @nmg110's comment, I can merge the records so that the title is stored as "Math: An Interesting Subject" instead of Title: "Math", Secondary Title: "An Interesting Subject". We can also edit the Pure importer to automatically combine the secondary and main titles, but I'll have to put that in another ticket and work on it later.
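A possible shape for that detection logic, sketched in Python. All names are hypothetical, each record is a `(title, secondary_title)` tuple for illustration, the title is assumed to be present, and "already appended" is assumed to mean that the normalized title ends with the normalized secondary title.

```python
import re


def _key(text):
    """Lowercase and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", (text or "").lower())


def full_title(title, secondary_title):
    """Return 'Title: Secondary Title' unless the title already ends with it."""
    if not secondary_title:
        return title
    if _key(title).endswith(_key(secondary_title)):
        # The secondary title is already part of the main title.
        return title
    return f"{title}: {secondary_title}"


def merged_title(pub_a, pub_b):
    """Combine each record's title fields, then keep the longer result."""
    return max(full_title(*pub_a), full_title(*pub_b), key=len)
```

Merging `("Math", "An Interesting Subject")` with `("Math: An Interesting Subject", None)` then yields the single title "Math: An Interesting Subject", which avoids the doubled subtitle.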

Regarding ISSNs: I do see some ISSNs stored like this in RMD: "1234-1234 (Print) 1234-4321 (Linking)". When matching, I can strip all non-numeric characters except for "X", and check whether one record's ISSN is included in the other's (and vice versa). When merging, would "1234-1234 (Print) 1234-4321 (Linking)" be preferable to "1234-1234"?

anaelizabethenriquez commented 2 years ago

When merging ISSN, I think it's preferable to end up with a single value that's a valid ISSN (i.e., 1234-1234) rather than a field that contains extra info but isn't machine-readable. One day we may want to look up info about the journal using the ISSN.
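The matching and merging ideas for ISSNs discussed here could look roughly like the Python sketch below. The function names are hypothetical. The tiebreaker in `merged_issn` (prefer an ISSN found in both records, otherwise the first one found) is an assumption that the thread does not decide, and the check digit is not verified.

```python
import re

# An ISSN-shaped token: four digits, optional hyphen, three digits, then a digit or X.
ISSN_PATTERN = re.compile(r"\b(\d{4})-?(\d{3}[\dXx])\b")


def issn_digits(raw):
    """Strip everything except digits and an uppercase 'X'."""
    return re.sub(r"[^0-9X]", "", raw)


def issns_match(a, b):
    """True if either stripped value is contained in the other."""
    if not a or not b:
        return True
    da, db = issn_digits(a), issn_digits(b)
    return da in db or db in da


def extract_issns(raw):
    """Return every ISSN-shaped token, formatted as NNNN-NNNC."""
    return [f"{head}-{tail.upper()}" for head, tail in ISSN_PATTERN.findall(raw or "")]


def merged_issn(a, b):
    """Pick a single ISSN: one shared by both records if possible, else the first found."""
    found_a, found_b = extract_issns(a), extract_issns(b)
    shared = [issn for issn in found_a if issn in found_b]
    candidates = shared or found_a or found_b
    return candidates[0] if candidates else None
```

For "1234-1234" and "1234-1234 (Print) 1234-4321 (Linking)", the two values match and the merged result is the single value "1234-1234".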

ajkiessl commented 2 years ago

@anaelizabethenriquez I have a PR ready to be merged for this. I apologize; I was not feeling well on and off last week and did not get as far on this as I would have liked. Right now it clears out ~200 duplicate publication groups from the ~5250 that we currently have. It resolves more duplicates than that, though. Many duplicate groups contain more than two publications, at least one of which does not meet the merge criteria for this new feature. In those groups some duplicates get merged, but the group itself is not fully resolved.
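To illustrate why a group can be partly merged without being cleared, here is a small Python sketch of a pairwise group loop. Everything in it is hypothetical: publications are plain dicts, `same_doi` stands in for the full set of matching criteria, and `keep_first` stands in for the real attribute-by-attribute merge.

```python
def same_doi(a, b):
    """Stand-in matching rule: both DOIs present and identical."""
    return bool(a["doi"]) and a["doi"] == b["doi"]


def keep_first(a, b):
    """Stand-in merge: keep the first record unchanged."""
    return a


def resolve_group(publications, can_merge, merge):
    """Merge pairs one at a time until no remaining pair passes can_merge."""
    remaining = list(publications)
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(remaining)):
            for j in range(i + 1, len(remaining)):
                if can_merge(remaining[i], remaining[j]):
                    remaining[i] = merge(remaining[i], remaining[j])
                    del remaining[j]
                    merged_any = True
                    break
            if merged_any:
                break
    return remaining


def group_resolved(remaining):
    """A group can be deleted once at most one publication is left."""
    return len(remaining) <= 1
```

In a group of three where two publications share a DOI and the third has none, the first two merge, two publications remain, and the group is kept for manual review.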

I just wanted to mention a few other things that I changed since the last discussion on this.

Contributor names

I could not come up with a reliable way to separate out names that have the same first letter of their first name and the same last name (but are actually different people) by analyzing the names further. I did discover, however, that the position data we have appears to be pretty reliable, so I am now finding unique contributor names on the first letter of the first name, the whole last name, and the position. This takes care of instances where two different contributors may have names like John Smith vs Jane Smith, assuming their position data is correct. My analysis of the position data may be naive, but I checked a lot of records before and after the merge, and it seemed to be solid and consistent data.
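The revised uniqueness key can be sketched in Python as below. The names are hypothetical, contributors are `(first_name, last_name, position)` tuples for illustration, and keeping the fuller first name within a key follows the earlier "fuller name" rule from this thread.

```python
def contributor_key(first_name, last_name, position):
    """First initial, whole last name, and position identify one contributor."""
    return (first_name[:1].lower(), last_name.lower(), position)


def unique_contributors(names):
    """Keep one (first_name, last_name, position) tuple per key, preferring fuller first names."""
    best = {}
    for name in names:
        key = contributor_key(*name)
        if key not in best or len(name[0]) > len(best[key][0]):
            best[key] = name
    return list(best.values())
```

With positions included, ("John", "Smith", 1) and ("Jane", "Smith", 4) stay separate, while ("J", "Smith", 1) and ("John", "Smith", 1) collapse into the fuller "John Smith".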

Journals

We have three related fields for publications: journal, journal_title, and publisher_name. The journal field links to an actual journal record in RMD that stores metadata for that journal, and that journal record in turn links to a publisher record. The journal_title and publisher_name fields just contain strings. When merging, if a journal record is attached to either of the merging publications, I give it preference and set journal_title and publisher_name to blank values, even if one of the merging publications has these values set.
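A small Python sketch of that preference rule. Publications are plain dicts here for illustration, `None` is assumed to represent a blank value, and when no journal record exists the string fields are assumed to fall back to whichever publication has them.

```python
def merge_journal_fields(pub_a, pub_b):
    """Prefer a linked journal record; otherwise keep the free-text fields."""
    journal = pub_a.get("journal") or pub_b.get("journal")
    if journal is not None:
        # A journal record already carries the title and publisher, so blank the strings.
        return {"journal": journal, "journal_title": None, "publisher_name": None}
    return {
        "journal": None,
        "journal_title": pub_a.get("journal_title") or pub_b.get("journal_title"),
        "publisher_name": pub_a.get("publisher_name") or pub_b.get("publisher_name"),
    }
```

If one publication links to a journal record and the other only has a journal_title string, the merged result keeps the link and leaves both string fields blank.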

Publication type

When matching publication type, if one of the publication types is 'Other' and the other is not 'Other', then these publications can merge. When merging, the publication type that is not 'Other' is given preference.
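Putting the 'Other' rule together with the earlier publication type rules from this thread gives roughly the following Python sketch. The function names are hypothetical, and both publication types are assumed to be present.

```python
def publication_types_match(a, b):
    """Same type, both some kind of journal article, or exactly one is 'Other'."""
    if a == b:
        return True
    if "Journal Article" in a and "Journal Article" in b:
        return True
    # a and b differ at this point, so at most one of them can be 'Other'.
    return "Other" in (a, b)


def merged_publication_type(a, b):
    """Prefer the type that is not 'Other'; otherwise keep the longer one."""
    if a == "Other":
        return b
    if b == "Other":
        return a
    return a if len(a) >= len(b) else b
```

For instance, 'Other' and 'Book' can merge and the result is 'Book', while 'Book' and 'Chapter' do not match.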

anaelizabethenriquez commented 2 years ago

@ajkiessl This all looks good to me. Thank you for working on this!