sul-dlss / sul_pub

SUL system for harvesting and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu

Investigate duplicate publication identifiers in pub_hash #8

Open mejackreed opened 8 years ago

mejackreed commented 8 years ago

From @darrenleeweber on February 5, 2016 20:41

The sul-cap-dev platform has some publication data with duplicate publication identifiers in the pub_hash, e.g.

"identifier": [{
        "type": "PMID",
        "id": "10000166",
        "url": "http://www.ncbi.nlm.nih.gov/pubmed/10000166"
    }, {
        "type": "SULPubId",
        "id": "1",
        "url": "http://sulcap.stanford.edu/publications/1"
    }, {
        "type": "SULPubId",
        "id": "1",
        "url": "http://sulcap.stanford.edu/publications/1"
    }],
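One way to spot and collapse repeated entries like the duplicated SULPubId above is to key each identifier on its type and id. This is only a minimal sketch: `duplicate_identifiers` and `dedupe_identifiers` are hypothetical helper names, not part of sul_pub, and the example assumes the pub_hash is a plain Ruby hash with symbol keys.

```ruby
# Hypothetical helpers (not part of sul_pub); assumes symbol keys in pub_hash.

# Returns the [type, id] pairs that appear more than once.
def duplicate_identifiers(pub_hash)
  pub_hash.fetch(:identifier, [])
          .group_by { |ident| [ident[:type], ident[:id]] }
          .select { |_key, entries| entries.size > 1 }
          .keys
end

# Returns a copy of pub_hash keeping the first entry for each [type, id] pair.
def dedupe_identifiers(pub_hash)
  unique = pub_hash.fetch(:identifier, []).uniq { |ident| [ident[:type], ident[:id]] }
  pub_hash.merge(identifier: unique)
end

pub_hash = {
  identifier: [
    { type: 'PMID', id: '10000166', url: 'http://www.ncbi.nlm.nih.gov/pubmed/10000166' },
    { type: 'SULPubId', id: '1', url: 'http://sulcap.stanford.edu/publications/1' },
    { type: 'SULPubId', id: '1', url: 'http://sulcap.stanford.edu/publications/1' }
  ]
}

p duplicate_identifiers(pub_hash) # prints [["SULPubId", "1"]]
cleaned = dedupe_identifiers(pub_hash)
puts cleaned[:identifier].size # prints 2
```

Keying on type and id rather than on the whole entry is a design choice: two entries that differ only in their url would still count as duplicates.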

Copied from original issue: sul-dlss/sul-pub#46

mejackreed commented 8 years ago

From @peetucket on April 13, 2016 23:10

Example of a duplicate publication in production:

cap_profile_id='45761'

Two publications that are identical except for a "." at the end of one title: "Insights on the marine microbial nitrogen cycle from isotopic approaches to nitrification." and "Insights on the marine microbial nitrogen cycle from isotopic approaches to nitrification"

publication id = 239481 (sciencewireID=61620564, pmid=23091468)
publication id = 308362 (sciencewireID=65369473, pmid=blank)
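Duplicates of this kind, where the titles differ only by trailing punctuation, could be surfaced by grouping publications on a normalized title. The sketch below is illustrative only: `normalize_title` and `possible_duplicates` are hypothetical names, the records are plain hashes rather than sul_pub models, and matching on title alone is an assumption that would likely need extra checks (authors, year) in practice.

```ruby
# Hypothetical helpers; records are plain hashes, not ActiveRecord models.

# Lowercases, collapses whitespace, and drops trailing punctuation and spaces.
def normalize_title(title)
  title.to_s.strip.downcase.sub(/[[:punct:]\s]+\z/, '').gsub(/\s+/, ' ')
end

# Returns arrays of publication ids whose normalized titles match.
def possible_duplicates(publications)
  publications.group_by { |pub| normalize_title(pub[:title]) }
              .values
              .select { |group| group.size > 1 }
              .map { |group| group.map { |pub| pub[:id] } }
end

publications = [
  { id: 239481,
    title: 'Insights on the marine microbial nitrogen cycle from isotopic approaches to nitrification.' },
  { id: 308362,
    title: 'Insights on the marine microbial nitrogen cycle from isotopic approaches to nitrification' }
]

p possible_duplicates(publications) # prints [[239481, 308362]]
```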

dazza-codes commented 7 years ago

Similar work on authorship duplicates was done in

peetucket commented 7 years ago

This should be cleaned up by rebuilding the pub_hashes after cleaning up the publication identifiers table (work in #285).

dazza-codes commented 7 years ago

Yes, although the work in #285 and the similar identifier work in this sprint focus only on removing empty values, discarding invalid ones, and normalizing the rest. In other words, that work will only touch a subset of the PublicationIdentifiers (and some of it has not updated the associated Publication.pub_hash data). This issue is about inspecting the entire set of Publications, so it's best to do it after the cleanup tasks.
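The three cleanup steps mentioned above (drop empty values, discard invalid ones, normalize the rest) can be sketched roughly as follows. The rules here are assumptions made for illustration and are not taken from #285: "empty" means a nil or blank id, "invalid" means a PMID that is not all digits, and "normalize" means stripping surrounding whitespace. `clean_identifiers` is a hypothetical name.

```ruby
# Hypothetical cleanup pass; the validity rules below are assumptions, not the rules used in #285.
def clean_identifiers(identifiers)
  identifiers.map do |ident|
    id = ident[:id].to_s.strip                            # normalize: trim whitespace
    next if id.empty?                                     # remove empty values
    next if ident[:type] == 'PMID' && id !~ /\A\d+\z/     # discard invalid PMIDs (assumed rule)
    ident.merge(id: id)
  end.compact
end

raw = [
  { type: 'PMID', id: ' 23091468 ' },
  { type: 'PMID', id: '' },
  { type: 'PMID', id: 'abc' },
  { type: 'doi', id: nil }
]

p clean_identifiers(raw)
```

A pass like this only touches the identifier records themselves. As the comment notes, the associated pub_hash data would still need to be rebuilt before checking the full set of publications for duplicates.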