ualbertalib / discovery

Discovery is the University of Alberta Libraries' catalogue interface, built using Blacklight
http://search.library.ualberta.ca
12 stars 3 forks

Deduplicate Journal records #900

Closed ghost closed 8 years ago

ghost commented 8 years ago

  • ingest Symphony records
  • print journals display as usual
  • electronic journals (if 940 = object ID) display SFX holdings instead of print holdings
  • for SFX records, preprocess by searching the Symphony WS by object ID. We expect to retrieve 1 record: if 0 or 1 record is retrieved, ingest. If more than 1 record is retrieved, write to the log file and skip the ingest.
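The preprocessing step in the last bullet can be sketched as follows. This is a minimal illustration, not the actual ingest code: `symphony_lookup` stands in for the real Symphony web-service call (here stubbed with an in-memory hash), and the object IDs and catkeys are made up.

```ruby
# Stub for the Symphony WS: SFX object ID => matching Symphony catkeys.
# All IDs below are illustrative.
SYMPHONY_INDEX = {
  "110978979433681" => ["a1234567"],             # exactly one match: ingest
  "110978979433682" => [],                       # no match: ingest
  "110978979433683" => ["a2345678", "a3456789"], # more than one match: log and skip
}.freeze

def symphony_lookup(object_id)
  SYMPHONY_INDEX.fetch(object_id, [])
end

# Returns true when the SFX record should be ingested (0 or 1 Symphony
# match); logs and returns false when more than one record matches.
def ingest_sfx_record?(object_id, log: [])
  matches = symphony_lookup(object_id)
  if matches.size > 1
    log << "duplicate Symphony records for SFX object ID #{object_id}: #{matches.join(', ')}"
    false
  else
    true
  end
end
```

In the real pipeline the log entries would go to a file for bib services to review rather than to an in-memory array.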

ghost commented 8 years ago

OK, the actual skipping of extraneous SFX records seems to work OK. I need to index a lot of records in order to get some "Symphony records with SFX equivalents" in order to modify the display. I'm going to leave the Symphony record set indexing over the weekend and I'll see if I can finish that up on Monday.

ghost commented 8 years ago

So, this logic is sound, but the process is too time-consuming. Processing all 100,000 or so SFX records, with a WS call for each, takes far too long. I have an idea for another way to approach this that I'll have to run by Neil, Tricia, and Henry.

If we ingest the SFX records first, then they already exist in the index. When we ingest the Symphony records, any record that has one-and-only-one 940 value that is not "NO EJOURNAL HOLDINGS" will use the 940 as its ID rather than the 001. This means that if there is an ID clash on indexing (which there will be, as both the SFX record and the Symphony record will be using the object ID as their ID), the Symphony record will overwrite the SFX record. Where there is no 940 value, or the value is "NO EJOURNAL HOLDINGS", the record continues to use the 001 as its ID. Where there is more than one 940 value, the catkey (001) should be reported to bib services. I think this will work, but it will require some testing, and I also want to make sure we aren't breaking best practices by making ingestion non-idempotent (i.e. order-dependent).
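The ID-selection rule above can be sketched like this. The hash keys (`:f001`, `:f940`) are illustrative stand-ins for the parsed MARC fields, and the comment doesn't say what happens to a multi-940 record beyond reporting it, so this sketch falls back to the 001 in that case.

```ruby
NO_HOLDINGS = "NO EJOURNAL HOLDINGS".freeze

# Pick the Solr document ID for a Symphony record.
# - exactly one usable 940: use it (deliberately clashing with, and
#   overwriting, the already-indexed SFX record)
# - no usable 940: keep the 001 (catkey)
# - more than one 940: report the catkey to bib services; fall back to
#   the 001 (the issue doesn't specify this case, so it is an assumption)
def solr_id_for(record, report: [])
  usable_940s = record[:f940].reject { |v| v == NO_HOLDINGS }
  case usable_940s.size
  when 0 then record[:f001]
  when 1 then usable_940s.first
  else
    report << record[:f001]
    record[:f001]
  end
end
```

Because the overwrite only happens when the Symphony record is indexed after the SFX record, this is exactly where the order-dependence concern comes from.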

ghost commented 8 years ago

It would be cleanest if your process could eliminate all Symphony journal records for which the MacEwan Internet library is the only library holding a copy.
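That filter is a one-liner if each record exposes its holding libraries; the `:libraries` field name below is hypothetical.

```ruby
MACEWAN_INTERNET = "MacEwan Internet".freeze

# True when the MacEwan Internet library is the only library with a
# copy, i.e. the record should be dropped from the ingest.
def skip_macewan_only?(record)
  record[:libraries].uniq == [MACEWAN_INTERNET]
end
```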

theLinkResolver commented 8 years ago

I'll log an update here from the Bib Services/Serials Team side (already communicated to @redlibrarian).

Given the dependence on the SFX object ID for the e-journal de-duplication process, I have been working with Jim to obtain reports of problem SFX object ID data in our catalogue records (940 field) to ensure the proposed de-duplication process is well served by the catalogue side of things. I have:

  • corrected records with two different SFX object IDs/940s (31 titles)
  • investigated, corrected, or replaced miscoded records containing an SFX openURL (50+ titles)
  • fixed records with multiple SFX openURLs (there should be only one) (52 titles)
  • investigated and fixed records with an SFX openURL but no SFX object ID (963 titles)
  • fixed records that contain an SFX object ID but no SFX openURL (42 titles)
  • resolved instances where more than one record contains the same SFX object ID (497 titles; in progress)

This work is also closely connected to more expeditiously removing dead links and improving overall processes around e-journal cataloguing.

sjshores commented 8 years ago

A huge undertaking, Scott, but very valuable work.

Thanks!

Sandra

ghost commented 8 years ago

OK, so, thanks to Scott's work, 940s in the Symphony data have been normalized. We can use these to match SFX records to Symphony records. Ideally, in the case of duplicated records, we would like to keep the Symphony information (because of the richer metadata) but use the SFX web services to pull in the holdings/subscription information.

There are two ways to go about this, neither of them ideal.

  1. We try to merge or dedupe the records. Updating (merging) existing records wasn't available until Solr 4 and isn't part of the default Blacklight workflow. So online/live merging (which is the best option) would require either waiting for Blacklight or tackling it ourselves (which would mean writing a whole new SFX ingest process and figuring out how Solr 4 handles merges). If we try offline merging, we have the same problem as before: our infrastructure isn't set up for it, and the process is extremely slow.
  2. Overwrite SFX records. This would work, though it would require rewriting the SFX holdings service to not use the targets from the SFX records (866 fields). However, it means that the order of ingestion becomes critical, and overwriting records also opens the door to potential (and hard-to-reason-about) problems with overwritten records. I'm not sure how to proceed. I've asked @sjshores, @kgood, @theLinkResolver, and @timwklassen for advice.

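For concreteness, option 1 would mean something like a Solr 4+ atomic update: send only the fields to change, with a `set` modifier, and Solr merges them into the stored document instead of replacing it. The sketch below just builds the JSON payload; the field name `sfx_holdings` and the IDs are illustrative, not the actual Discovery schema.

```ruby
require "json"

# Build a Solr atomic-update payload that replaces only the holdings
# field on an existing document, leaving all other stored fields intact.
# Field name and IDs are illustrative.
def atomic_holdings_update(object_id, holdings)
  [{
    "id"           => object_id,
    "sfx_holdings" => { "set" => holdings }, # "set" = atomic replace of this field
  }].to_json
end

payload = atomic_holdings_update("110978979433681",
                                 ["Wiley 1997-", "JSTOR 1980-1996"])
# This would be POSTed to <solr>/update with
# Content-Type: application/json. Unlike a full re-add (option 2),
# the Symphony metadata already in the document survives.
```

Note that atomic updates require the schema's fields to be stored, which is part of why "our infrastructure isn't set up for it".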
sjshores commented 8 years ago

Chiming in from beautiful sunny Whistler: I think we need an in-person meeting to discuss the details and ramifications of each choice. My current inclination is to wait for the BL community to take advantage of Solr's live merging of records, but I'd like to discuss further.

Sandra


ghost commented 8 years ago

Thanks @sjshores , that makes sense. I'm going to park this issue for now - we can discuss it when you're back. Hope you had fun skiing.

ghost commented 8 years ago

Referred to in #1002