ualbertalib / discovery

Discovery is the University of Alberta Libraries' catalogue interface, built using Blacklight
http://search.library.ualberta.ca

[Discovery] [Spike] Coverage Information/Title for SFX Data - SFX - holdings are being deduplicated #1473

Open seanluyk opened 5 years ago

seanluyk commented 5 years ago

Describe the bug: Journal packages with multiple access points with the same title are being deduplicated in the record (or at ingest?). This means that our journal holdings may not be accurately represented in certain cases.

To Reproduce Steps to reproduce the behavior:

  1. Compare SFX Econometrica record with Blacklight Econometrica Record
  2. Note that in the former, 3 holdings for ProQuest ABI/INFORM Collection with different coverage dates appear, and only 1 does in BL

Screenshot from 2019-03-29 14-42-08 Screenshot from 2019-03-29 14-41-48

Expected behavior: Holdings showing in GetIt! window match those in BL exactly.

Additional context: This is another strike against data integration in BL. An alternative would be to simply index journal titles and metadata, and link out to SFX for access. In some ways, this is cleaner as it maintains a separation of concerns and removes a dependency on a third-party system.

theLinkResolver commented 5 years ago

See #1079 and #1067

weiweishi commented 5 years ago

We can spike on where the coverage information is indexed, and can we use both the Journal title and the coverage to identify duplicates?
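As a rough illustration of that idea, here is a minimal Ruby sketch that treats two holdings as duplicates only when both the title and the coverage match. The `Holding` struct, the `unique_holdings` helper, and the coverage strings are all hypothetical; this is not code from the Discovery repository, and it assumes coverage is available per holding.

```ruby
# Hypothetical holding: the title is taken from this issue, the coverage strings are invented.
Holding = Struct.new(:title, :coverage)

# Keep one holding per (title, coverage) pair, preserving the original order.
def unique_holdings(holdings)
  holdings.uniq { |h| [h.title, h.coverage] }
end

HOLDINGS = [
  Holding.new('ProQuest ABI/INFORM Collection', '1933-1959'),
  Holding.new('ProQuest ABI/INFORM Collection', '1960-1987'),
  Holding.new('ProQuest ABI/INFORM Collection', '1988-'),
  Holding.new('ProQuest ABI/INFORM Collection', '1988-') # exact duplicate
].freeze

unique_holdings(HOLDINGS).each { |h| puts "#{h.title}: #{h.coverage}" }
```

With a composite key like this, holdings that share a title but differ in coverage survive, and only exact repeats are dropped.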

pgwillia commented 5 years ago

The 3 holdings for ProQuest ABI/INFORM Collection are not showing up because we are using $s as a unique identifier, but the holdings with different coverage all have the same value: 2550000000000007.

Screenshot from 2019-03-29 13-51-01

https://github.com/ualbertalib/discovery/blob/8dfbb16ccfc34ada4167086e78745a98659f7b81/app/services/sfx_service.rb#L23
https://github.com/ualbertalib/discovery/blob/8dfbb16ccfc34ada4167086e78745a98659f7b81/app/services/sfx_service.rb#L57-L59

Then we look up the link from SFX:
https://github.com/ualbertalib/discovery/blob/8dfbb16ccfc34ada4167086e78745a98659f7b81/app/services/sfx_service.rb#L28-L29
https://github.com/ualbertalib/discovery/blob/8dfbb16ccfc34ada4167086e78745a98659f7b81/app/services/sfx_service.rb#L81-L83

<target>
   <target_name>PROQUEST_ABI_INFORM_COLLECTION</target_name>
   <target_public_name>ProQuest ABI/INFORM Collection</target_public_name>
   <target_service_id>2550000000000007</target_service_id>
   <service_type>getFullTxt</service_type>
   <parser>PROQUEST::open</parser>
   <parse_param>url=http://gateway.proquest.com/openurl &amp; clientid= &amp; url2=https://search.proquest.com&amp;jkey=48033</parse_param>
   <proxy>no</proxy>
   <crossref>no</crossref>
   <note/>
   <authentication>&lt;iframe src="https://tal.scholarsportal.info/alberta/sfx/?tag=ProQuest_TAL" width="100%" height="40" align="middle" frameborder="0" scrolling="no"&gt;&lt;p&gt;Your browser does not support iframes.&lt;/p&gt;&lt;/iframe&gt;
</authentication>
   <char_set>utf8</char_set>
   <displayer/>
   <target_url>http://login.ezproxy.library.ualberta.ca/login?url=http://gateway.proquest.com/openurl?rft_id=48033&amp;res_dat=xri%3Apqm&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;url_ver=Z39.88-2004&amp;genre=journal</target_url>
</target>
<target>
   <target_name>PROQUEST_ABI_INFORM_COLLECTION</target_name>
   <target_public_name>ProQuest ABI/INFORM Collection</target_public_name>
   <target_service_id>2550000000000007</target_service_id>
   <service_type>getFullTxt</service_type>
   <parser>PROQUEST::open</parser>
   <parse_param>url=http://gateway.proquest.com/openurl &amp; clientid= &amp; url2=https://search.proquest.com&amp;jkey=48032</parse_param>
   <proxy>no</proxy>
   <crossref>no</crossref>
   <note/>
   <authentication>&lt;iframe src="https://tal.scholarsportal.info/alberta/sfx/?tag=ProQuest_TAL" width="100%" height="40" align="middle" frameborder="0" scrolling="no"&gt;&lt;p&gt;Your browser does not support iframes.&lt;/p&gt;&lt;/iframe&gt;
</authentication>
   <char_set>utf8</char_set>
   <displayer/>
   <target_url>http://login.ezproxy.library.ualberta.ca/login?url=http://gateway.proquest.com/openurl?rft_id=48032&amp;res_dat=xri%3Apqm&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;url_ver=Z39.88-2004&amp;genre=journal</target_url>
</target>
<target>
   <target_name>PROQUEST_ABI_INFORM_COLLECTION</target_name>
   <target_public_name>ProQuest ABI/INFORM Collection</target_public_name>
   <target_service_id>2550000000000007</target_service_id>
   <service_type>getFullTxt</service_type>
   <parser>PROQUEST::open</parser>
   <parse_param>url=http://gateway.proquest.com/openurl &amp; clientid= &amp; url2=https://search.proquest.com&amp;jkey=48030</parse_param>
   <proxy>no</proxy>
   <crossref>no</crossref>
   <note/>
   <authentication>&lt;iframe src="https://tal.scholarsportal.info/alberta/sfx/?tag=ProQuest_TAL" width="100%" height="40" align="middle" frameborder="0" scrolling="no"&gt;&lt;p&gt;Your browser does not support iframes.&lt;/p&gt;&lt;/iframe&gt;
</authentication>
   <char_set>utf8</char_set>
   <displayer/>
   <target_url>http://login.ezproxy.library.ualberta.ca/login?url=http://gateway.proquest.com/openurl?rft_id=48030&amp;res_dat=xri%3Apqm&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;url_ver=Z39.88-2004&amp;genre=journal</target_url>
</target>

From what I see here there isn't a way to uniquely identify the holding with coverage by that id.
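To make the collision concrete, here is a minimal Ruby sketch. It is not the code in sfx_service.rb: the `Target` struct, the helper names, and the shortened URLs are assumptions made for illustration. Only the service id, the public name, and the three rft_id values are taken from the SFX response above.

```ruby
# Abridged view of the three <target> elements above (URLs shortened for readability).
Target = Struct.new(:service_id, :public_name, :target_url)

TARGETS = [48033, 48032, 48030].map do |rft_id|
  Target.new(
    '2550000000000007',
    'ProQuest ABI/INFORM Collection',
    "http://gateway.proquest.com/openurl?rft_id=#{rft_id}"
  )
end.freeze

# Keying a hash on the service id: later targets overwrite earlier ones.
def index_by_service_id(targets)
  targets.each_with_object({}) { |t, index| index[t.service_id] = t }
end

# Grouping on the service id: every target is kept under the shared key.
def group_by_service_id(targets)
  targets.group_by(&:service_id)
end

puts index_by_service_id(TARGETS).size                       # entries left after keying by id
puts group_by_service_id(TARGETS)['2550000000000007'].size   # targets kept when grouping
```

Keyed by service id alone, the three targets collapse into one entry, while the target URLs themselves are all different. Nothing in the target element says which coverage range each URL belongs to, which is the gap described above.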

pgwillia commented 5 years ago

> We can spike on where the coverage information is indexed, and can we use both the Journal title and the coverage to identify duplicates?

Coverage information is not indexed specifically. The MARC XML record is parsed to get the 866 field information ($a is the coverage), and then SFX is queried for the target URL. In the response from SFX I don't see any way to inspect the coverage of each target. Does someone else know more about that?
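For anyone following along, the 866 parsing step could look roughly like the sketch below. This is an assumption-laden illustration, not the actual Discovery code: the sample record, the coverage strings, and the `holdings_from_marcxml` helper are hypothetical, and only the roles of $a (coverage) and $s (identifier) come from this thread. It uses REXML from Ruby's standard distribution.

```ruby
require 'rexml/document'

# Hypothetical MARC XML fragment. The coverage strings are invented;
# the shared $s value is the one reported in this issue.
SAMPLE_MARCXML = <<~XML
  <record>
    <datafield tag="866" ind1=" " ind2=" ">
      <subfield code="a">Available from 1933 until 1959</subfield>
      <subfield code="s">2550000000000007</subfield>
    </datafield>
    <datafield tag="866" ind1=" " ind2=" ">
      <subfield code="a">Available from 1960 until 1987</subfield>
      <subfield code="s">2550000000000007</subfield>
    </datafield>
    <datafield tag="866" ind1=" " ind2=" ">
      <subfield code="a">Available from 1988</subfield>
      <subfield code="s">2550000000000007</subfield>
    </datafield>
  </record>
XML

# Returns one { coverage:, service_id: } hash per 866 field in the record.
def holdings_from_marcxml(xml)
  record = REXML::Document.new(xml).root
  fields = record.elements.to_a.select do |e|
    e.name == 'datafield' && e.attributes['tag'] == '866'
  end
  fields.map do |field|
    # Map subfield code => text for this 866 field.
    subfields = field.elements.to_a.each_with_object({}) do |sf, codes|
      codes[sf.attributes['code']] = sf.text
    end
    { coverage: subfields['a'], service_id: subfields['s'] }
  end
end

holdings_from_marcxml(SAMPLE_MARCXML).each do |h|
  puts "#{h[:service_id]}: #{h[:coverage]}"
end
```

In this sketch the coverage is distinct on the MARC side even though $s repeats, so the information is only lost once $s is used as the sole key for the SFX lookup.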

seanluyk commented 5 years ago

@pgwillia the best person to ask would be Abigail, she may have some ideas on the data side

TracyKitagawa commented 5 years ago

Reporting this problem once again, this time with Foreign Affairs (ISSN 0015-7120, SFX object ID 954921343175, target IDs 2550000000000007 and 2610000000000073; see also OTRS ticket #20190730108). Abigail will be adding additional commentary re: the proposed solution.

abigailsparling commented 5 years ago

Is it possible to scrap the deduplication happening here altogether? The serials team discussed this and we think if a solution can't be reached based on the data we have to work with, it would be preferable to display true duplicates so that other unique access points aren't being suppressed. Let me know if we need to meet further to analyse the data and discuss our options.

ghost commented 5 years ago

This deduplication is news to me, I think. I'll have to look at it and see what's going on, but I'm sure we can stop it if it's not helpful.

On Tue, Aug 20, 2019, 2:01 PM Abigail Sparling, notifications@github.com wrote:

Is it possible to scrap the deduplication happening here altogether? The serials team discussed this and we think if a solution can't be reached based on the data we have to work with, it would be preferable to display true duplicates so that other unique access points aren't being suppressed. Let me know if we need to meet further to analyse the data and discuss our options.


abigailsparling commented 5 years ago

We've reached out to ProQuest to see if they are addressing this duplication on their end and it appears that a fix is not imminent. Any movement on fixing this on our end?

weiweishi commented 5 years ago

Sorry Abigail. We spiked to look into this issue. Based on the information we currently have and the data sources we draw information from, there is no easy and consistent way to fix this within our current system. Sam, let’s discuss how and when we can schedule the work to display the duplicated holding information.

Cheers!

Weiwei

On Sep 20, 2019, at 9:45 AM, Abigail Sparling notifications@github.com wrote:

We've reached out to ProQuest to see if they are addressing this duplication on their end and it appears that a fix is not imminent. Any movement on fixing this on our end?
