Open seanluyk opened 5 years ago
See #1079 and #1067
We can spike on where the coverage information is indexed, and can we use both the Journal title and the coverage to identify duplicates?
The 3 holdings for ProQuest ABI/INFORM Collection are not showing up because we use subfield $s as a unique identifier, but the holdings, each with different coverage, all share the same value: 2550000000000007
https://github.com/ualbertalib/discovery/blob/8dfbb16ccfc34ada4167086e78745a98659f7b81/app/services/sfx_service.rb#L23 https://github.com/ualbertalib/discovery/blob/8dfbb16ccfc34ada4167086e78745a98659f7b81/app/services/sfx_service.rb#L57-L59
Then we look up the link from SFX https://github.com/ualbertalib/discovery/blob/8dfbb16ccfc34ada4167086e78745a98659f7b81/app/services/sfx_service.rb#L28-L29 https://github.com/ualbertalib/discovery/blob/8dfbb16ccfc34ada4167086e78745a98659f7b81/app/services/sfx_service.rb#L81-L83
<target>
<target_name>PROQUEST_ABI_INFORM_COLLECTION</target_name>
<target_public_name>ProQuest ABI/INFORM Collection</target_public_name>
<target_service_id>2550000000000007</target_service_id>
<service_type>getFullTxt</service_type>
<parser>PROQUEST::open</parser>
<parse_param>url=http://gateway.proquest.com/openurl & clientid= & url2=https://search.proquest.com&jkey=48033</parse_param>
<proxy>no</proxy>
<crossref>no</crossref>
<note/>
<authentication><iframe src="https://tal.scholarsportal.info/alberta/sfx/?tag=ProQuest_TAL" width="100%" height="40" align="middle" frameborder="0" scrolling="no"><p>Your browser does not support iframes.</p></iframe>
</authentication>
<char_set>utf8</char_set>
<displayer/>
<target_url>http://login.ezproxy.library.ualberta.ca/login?url=http://gateway.proquest.com/openurl?rft_id=48033&res_dat=xri%3Apqm&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&url_ver=Z39.88-2004&genre=journal</target_url>
</target>
<target>
<target_name>PROQUEST_ABI_INFORM_COLLECTION</target_name>
<target_public_name>ProQuest ABI/INFORM Collection</target_public_name>
<target_service_id>2550000000000007</target_service_id>
<service_type>getFullTxt</service_type>
<parser>PROQUEST::open</parser>
<parse_param>url=http://gateway.proquest.com/openurl & clientid= & url2=https://search.proquest.com&jkey=48032</parse_param>
<proxy>no</proxy>
<crossref>no</crossref>
<note/>
<authentication><iframe src="https://tal.scholarsportal.info/alberta/sfx/?tag=ProQuest_TAL" width="100%" height="40" align="middle" frameborder="0" scrolling="no"><p>Your browser does not support iframes.</p></iframe>
</authentication>
<char_set>utf8</char_set>
<displayer/>
<target_url>http://login.ezproxy.library.ualberta.ca/login?url=http://gateway.proquest.com/openurl?rft_id=48032&res_dat=xri%3Apqm&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&url_ver=Z39.88-2004&genre=journal</target_url>
</target>
<target>
<target_name>PROQUEST_ABI_INFORM_COLLECTION</target_name>
<target_public_name>ProQuest ABI/INFORM Collection</target_public_name>
<target_service_id>2550000000000007</target_service_id>
<service_type>getFullTxt</service_type>
<parser>PROQUEST::open</parser>
<parse_param>url=http://gateway.proquest.com/openurl & clientid= & url2=https://search.proquest.com&jkey=48030</parse_param>
<proxy>no</proxy>
<crossref>no</crossref>
<note/>
<authentication><iframe src="https://tal.scholarsportal.info/alberta/sfx/?tag=ProQuest_TAL" width="100%" height="40" align="middle" frameborder="0" scrolling="no"><p>Your browser does not support iframes.</p></iframe>
</authentication>
<char_set>utf8</char_set>
<displayer/>
<target_url>http://login.ezproxy.library.ualberta.ca/login?url=http://gateway.proquest.com/openurl?rft_id=48030&res_dat=xri%3Apqm&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&url_ver=Z39.88-2004&genre=journal</target_url>
</target>
From what I see here, there is no way to uniquely identify a holding and its coverage by that ID: all three targets carry the same target_service_id.
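To make the collision concrete, here is a minimal Ruby sketch of what happens when the three targets above are keyed on the service ID alone. This is not the actual code in sfx_service.rb; the `Target` struct, the method names, and the shortened URLs (trimmed to the part that differs, the rft_id) are assumptions for illustration only. The second method shows a hypothetical composite key that keeps all three entries, although it still says nothing about which coverage statement belongs to which URL.

```ruby
# Simplified stand-ins for the three <target> elements above.
# URLs are shortened to the portion that differs between targets (rft_id).
Target = Struct.new(:service_id, :target_url)

TARGETS = [
  Target.new("2550000000000007", "http://gateway.proquest.com/openurl?rft_id=48033"),
  Target.new("2550000000000007", "http://gateway.proquest.com/openurl?rft_id=48032"),
  Target.new("2550000000000007", "http://gateway.proquest.com/openurl?rft_id=48030")
].freeze

# Keying on the service ID alone: each later target overwrites the previous
# one, so only a single entry survives.
def index_by_service_id(targets)
  targets.each_with_object({}) { |t, index| index[t.service_id] = t }
end

# Hypothetical alternative: key on the service ID plus the target URL.
# All three entries survive, but nothing here ties a URL to a coverage range.
def index_by_id_and_url(targets)
  targets.each_with_object({}) { |t, index| index[[t.service_id, t.target_url]] = t }
end

puts index_by_service_id(TARGETS).size
puts index_by_id_and_url(TARGETS).size
```

With the sample data, the first index holds one entry and the second holds three, which matches the symptom reported here: three holdings in the record, one shown.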
We can spike on where the coverage information is indexed, and can we use both the Journal title and the coverage to identify duplicates?
Coverage information is not indexed specifically. The MARC XML record is parsed to get the 866 field information ($a is the coverage), and then SFX is queried for the target URL. In the response from SFX I don't see any way to inspect the coverage of each target. Does someone else know more about that?
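The 866 parsing step described above can be sketched roughly as follows, using REXML from the Ruby standard distribution. The MARC XML fragment is a hypothetical, simplified sample (no namespace, placeholder coverage strings), and `holdings_from_marc` is an illustrative name, not a method from the discovery codebase. It assumes $s sits in the same 866 field as $a.

```ruby
require "rexml/document"

# Hypothetical, simplified MARC XML: real records carry the MARC21 slim
# namespace and real coverage statements. The coverage text is a placeholder.
MARC_XML = <<~XML
  <record>
    <datafield tag="866" ind1=" " ind2=" ">
      <subfield code="a">coverage statement A (placeholder)</subfield>
      <subfield code="s">2550000000000007</subfield>
    </datafield>
    <datafield tag="866" ind1=" " ind2=" ">
      <subfield code="a">coverage statement B (placeholder)</subfield>
      <subfield code="s">2550000000000007</subfield>
    </datafield>
    <datafield tag="866" ind1=" " ind2=" ">
      <subfield code="a">coverage statement C (placeholder)</subfield>
      <subfield code="s">2550000000000007</subfield>
    </datafield>
  </record>
XML

# Collect the coverage ($a) and identifier ($s) from every 866 field.
def holdings_from_marc(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//datafield[@tag='866']").map do |field|
    {
      coverage: field.elements["subfield[@code='a']"]&.text,
      service_id: field.elements["subfield[@code='s']"]&.text
    }
  end
end

holdings_from_marc(MARC_XML).each do |holding|
  puts "#{holding[:service_id]}: #{holding[:coverage]}"
end
```

The coverage is distinct per 866 field on the MARC side, but the only value shared with the SFX response is the $s identifier, which is identical across all three, so there is nothing in this data to pair a given coverage with a given target URL.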
@pgwillia the best person to ask would be Abigail, she may have some ideas on the data side
Reporting the problem once again, this time with Foreign Affairs (ISSN 0015-7120, SFX object ID 954921343175, target IDs 2550000000000007 and 2610000000000073); also see OTRS ticket # 20190730108. Abigail will be adding additional commentary on the proposed solution.
Is it possible to scrap the deduplication happening here altogether? The serials team discussed this and we think if a solution can't be reached based on the data we have to work with, it would be preferable to display true duplicates so that other unique access points aren't being suppressed. Let me know if we need to meet further to analyse the data and discuss our options.
This deduplication is news to me, I think. I'll have to look at it and see what's going on, but I'm sure we can stop it if it's not helpful.
We've reached out to ProQuest to see if they are addressing this duplication on their end and it appears that a fix is not imminent. Any movement on fixing this on our end?
Sorry Abigail. We spiked to look into this issue. Based on the information we currently have and the data sources we draw from, there is no easy and consistent way to fix this within our current system. Sam, let’s discuss how and when we can schedule the work to display the duplicated holding information.
Cheers!
Weiwei
Describe the bug: Journal packages with multiple access points under the same title are being deduplicated in the record (or at ingest?). As a result, our journal holdings may not be accurately represented in certain cases.
To Reproduce: Steps to reproduce the behavior:
Expected behavior: Holdings shown in the GetIt! window match those in BL exactly.
Additional context: This is another strike against data integration in BL. An alternative would be to simply index journal titles and metadata, and link out to SFX for access. In some ways this is cleaner, as it maintains a separation of concerns and removes dependencies on a third-party system.