k-int / gokb-phase1

Original GOKb repo - Moving to https://github.com/openlibraryenvironment/gokb
http://www.gokb.org

Packages ingesting partially in Refine -- GOKb test #485

Closed jhsolomon closed 8 years ago

jhsolomon commented 8 years ago

I have two packages that fail to ingest 100% in GOKb test:

ScienceDirectStandard_Global_BackfilePackage Biochemistry GeneticsAndMolecularBiology Legacy: 20160322 (87%)

Brill_online_journals: 20160322 (93%)

jhsolomon commented 8 years ago

And this package will not ingest at all: Project Muse: American Literature: 20160322. File attached: am_lit_projectmuse.txt


ianibo commented 8 years ago

We think this is caused by the copy of live to test, and by missing files. Can we retest with a completely clean file just to be on the safe side? If the problem still exists, we will need @sosguthorpe to dig in.

jhsolomon commented 8 years ago

@ianibo @sosguthorpe In Brill: Online Journals: 20160404 it is still failing to ingest 1 title: 135. Islamic Africa.

So far, I have not had this issue with other packages.

ianibo commented 8 years ago

Caused by a title which has an ISSN matching an eISSN - II to write report on duplicate identifiers.

ianibo commented 8 years ago

These are the identifiers on LIVE that have duplicates between ISSN and eISSN:

mysql> select id_value, count(*) from kbcomponent where id_value is not null group by id_value having count(*) > 1 limit 10;
+-----------+----------+
| id_value  | count(*) |
+-----------+----------+
| 0001-1843 |        2 |
| 0001-2092 |        3 |
| 0001-2793 |        2 |
| 0001-2815 |        2 |
| 0001-2998 |        2 |
| 0001-4001 |        2 |
| 0001-4346 |        2 |
| 0001-4575 |        2 |
| 0001-4842 |        2 |
| 0001-4966 |        2 |
+-----------+----------+
10 rows in set (2.70 sec)

On TEST there are more than 5000, which means something has gone horribly wrong. Am investigating what has happened on test.
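For anyone who wants to run the same kind of check outside the database, here is a minimal Python sketch of the grouping the query performs. The tuple layout `(namespace, value, title_id)` and the sample values are made up for illustration; they are not the GOKb schema or real identifiers.

```python
from collections import defaultdict


def find_shared_identifier_values(identifiers):
    """Return identifier values that are attached to more than one record.

    `identifiers` is an iterable of (namespace, value, title_id) tuples.
    This shape is hypothetical and only mirrors the idea of the SQL above:
    group by value, keep groups with more than one member.
    """
    owners = defaultdict(set)
    for namespace, value, title_id in identifiers:
        if value is not None:  # same filter as "id_value is not null"
            owners[value].add((namespace, title_id))
    return {value: sorted(found) for value, found in owners.items() if len(found) > 1}


# Made-up sample rows: one value is used as an ISSN on one title and as an eISSN on another.
rows = [
    ("issn", "1111-1111", 1),
    ("eissn", "1111-1111", 2),
    ("issn", "2222-2222", 2),
    ("eissn", None, 3),
]
duplicates = find_shared_identifier_values(rows)
print(duplicates)
```

Unlike the plain count in the query, this keeps the namespace and owning record for each duplicate, which makes it easier to see ISSN/eISSN clashes specifically.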

ianibo commented 8 years ago

On live we have 400+ identifier namespaces of taylor and francis. This is because at some point, we've had a "Taylor & Francis" and then a "taylor & francis" was added. At this point, a bug in the code caused a new namespace to be created every time. We've corrected the bug, but are left with 400+ T&F namespaces that need to be cleaned up. Working out the SQL...
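The thread does not show the buggy lookup or the fix, so the following is only a sketch of one way to avoid this class of problem: normalise the name before looking it up, so that "Taylor & Francis" and "taylor & francis" resolve to the same namespace. The function name and the dictionary registry are hypothetical.

```python
def get_or_create_namespace(registry, name):
    """Look a namespace up case-insensitively and create it only if missing.

    `registry` is a plain dict standing in for the namespace table
    (an assumption for this sketch, not the GOKb data model).
    """
    cleaned = name.strip()
    key = cleaned.casefold()  # case-insensitive comparison key
    if key not in registry:
        registry[key] = cleaned  # first spelling seen becomes the stored name
    return registry[key]


registry = {}
first = get_or_create_namespace(registry, "Taylor & Francis")
second = get_or_create_namespace(registry, "taylor & francis")
print(first, second, len(registry))
```

With a lookup like this, a second spelling finds the existing entry instead of falling through to "create" on every import.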

jhsolomon commented 8 years ago

Thanks, this is very helpful. Let me know if you would like me to re-ingest these packages or test in any way.



ianibo commented 8 years ago

In the title lookup service, we configure cross checks for identifier families. Previously, this was only defined as

"cross_checks" : [ ["issn", "eissn"], ],

This means that, depending upon which value comes in first, the ISSN or the eISSN, the reciprocal cross check might not fire. Have added the reciprocal

["eissn", "issn"],

@sosguthorpe, @ostephens I need a logic check here - does that sound right? This is checked in to dev at the moment, and we need to figure out a way to resolve the duplicate titles we already have.
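To illustrate the order dependence, here is a small Python sketch. It assumes each pair `[a, b]` means "when an identifier arrives in namespace `a`, also search namespace `b`"; that reading of the config, the function names, and the sample data are assumptions for illustration, not the actual title lookup service.

```python
def namespaces_to_search(namespace, cross_checks):
    """Return the incoming namespace plus any namespaces cross-checked from it."""
    result = [namespace]
    for source, target in cross_checks:
        if source == namespace and target not in result:
            result.append(target)
    return result


def find_title(known, namespace, value, cross_checks):
    """Look the value up in every namespace the cross checks allow."""
    for ns in namespaces_to_search(namespace, cross_checks):
        title = known.get((ns, value))
        if title is not None:
            return title
    return None


# Made-up data: the value is already stored as an ISSN on an existing title.
known = {("issn", "1111-1111"): "Example Journal"}
one_way = [["issn", "eissn"]]
both_ways = [["issn", "eissn"], ["eissn", "issn"]]

# The same value now arrives as an eISSN.
missed = find_title(known, "eissn", "1111-1111", one_way)     # no eissn -> issn check
matched = find_title(known, "eissn", "1111-1111", both_ways)  # reciprocal check fires
print(missed, matched)
```

Under this reading, the one-way config finds the existing title only when the value arrives as an ISSN; with the reciprocal pair it is found in either order.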

ianibo commented 8 years ago

Fixed boundary condition in importer chunking code. Retest please.
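The thread does not say what the boundary condition was. As a generic illustration only, one common form of this kind of bug is dropping the final, shorter chunk when the row count is not a multiple of the chunk size. A minimal Python sketch of chunking that keeps the tail (the function name and sizes are hypothetical):

```python
def chunk_rows(rows, chunk_size):
    """Yield consecutive slices of `rows`, including a final partial chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # range() steps through every start offset, so the last slice may be
    # shorter than chunk_size but is never skipped.
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


rows = list(range(7))
chunks = list(chunk_rows(rows, 3))
print(chunks)
```

A quick sanity check for any chunked importer is that the chunk lengths sum to the number of input rows.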

jhsolomon commented 8 years ago

I tested Brill: Online Journals: test 2 and it showed a partial ingest. When I looked at the package in Refine, the title that was not ingested was Islamic Africa. I checked in the CRED and there were two title records for Islamic Africa. I deleted one title record and then re-ingested the package.

The second time, the file still ingested partially due to the same title. I checked the ISSN and eISSN in the file against the ISSN and eISSN in the CRED. The ISSN in the file was actually the ISSN for a previous title, Sudanic Africa (Norway) (0803-0685). I changed the ISSN in the file to the correct one (2333-262X), which was already in the CRED, and re-ingested a third time.

This time the package ingested completely.