ualbertalib / discovery

Discovery is the University of Alberta Libraries' catalogue interface, built using Blacklight
http://search.library.ualberta.ca

Deduplicate ejournals and databases. #768

Open ghost opened 8 years ago

ghost commented 8 years ago

Plan for ejournals: Jim will split the Symphony extract into three files: serials with an object ID in the 940, serials with nothing or "NO EJOURNAL HOLDINGS" in the 940, and everything else. Then I can merge the serials with a 940 with SFX, load the other serials directly into the journals index, and load the rest of the Symphony extract into Books and more.
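A minimal sketch of the three-way split described above, for illustration only. It assumes each record has already been reduced to a dict with a hypothetical `is_serial` flag and a hypothetical `field_940` string, and it treats any non-empty 940 value other than "NO EJOURNAL HOLDINGS" as an object ID. The bucket names are placeholders, not the actual file names.

```python
from collections import defaultdict

NO_HOLDINGS = "NO EJOURNAL HOLDINGS"


def bucket_for(record):
    """Return the (hypothetical) output bucket a record belongs in."""
    if not record.get("is_serial"):
        return "everything_else"
    value = (record.get("field_940") or "").strip()
    # An empty 940 or the literal marker means there is no object ID to merge on.
    if not value or value.upper() == NO_HOLDINGS:
        return "serials_without_object_id"
    # Assumption: anything else in the 940 is an SFX object ID.
    return "serials_with_object_id"


def split_extract(records):
    """Group records into the three buckets, preserving input order."""
    buckets = defaultdict(list)
    for record in records:
        buckets[bucket_for(record)].append(record)
    return buckets
```

Records in `serials_with_object_id` would then be merged with SFX, the other serials loaded directly into the journals index, and everything else loaded into Books and more.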

Plan for databases: we can match on the 856 URL, and since there are only 1140 databases, we can probably do this on ingest by checking the sirsi web services. Bib services will likely look into other ways we can identify databases, but this should work for me for now.
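A rough sketch of matching on the 856 URL. The normalization rules (ignore the scheme, lowercase the host, drop a trailing slash) are assumptions made for illustration, and the list of database URLs is passed in directly rather than fetched from the Sirsi web services, which are not modeled here.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to host + path + query so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def is_known_database(record_urls, database_urls):
    """True if any of the record's 856 URLs matches a known database URL."""
    known = {normalize_url(u) for u in database_urls}
    return any(normalize_url(u) in known for u in record_urls)
```

With only about 1140 databases, building the set of known URLs once per ingest run should be cheap.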

ghost commented 8 years ago

The issue with this one is not being able to provide a NOT facet value. You can do this in Solr directly, but I can't figure out how you can do this in the BlacklightController....
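For reference, the Solr side of this can be sketched as a negative filter query, where a leading `-` excludes matching documents. The helper name below is hypothetical and the `format` field name is taken from this thread. It only builds the request parameters and does not show how to wire them into the Blacklight controller, which is the open question.

```python
from urllib.parse import urlencode


def solr_params_excluding(query, field, value):
    """Build Solr params whose filter query excludes field:value."""
    # Escape backslashes and double quotes so the value stays a single phrase.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return {"q": query, "fq": f'-{field}:"{escaped}"'}


params = solr_params_excluding("*:*", "format", "Serial")
query_string = urlencode(params)  # ready to append to a /select URL
```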

ghost commented 8 years ago

Another option would be to reject any records that are format:Serial during the Symphony ingest process. But then how to get those rejected indexed into the Journals format? The simplest solution, I think, would be to have Jim produce an everything-but-serials file and a serials file....

sjshores commented 8 years ago

I like your last suggestion. We should think about whether it's a serials file or more precisely a journals file. I prefer that we try to isolate the true journals and leave other kinds of serials in the Books, Media and More pane.

Scott, what do you think?

Sandra

On 2 December 2015 at 13:34, redlibrarian notifications@github.com wrote:

Another option would be to reject any records that are format:Serial during the Symphony ingest process. But then how to get those rejected indexed into the Journals format? The simplest solution, I think, would be to have Jim produce an everything-but-serials file and a serials file....


ghost commented 8 years ago

@theLinkResolver has brought up the Serials/Journal ambiguity, and is working on a document to explain it to the public services members.

The only problem I can see with producing two files is that we're moving away from "just using what we produce for Ebsco". That's not a problem, necessarily, but will require some thought. Changing this would, however, also help the sysadmins as we could break the single extract up into multiple files for more efficient ingestion. / @sjshores

Not sure if we can get this done by launch, though.

theLinkResolver commented 8 years ago

@redlibrarian would Jim be looking at the Serial format in terms of the Symphony record format (which would include integrating resources - not good) or in the same way that the format facet is applied in Blacklight (looking at leader 06-07)? I'd prefer the latter.

@sjshores We can do this to some degree... Looking just at the Symphony records, if leader 06=a and leader 07=s (i.e. records that are faceted as Serials in Blacklight), then we can look at 008 position 21 to differentiate between: monographic series (m), newspaper (n), periodical (p), and annual or less frequent (blank). That's as close as we can get.
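The check described above could be sketched roughly as follows. It assumes the leader and the 008 field are available as plain strings with zero-based positions, and the labels simply repeat the wording of this comment rather than any official code list. The function name is hypothetical.

```python
# Labels follow the description in this thread, keyed by 008 position 21.
SERIAL_TYPES = {
    "m": "monographic series",
    "n": "newspaper",
    "p": "periodical",
    " ": "annual or less frequent",
}


def serial_type(leader, field_008):
    """Return a serial subtype label, or None if the record is not a Serial."""
    # Leader 06 = 'a' and leader 07 = 's' is what Blacklight facets as Serial.
    if len(leader) < 8 or leader[6] != "a" or leader[7] != "s":
        return None
    code = field_008[21:22]  # empty string if the 008 is too short
    return SERIAL_TYPES.get(code, "other")
```

Anything with an unexpected or missing code at position 21 falls through to "other" rather than being guessed at.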

SFX records seem to use a different scheme. I don't have access to the XML so I don't know exactly how it works. I can give @redlibrarian some object examples to pull for me to examine if we want to go down this road, but my sense of it is that this deeper faceting may not be possible to do in a consistent way for the aggregated SFX and Symphony records. SFX calls all kinds of things "journal", like annual reports, law reporters, and so on.

ghost commented 8 years ago

I think we've done this...

theLinkResolver commented 8 years ago

@redlibrarian Symphony print + electronic journals are still in the Books pane. I think they should stay that way until the de-duplication issue is resolved in the Journals pane.

ghost commented 8 years ago

OK.

ghost commented 8 years ago

See #900

pgwillia commented 5 years ago

Is this still relevant now that there is no bento?

seanluyk commented 5 years ago

@pgwillia still relevant, but not for the same reason. We'd still like to deduplicate based on ISSN, and we'd prefer to keep the catalog record instead of the SFX record as it's more complete. Same story with databases, but with different logic. The issue is less severe without the Bento design.
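A minimal sketch of that ISSN rule, assuming each record is a dict with hypothetical `issn` and `source` keys, where `source` is either "catalog" or "sfx". Records without an ISSN are passed through untouched, which is an assumption rather than a decision made in this thread.

```python
def normalize_issn(issn):
    """Strip the hyphen and whitespace so 1234-5678 and 12345678 compare equal."""
    return issn.replace("-", "").strip().upper()


def deduplicate_by_issn(records):
    """Keep one record per ISSN, preferring the catalog record over SFX."""
    kept = {}
    no_issn = []
    for record in records:
        issn = normalize_issn(record.get("issn") or "")
        if not issn:
            no_issn.append(record)  # nothing to match on, so keep it
            continue
        current = kept.get(issn)
        # Replace an SFX record only when a catalog record turns up.
        if current is None or (
            current["source"] != "catalog" and record["source"] == "catalog"
        ):
            kept[issn] = record
    return list(kept.values()) + no_issn
```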