ualbertalib / discovery

Discovery is the University of Alberta Libraries' catalogue interface, built using Blacklight
http://search.library.ualberta.ca

[Spike] learning about ingest #1190

Closed: pgwillia closed this issue 5 years ago

pgwillia commented 6 years ago

@piyapongch Here's some information to get you started.

I'm not completely familiar with the ingest process either. It seems like there is a bunch of stale code in the project. We can probably both work to improve the documentation.

pgwillia commented 6 years ago

It seems like all the actual work is done in the SolrMarc dir.

pgwillia commented 6 years ago

The Solr config is in a separate project. The Solr schema is a mix of static and dynamic fields.

pgwillia commented 6 years ago

You can probably query the production-like solrcloud-test instance from here.

pgwillia commented 6 years ago

This is the output from what looks like an ingest that completed without issue: bundle exec rake ingest[symphony_test_set] (log attached as ingest.log).

pgwillia commented 6 years ago

@redlibrarian you might have some other resources which will help us make sense of the ingest work. I get the sense that there is a bunch of stale code here. Is that the case?

ghost commented 6 years ago

@pgwillia I'm not sure what kind of resources you're looking for - resources on MARC, SolrMarc, or non-MARC ingest? Let me know and I can see what I can find, though tbh all of what I figured out was from hunting for things (i.e. there was never a set of canonical resources).

Other than a couple of BeanShell scripts in index_scripts the code I wrote for ingest is in lib/ingest and lib/tasks. Most of that code is still being actively used, as far as I know (so not stale in that sense). In lib/ingest, there are some proof of concept mappings like https://github.com/ualbertalib/discovery/blob/master/lib/ingest/peel_mods_om.rb and https://github.com/ualbertalib/discovery/blob/master/lib/ingest/promoted_services_om.rb so they're stale in the sense that they aren't currently used in production, but we will still likely need to make use of them at some point.
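
For orientation, a quick listing of those locations (directory names are taken from the comment above; contents will obviously vary by branch):

    # Where the ingest-related Ruby code lives, per the comment above.
    ls lib/ingest lib/tasks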

pgwillia commented 6 years ago

@redlibrarian We were looking at index.properties and weren't sure how to understand and write this type of content. In the following example, what does first refer to? How about custom? FullRecordAsXML?

id = 001:090a, first
marc_display = FullRecordAsXML
text = custom, getAllSearchableFields(100, 900)

ghost commented 6 years ago

@pgwillia These might help: https://github.com/solrmarc/solrmarc/wiki, https://github.com/solrmarc/solrmarc/wiki/Index-Specification-File

For those specific examples, for id, the index properties file specifies that it should look in the MARC 001 and 090 (subfield a) fields for a value, then take the first one it finds. That particular index.properties is the default one that came with Blacklight. The ones we are using for ingest are symphony_index.properties and sfx_index.properties for Symphony and SFX respectively (there's one in there for the Kule Folklore records but we aren't ingesting those yet). If you look in those two files, you'll see that the id field has been split: for Symphony records it looks in the 001 and for SFX fields it looks in the 090a, because our cataloguing practices specify that that's where the ID is held in records from those systems.
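
A quick way to compare the two specs side by side (the directory is assumed from the CONFIG_PATH default shown later in this thread, and the exact lines in those files may differ from this sketch):

    # Show how each source's index properties file defines its id field.
    grep -n '^id' config/SolrMarc/symphony_index.properties config/SolrMarc/sfx_index.properties
    # Expected shape (a sketch, not the literal file contents):
    #   symphony_index.properties:  id = 001, first
    #   sfx_index.properties:       id = 090a, first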

marc_display holds a representation of the full MARC record in XML, so there's a function called FullRecordAsXML which must return an XML representation of the MARC record. I don't know where that function lives.

custom means that it uses a custom SolrMarc function rather than any of the default field mapping. getAllSearchableFields is one of the predefined custom functions.

A clearer example of custom mapping that I've done is institution_tesim, which specifies a BeanShell script (institution.bsh) and then calls a particular mapping function within the script (in this case getInstitutions()).
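
For reference, the general shape of that kind of line (a hedged sketch of SolrMarc's script syntax, not the literal contents of symphony_index.properties):

    # Sketch: institution_tesim mapped via a BeanShell script that defines getInstitutions().
    #   institution_tesim = script(institution.bsh), getInstitutions
    # The script itself should turn up next to the SolrMarc config:
    find config/SolrMarc -name institution.bsh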

pgwillia commented 6 years ago

I discovered that the SolrMarc.jar that's used (for me) is located at ~/.rvm/gems/ruby-2.1.5@discovery/gems/blacklight-marc-5.10.0/lib/SolrMarc.jar. I found it while trying to track down how log4j is configured, so that I could stop it from depositing logs in that gem directory and spewing everything to STDOUT. I concluded that I couldn't do that without building my own SolrMarc.jar, because the log4j.properties that writes to STDOUT is embedded in the jar file. Hopefully that conclusion is wrong, but I've already spent too much time trying to track it down.
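
One way to double-check that conclusion without rebuilding anything (assuming the log4j config really is bundled at the root of the jar):

    # List the jar contents and look for a bundled log4j config...
    unzip -l ~/.rvm/gems/ruby-2.1.5@discovery/gems/blacklight-marc-5.10.0/lib/SolrMarc.jar | grep -i log4j
    # ...and, if it's there, dump it to STDOUT to see what it does.
    unzip -p ~/.rvm/gems/ruby-2.1.5@discovery/gems/blacklight-marc-5.10.0/lib/SolrMarc.jar log4j.properties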

It looks like it was built Oct 14 2011, which appears to correspond to release 2.3.1.

I've attached the output of RAILS_ENV=ingest bundle exec rake --trace ingest[symphony] &>ingest.log, run from this branch.

pgwillia commented 6 years ago

On UAT we talked about

    cd discovery/ # this is the directory for the discovery project; there are also others for avalon, jupiter and dmp
    docker-compose ps # this shows the running docker containers for the compose file in the current directory
    docker ps # this shows all of the running docker containers on the server
    less docker-compose.yml # this is the docker-compose file that describes the components of the application
    docker exec -it discovery_web_1 /bin/bash # this attaches a bash shell to the running web container so you can run commands as if you had logged into the application server

And in the web container we looked at

    echo $RAILS_ENV # running uat
    bundle exec rake ingest['symphony'] # this is the ingest task that is run by cron and Neil; it expects data in a data/sample.mrc file, which Neil would have copied from Jim's export task
    bundle exec rake ingest['symphony_test_set'] # this is an ingest task you can run yourself because it uses data in the fixtures directory
    less config/logger.yml # this is where logging for the rake tasks is configured
    cat /var/log/blacklight_ingest.log
    bundle show blacklight-marc # this is where the blacklight-marc gem is located
    cat /usr/local/rvm/gems/ruby-2.1.5/gems/blacklight-marc-5.10.0/lib/solrmarc.log # same output that is spewed to STDOUT
    curl "http://solr:8983/solr/discovery/select?indent=on&q=*:*&qt=standard&wt=json" # this is how you can query the solr instance that ingest sends data to
    bundle exec rake solr:marc:index:info # this prints some info about how SolrMarc.jar is invoked; the MARC_FILE and CONFIG_PATH environment variables are set by the rake ingest task

pgwillia commented 6 years ago

$ bundle exec rake solr:marc:index:info
  Solr to write to is taken from current environment in config/solr.yml,
  key :replicate_master_url is supported, taking precedence over :url
  for where to write to.

  Possible environment variables, with settings as invoked. You can set these
  variables on the command line, eg:
        rake solr:marc:index MARC_FILE=/some/file.mrc

  MARC_FILE: [marc records path needed]

  CONFIG_PATH: /home/pjenkins/Code/discovery/config/SolrMarc/config.properties
     Defaults to RAILS_ROOT/config/SolrMarc/config(-RAILS_ENV).properties
     or else RAILS_ROOT/vendor/plugins/blacklight/SolrMarc/config ...

     Note that SolrMarc search path includes directory of config_path,
     so translation_maps and index_scripts dirs will be found there.

  SOLRMARC_JAR_PATH: /home/pjenkins/.rvm/gems/ruby-2.1.5/gems/blacklight-marc-5.10.0/lib/SolrMarc.jar

  SOLRMARC_MEM_ARGS: -Xmx512m

  SolrMarc command that will be run:

  java -Xmx512m  -Dsolr.hosturl=http://localhost:8983/solr/discovery  -jar /home/pjenkins/.rvm/gems/ruby-2.1.5/gems/blacklight-marc-5.10.0/lib/SolrMarc.jar /home/pjenkins/Code/discovery/config/SolrMarc/config.properties

seanluyk commented 6 years ago

Some more information from Jim on what gets excluded from incremental/nightly ingests:

Items that do not have these locations (as of 2018-09-04): IN_PROCESS, UNKNOWN, MISSING, LOST, DISCARD, LOST-PAID, LONGOVRDUE, CANC_ORDER, ON_THE_FLY, LOST-ASSUM, LOST-CLAIM, INSHIPPING, STOR_DARK, STOR_RCRF

Titles that do not have a title in tag 245

Exclude: Shadowed Titles, Shadowed Call Numbers, Shadowed Items

I've got the actual scripts too, if anyone is interested.

seanluyk commented 6 years ago

Adding some more notes from Neil:

As designed, there are 3 separate sub-collections: sfx, databases, and symphony. Each one has a daily cronjob, meant to pull updates off the original source and keep the Solr index up to date. For sfx & databases, relatively simple cronjobs are run on york, one of the application servers.

For symphony, the daily incremental changes are pulled by Jim's script on ualapp, also run from cron, using the Symphony API, with the MARC-formatted records collected in a file. In the same script, the file is copied to york and the ingest job is started; the file is then also copied out to EBSCO, which they ingest later into their system. This script is the result of much hard work by Jim, and rather than re-invent a very clever wheel, Sam and I decided to merely add a few lines to it.

If you'll refer to the original ticket, there were accidentally two copies of the crontab for databases. Ansible builds one in the 'sirsi' account, but it stopped working, and was actually removing 100,000 journals from Solr every night and failing to replace them. I commented that one out. I manually built a replacement in the 'root' account, and it worked fine... but when I used the playbook recently, it resurrected the copy in the 'sirsi' account, so we accidentally had two copies. The copy in 'sirsi' ran at 10:00pm and removed the records. The copy in 'root' ran at 10:30pm and replaced the records. In between, Nagios noticed the records were missing and alerted Wendy, who was on call.

Steps to resolution: I removed the entry from the 'sirsi' crontab manually, and I modified the playbook to create the crontab in the 'root' account. This leaves us with a dangling thread: when I run the playbook next time, it will create a duplicate entry for databases, at 10:00pm, in the 'root' crontab. I've left the ticket open to remind myself to remove the one I manually created for 10:30pm. If I forget, it won't cause any problems; it's just weird, and not optimal.

We could immediately remedy this situation by running the playbook, but that would cause a brief, unnecessary application outage... it makes more sense to just wait for Natasha's next tagged update. Or, I suppose, I could take york out of the haproxy pool, run the playbook limited to york, put york back into the pool afterwards (and remove the 10:30pm entry), and close the ticket.

If you're adding more sub-collections that are appropriate for nightly update, we might make use of the existing 'ingest' rake task framework, and merely add another crontab. Or not! It's highly dependent on the collection. But recall that each month, I build a new Solr collection from scratch, so I need a way to ingest each of the sub-collections into the new index.
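
For anyone following along, each of those nightly jobs boils down to a crontab entry per sub-collection. A hedged sketch only, since the real entries are generated by Ansible and the exact command, path, and environment will differ:

    # Hypothetical crontab entry for the nightly databases ingest at 10:00pm
    # (time taken from the comment above; path and command are illustrative only).
    0 22 * * * cd /path/to/discovery && bundle exec rake ingest['databases'] >> /var/log/blacklight_ingest.log 2>&1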

theLinkResolver commented 6 years ago

@seanluyk Well that would explain why you're getting the records for STOR_RCRF items, which are supposed to be shadowed (this is the 9-month backlog of items we've been processing for RCRF during the move).

seanluyk commented 6 years ago

@theLinkResolver - looks like that needs to be added to Jim's exclusions list, good catch!

seanluyk commented 6 years ago

@theLinkResolver would you happen to have some sample records of items in the STOR_RCRF location appearing in Blacklight that you could share? Jim has now updated his script, so I want to see if it gets fixed tonight.

theLinkResolver commented 6 years ago

@seanluyk Anytime! https://www.library.ualberta.ca/catalog/8223453 https://www.library.ualberta.ca/catalog/8299093 https://www.library.ualberta.ca/catalog/8140473

seanluyk commented 6 years ago

Awesome, thanks @theLinkResolver. While you're at it, any lost-assum items? AFAIK the examples in #890 no longer apply.

theLinkResolver commented 6 years ago

@seanluyk Not at this time. There are a few floating around but none suitable for a test.

BUT I did find this: https://www.library.ualberta.ca/symphony?utf8=%3F&q=1798774

List view indicates there is a BSJ copy, but when you go into the record, the BSJ is (rightly) not in the item table, because it is currently in LOST-ASSUM. It may be desirable to have the presence/absence of locations in list view match those that display in the record view. At some point. :)

seanluyk commented 6 years ago

Interesting find, @theLinkResolver. I've been learning that locations (and other holdings info) are drawn from different places and configured in different files, so I wonder if this is being grabbed from the bib record? I'll add this info to the issue that documents the problem.

theLinkResolver commented 6 years ago

@seanluyk Could well be. List view might be using the numeric values, which come from the 596.
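
One way to see what actually got indexed for that record (a sketch reusing the curl pattern from earlier in this thread, run from inside the web container; treating 1798774 as the Solr document id is an assumption, and the field list may need adjusting):

    # Pull back just the id and the stored MARC for the record in question.
    curl "http://solr:8983/solr/discovery/select?q=id:1798774&fl=id,marc_display&wt=json&indent=on"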

seanluyk commented 5 years ago

Adding some more info about how records we don't want a user to see are handled in EBSCO discovery:

The 000, or Leader, has a character in the sixth position (05) that for our purposes is either n=new record, c=changed record, or d=deleted record. If I understand correctly, EDS makes use of this. So if a record goes from being visible to being invisible for whatever reason (e.g. its last item gets put into a shadowed location), Jim identifies those and codes the 000/05 as "d". So the record is supplied to EDS but based on their ingest rules, they see the "d" and process that particular record as a deletion. I think this works because the extract happens at midnight or whatever and e-resource discards aren't processed until sometime after 8 AM. So this ensures they get extracted as deletes before they disappear outright. (Other discards get removed once a month)
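
A quick way to spot which records in an export file have been flagged this way (a sketch, assuming the ruby-marc gem that blacklight-marc pulls in, and using the data/sample.mrc path mentioned earlier in this thread):

    # Print the 001 of every record whose Leader position 05 is 'd' (deleted).
    ruby -r marc -e 'MARC::Reader.new("data/sample.mrc").each { |r| puts r["001"].to_s if r.leader[5] == "d" }'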

weiweishi commented 5 years ago

This is a great resource for understanding ingest in Discovery. Note to self: capture this in the Discovery documentation on di_internal.