ualbertalib / discovery

Discovery is the University of Alberta Libraries' catalogue interface, built using Blacklight
http://search.library.ualberta.ca

how else can we validate 6M records? #1244

Open pgwillia opened 6 years ago

pgwillia commented 6 years ago

Some automation exists, but can it be improved?

There's part of an Ansible playbook:

[nmacgreg@its004nm2 solrcloud_collections]$ time ansible-playbook -i ../../inventory-prod/solrcloud_collections solrcloud_collections.yml --extra-vars "action=validate new_collection=$NEW_COLLECTION"
...snip...
TASK [solrcloud_collections : Compare the number of records in discovery-prod-2018-08-13] ***
ok: [solr-prod1 -> localhost] => {
"failed": false,
"failed_when_result": false,
"msg": "There are 6105291 records, and that's ok"
}
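
For illustration only, the same kind of count check could be run directly against Solr outside of Ansible. This is a minimal sketch, not the playbook's implementation; the Solr host, port, and expected minimum below are placeholders:

# count_check.py -- minimal sketch of a record-count sanity check against a Solr collection.
# SOLR_URL and EXPECTED_MINIMUM are placeholders, not real configuration.
import json
import sys
import urllib.request

SOLR_URL = "http://solr-prod1:8983/solr"      # hypothetical host:port
COLLECTION = "discovery-prod-2018-08-13"      # the collection named in the playbook output above
EXPECTED_MINIMUM = 6_000_000                  # rough floor; adjust per extract

# rows=0 returns only the hit count, which is all we need here
url = f"{SOLR_URL}/{COLLECTION}/select?q=*:*&rows=0&wt=json"
with urllib.request.urlopen(url) as resp:
    count = json.load(resp)["response"]["numFound"]

if count < EXPECTED_MINIMUM:
    sys.exit(f"Only {count} records in {COLLECTION}; expected at least {EXPECTED_MINIMUM}")
print(f"There are {count} records, and that's ok")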

There are automated tests run against search-test from cardiff which have these expectations (a rough sketch of such a check follows the list of URLs):

https://search-test.library.ualberta.ca/symphony?q=shakespeare  ( 15000 )
https://search-test.library.ualberta.ca/databases?q=shakespeare ( 10 ) 
https://search-test.library.ualberta.ca/journals?q=shakespeare  ( 10 )
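
Here's what a standalone version of that check could look like. It assumes the Blacklight search pages can return JSON (via format=json), that the response carries a total count at meta.pages.total_count, and that the numbers above are minimums; all three are assumptions, not taken from the actual test suite on cardiff:

# search_test_check.py -- sketch of a smoke test for minimum result counts on search-test.
# The format=json parameter and the meta.pages.total_count path are assumptions, not
# verified against the application; the thresholds come from the expectations above.
import json
import urllib.request

EXPECTATIONS = {
    "https://search-test.library.ualberta.ca/symphony?q=shakespeare": 15000,
    "https://search-test.library.ualberta.ca/databases?q=shakespeare": 10,
    "https://search-test.library.ualberta.ca/journals?q=shakespeare": 10,
}

for url, minimum in EXPECTATIONS.items():
    with urllib.request.urlopen(url + "&format=json") as resp:
        data = json.load(resp)
    total = data["meta"]["pages"]["total_count"]   # assumed response shape
    status = "ok" if total >= minimum else "TOO FEW"
    print(f"{url}: {total} results (expected >= {minimum}) -- {status}")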

And third, RE:

"...a search I do "global"... the top 10 records it brings back.  The often appear in different order but the basic records are usually the same "

It would be trivial to compose an additional test that performs the same search and compares the top {10|100|1000|all} results against a set of expected titles, using Solr directly or through the Discovery web interface, or that compares the old index against the new index. It would be easy to stack more "interesting" searches into this.
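
As a sketch of that idea against Solr directly, the same query could be run against the old and new collections and the top-N titles compared. The host, the collection names, and the title_display field are placeholders here, not the real schema:

# top_n_compare.py -- sketch: run the same searches against the old and new Solr
# collections and compare the top-N titles. Host, collection names, and the
# title_display field are placeholders.
import json
import urllib.parse
import urllib.request

SOLR_URL = "http://solr-prod1:8983/solr"                              # hypothetical
OLD, NEW = "discovery-prod-2018-07-16", "discovery-prod-2018-08-13"   # example names

def top_titles(collection, query, n=10):
    params = urllib.parse.urlencode(
        {"q": query, "rows": n, "fl": "id,title_display", "wt": "json"})
    with urllib.request.urlopen(f"{SOLR_URL}/{collection}/select?{params}") as resp:
        docs = json.load(resp)["response"]["docs"]
    return [str(d.get("title_display")) for d in docs]

for query in ["shakespeare", "global warming"]:   # easy to stack more searches here
    old_titles, new_titles = top_titles(OLD, query), top_titles(NEW, query)
    # Ordering shuffles a bit between extracts, so compare as sets of titles
    missing = set(old_titles) - set(new_titles)
    if missing:
        print(f"{query!r}: titles missing from the new index: {missing}")
    else:
        print(f"{query!r}: top {len(old_titles)} titles all present in the new index")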

NEW IDEAS WELCOME: how else can we validate 6M records?

In context, both the previous month's collection and the new collection are typically online together for at least 24 hours, so one idea would be to pull random records from one index and report on whether they're found in the other index. Since Solr is so fast, we could probably compare a large percentage of the index in short order.
What's available in the Symphony WebServices API?  Could we programmatically pull random records from Solr & compare against "the original source", WS? And vice-versa?

From @nmacgreg's email [New Solr Collection: August EBSCO extract]
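
A minimal sketch of the random-sampling idea above, pulling a sample from last month's collection and checking each id against the new one. It assumes the schema still defines Solr's stock random_* dynamic field (solr.RandomSortField, shipped in the default configs); the host and collection names are placeholders. The Symphony Web Services side is left out, since that API's shape isn't established here:

# random_sample_check.py -- sketch: sample random ids from the old collection and
# report whether each one exists in the new collection. Assumes the stock random_*
# dynamic field is defined in the schema; host and collection names are placeholders.
import json
import random
import urllib.parse
import urllib.request

SOLR_URL = "http://solr-prod1:8983/solr"                              # hypothetical
OLD, NEW = "discovery-prod-2018-07-16", "discovery-prod-2018-08-13"   # example names
SAMPLE_SIZE = 1000

def solr_select(collection, **params):
    params.setdefault("wt", "json")
    qs = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{SOLR_URL}/{collection}/select?{qs}") as resp:
        return json.load(resp)["response"]

# A random seed in the sort field name gives a different sample each run
seed = random.randint(0, 10**6)
sample = solr_select(OLD, q="*:*", rows=SAMPLE_SIZE, fl="id",
                     sort=f"random_{seed} asc")["docs"]

missing = [doc["id"] for doc in sample
           if solr_select(NEW, q=f'id:"{doc["id"]}"', rows=0)["numFound"] == 0]

print(f"{len(missing)} of {SAMPLE_SIZE} sampled records missing from {NEW}: {missing[:10]}")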

pgwillia commented 6 years ago

I raised this with the Sirsi team and we came up with one possible scenario. We have a set of records in this data set relating to our HathiTrust commitment: basically, print books that will never be discarded. Therefore, if we can zero in on this set of records, the number should be the same every month or show a small increase as we add to this collection. It should never decrease.

The raw MARC looks like this:

583: : committed to retain|c20170930|d20421231|fHathiTrust|uhttps://www.hathitrust.org/shared_print_program|5AEU|zHathiTrust Shared Print commitment 2017

Interesting thought, but looking at the ingest mapping file (symphony_ingest.properties) I don't think that field is in the index.
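
If a field for that 583 commitment note were added to the ingest mapping, the monthly check itself would be simple: count the documents where the field is present and verify the count never decreases. A sketch, assuming a hypothetical hathitrust_commitment field (which, per the above, is not in the index today) and a recorded count from the previous month:

# hathitrust_count_check.py -- sketch of the "never decreases" check. The
# hathitrust_commitment field is hypothetical (not currently indexed), and
# PREVIOUS_COUNT / SOLR_URL / collection name are placeholders.
import json
import urllib.parse
import urllib.request

SOLR_URL = "http://solr-prod1:8983/solr"      # hypothetical
NEW = "discovery-prod-2018-08-13"             # collection under validation
PREVIOUS_COUNT = 123456                       # placeholder: last month's recorded count

params = urllib.parse.urlencode({
    "q": "*:*",
    "fq": "hathitrust_commitment:[* TO *]",   # presence check on the hypothetical field
    "rows": 0,
    "wt": "json",
})
with urllib.request.urlopen(f"{SOLR_URL}/{NEW}/select?{params}") as resp:
    count = json.load(resp)["response"]["numFound"]

if count < PREVIOUS_COUNT:
    print(f"HathiTrust commitment records dropped: {count} < {PREVIOUS_COUNT}")
else:
    print(f"{count} HathiTrust commitment records (was {PREVIOUS_COUNT}); ok")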