psu-libraries / psulib_traject

Penn State University Libraries' Blacklight Catalog Traject Indexer
Apache License 2.0

Full extracts from Sirsi #25

Closed banukutlu closed 5 years ago

banukutlu commented 6 years ago

In GitLab by @bzk60 on Aug 24, 2018, 09:44

Required by #4 Required by #5

banukutlu commented 6 years ago

In GitLab by @bzk60 on Aug 27, 2018, 09:55

https://staff.libraries.psu.edu/cataloging-and-metadata-services/about-us/teams-and-groups/cataloging-expert-team/tag-lists

Per Ruth

The "dump junk tag list" is one we need to be sure isn't being applied when we export MARC records for the BlackCat. The 59x and 9xx fields are both local control tags, which we may want to display (I'll double-check that none are internal). The other tags we definitely want to consider for indexing. Some, like the 090, we won't index; others, like the 541, we will.
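For context, a decision like "index the 541" eventually becomes a traject rule in this repo's indexing configuration. A hypothetical config fragment is sketched below; the Solr field names are illustrative, not the project's actual schema (only `extract_marc` is traject's standard MARC extraction macro):

```ruby
# Hypothetical traject config fragment; Solr field names are illustrative.
# extract_marc is traject's standard MARC-extraction macro.
to_field "acquisition_note_ssim", extract_marc("541abcdf")

# Local control tags such as 59x could also be surfaced for display:
to_field "local_note_ssim", extract_marc("590a:591a")
```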

banukutlu commented 6 years ago

In GitLab by @mxk128 on Aug 30, 2018, 15:56

Did a full extract, and it came back with 7 errors out of 7+ million records.

The errors were:

- holding_tags table exceeded MAXITEMS
- Record exceeds maximum export size

I put in a ticket with Sirsi asking if there is a way to find out which catkeys resulted in the 7 errors. Their response:

"We were only able to find two of the errored records. They were the records causing the "holding_tags table exceeded MAXITEMS" errors. They are Title Control numbers s0727-4181 with 12637 items and s0196-1497 with 10239 items. You can find these records and update them individually and just ignore the errors every time you run a catalog dump or split these catalog record into smaller catalog records."

So we asked them what they mean by "just ignore". Does this mean those records are included in the dump?

banukutlu commented 6 years ago

In GitLab by @bzk60 on Aug 31, 2018, 12:03

per @tbd3

Have the script working now on the Symphony server to move updates over to the QA server. Next, we need to put scripts on QA to pick up the files and index them into Solr.

banukutlu commented 6 years ago

In GitLab by @bzk60 on Aug 31, 2018, 13:48

https://github.com/sul-dlss/searchworks_traject_indexer/blob/master/bin/index_sirsi.sh

banukutlu commented 6 years ago

In GitLab by @mxk128 on Sep 4, 2018, 11:53

I did a Solr search on the 2 catkeys that Sirsi could retrieve (2168931 and 109596) and they were not in Solr. So it looks like they are not added to the catalog dump.

banukutlu commented 6 years ago

In GitLab by @bzk60 on Sep 4, 2018, 17:12

please see https://github.com/traject/traject/blob/master/doc/batch_execution.md

banukutlu commented 6 years ago

In GitLab by @bzk60 on Sep 4, 2018, 17:25

per @tbd3

Got the scripts finished on Symphony to extract index updates throughout the day, and wrote scripts on QA to index them. Ready to test on QA.
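The QA indexing side of such a delta cycle could look like the sketch below: run traject over the day's adds/updates, then remove the day's deleted catkeys from Solr via its JSON update API. The Solr URL, core name, file names, and the plain-text delete-list format are all assumptions, not the actual QA setup.

```ruby
# Hypothetical sketch of applying one day's deltas.
# Paths, URL, and file formats are assumptions for illustration.
require "net/http"
require "json"
require "uri"

# Build the JSON body for Solr's standard delete-by-id update command.
def solr_delete_payload(ids)
  JSON.generate("delete" => ids)
end

def apply_deltas(updates_mrc, deletes_txt, solr_update_url)
  # Index the adds/updates with traject (command shape taken from this thread).
  system("bundle exec traject -c psulib_config.rb #{updates_mrc}") or raise "traject failed"

  # Then remove deleted catkeys, assumed to be one id per line.
  ids = File.readlines(deletes_txt, chomp: true).reject(&:empty?)
  return if ids.empty?
  uri = URI(solr_update_url)
  req = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
  req.body = solr_delete_payload(ids)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
end

# apply_deltas("updates_20180904.mrc", "deletes_20180904.txt",
#              "http://localhost:8983/solr/blacklight-core/update?commit=true")
```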

banukutlu commented 6 years ago

In GitLab by @rkt6 on Sep 11, 2018, 12:00

Per today's conversation, we already have a checkbox noting to "ensure everything not shadowed is extracted" -- we need to be sure this includes SerialsSolutions records for now.

banukutlu commented 6 years ago

In GitLab by @mxk128 on Sep 12, 2018, 11:33

@tbd3 The script to move edited records to BlackCat (in cron on the Symphony side) is currently going to /dev/null. It needs to be logged so we can check whether the script ran OK and what errors we got. We can discuss where we want the logs to go.

banukutlu commented 6 years ago

In GitLab by @tbd3 on Sep 12, 2018, 13:11

That's just the cron job output going to /dev/null. The log is written by the script that cron calls; it goes to /prodat/psulinkages/blackcat/move_edited_records_to_blackcat_log.date.
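The pattern being described here (cron's own output discarded, while the script keeps its own dated log) can be sketched as follows. The script name, crontab schedule, and log directory are illustrative stand-ins, not the actual Symphony paths:

```ruby
#!/usr/bin/env ruby
# Illustrative sketch of the cron-plus-internal-logging pattern.
# Hypothetical crontab entry, with cron's output discarded:
#   */15 * * * * /opt/extract/move_edited_records.rb > /dev/null 2>&1
require "fileutils"

# Stand-in for the real log directory; overridable for testing.
LOG_DIR = ENV.fetch("LOG_DIR", "/tmp/blackcat_logs")
FileUtils.mkdir_p(LOG_DIR)

# The script appends to its own dated log file, as described above.
LOG_FILE = File.join(LOG_DIR, "move_edited_records_log.#{Time.now.strftime('%Y%m%d')}")

def log(msg)
  File.open(LOG_FILE, "a") { |f| f.puts "#{Time.now.strftime('%H:%M:%S')} #{msg}" }
end

log "extract started"
# ... move MARC deltas and delete lists to the indexing server here ...
log "extract finished"
```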

banukutlu commented 6 years ago

In GitLab by @mxk128 on Sep 12, 2018, 13:35

That dir only has the files with the deletes and the MARC. The log I'm referring to is the error log, like what we get at the top of Sirsi report outputs.

banukutlu commented 6 years ago

In GitLab by @bzk60 on Sep 12, 2018, 13:45

Since we are talking about logs: traject output should also go to the /var/log/traject/ directory, and the traject command to run is:

bundle exec traject -s log.file=/var/log/traject/traject.log -s log.error_file=/var/log/traject/traject_error.log -c psulib_config.rb /home/ansible_deploy_bot/psucat_20180827.mrc

banukutlu commented 6 years ago

In GitLab by @cdm32 on Sep 26, 2018, 13:52

Why do we need to run a "full extract" and index it every day? Why wouldn't we just run updates and deletes?

banukutlu commented 6 years ago

In GitLab by @tbd3 on Sep 26, 2018, 14:08

@cdm32 We will only run the full extract and full index as needed. The deltas will keep our indexes up to date. In a perfect world, we would only run the full once and just run deltas from that point forward. Why did you think we would run the full extract every day?

banukutlu commented 6 years ago

In GitLab by @bzk60 on Sep 26, 2018, 14:19

Yes, we will run the fulls as needed. Just wanted to clarify: the full extract script will not run a full index; we will take care of the full indexing with Ansible.

banukutlu commented 6 years ago

In GitLab by @cdm32 on Sep 26, 2018, 15:39

I guess I misread the title of this issue "Full and Daily extract from Sirsi for Blackcat"

banukutlu commented 6 years ago

In GitLab by @cdm32 on Sep 26, 2018, 15:40

Are "deltas" what you are calling all changes and deletes or just changes? Just want to understand the jargon and make sure I try to use the right terminology.

banukutlu commented 6 years ago

In GitLab by @tbd3 on Sep 26, 2018, 16:20

@cdm32 By deltas, we mean all changes: adds, updates, and deletes.

mkutch commented 5 years ago

@ruthtillman Jeff and Maryam discussed which field to use to export holdings. Originally, the 949 was thought to be best. However, there are millions of 949s in The CAT containing legacy data from the LIAS-to-Sirsi migration (ca. 2001), as well as hundreds of thousands of 949 fields in bibloaded records (an ongoing practice, since the 949 field is used to populate data on the Call Number/Item tab), so the 999 field is probably a better choice.

ruthtillman commented 5 years ago

@mkutch as long as we document the change, that's fine by me.

mkutch commented 5 years ago

I figured out how to edit the raw MARC extract to delete the 949s before sending it over to traject for indexing. This is good, because then we won't be passing legacy data/dupes over.
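The kind of in-place edit described above can be sketched over the raw binary (ISO 2709) MARC records with plain Ruby. This is a stdlib-only illustration of stripping every field with a given tag (rebuilding the directory, base address, and record length), not the actual script used on the Symphony side:

```ruby
# Illustrative sketch: remove all fields with a given tag from one
# binary (ISO 2709) MARC record. Not the production script.
FT = "\x1E".b # field terminator
RT = "\x1D".b # record terminator

def strip_tag(record, tag)
  leader = record[0, 24]
  base   = leader[12, 5].to_i     # base address of data
  data   = record[base..-1]

  # Directory sits between the leader and its terminating FT at base - 1;
  # each 12-byte entry is tag(3) + field length(4) + starting position(5).
  kept = []
  record[24...(base - 1)].scan(/.{12}/m).each do |entry|
    next if entry[0, 3] == tag
    len, start = entry[3, 4].to_i, entry[7, 5].to_i
    kept << [entry[0, 3], data[start, len]] # field bytes include trailing FT
  end

  # Rebuild directory and data region with fresh offsets.
  dir, fields = +"", +""
  kept.each do |t, f|
    dir << format("%s%04d%05d", t, f.bytesize, fields.bytesize)
    fields << f
  end
  body = dir + FT + fields + RT
  out  = leader.dup
  out[0, 5]  = format("%05d", 24 + body.bytesize)    # new record length
  out[12, 5] = format("%05d", 24 + dir.bytesize + 1) # new base address
  out + body
end

# Usage sketch over a whole extract file: split on the record terminator
# and write strip_tag(rec, "949") for each record back out.
```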

banukutlu commented 5 years ago

@mkutch can you please document why we will go with the 999 over the 949 here: https://github.com/psu-libraries/psulib_traject/wiki/Full-and-Incremental-Extracts-from-Symphony

mkutch commented 5 years ago

@ruthtillman I noticed that some records with the following item types, which should be shadowed at the item, call, or bib level, were not being shadowed.

CARRELKEY, EBOOKREADR, EQUIP14DAY, EQUIP24FEE, EQUIP24HR, EQUIP3DAY, EQUIP4HR, EQUIP5DAY, EQUIP7DAY, ILL, LAPTOP, PALCI

Talked with Chris H., and he said that campuses don't always follow the rule of shadowing these kinds of items; see catkey 14082248, for instance. So I was thinking of excluding them from the extract.

ruthtillman commented 5 years ago

@mkutch definitely! Assuming that Chris H. et al. indeed don't want them in the public Cat display, then we should exclude bibs which only have items in those locations from the extract, treat them as additional shadowed locations, or however that's smoothest.

mkutch commented 5 years ago

Chris agrees that these should be shadowed; however, he will talk to others (campuses, special collections, et al.) to get a consensus and will let me know. In the meantime, I have excluded those item types.
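The exclusion rule discussed here was applied in the Sirsi dump itself, but the logic can be illustrated with a small sketch: a bib is kept only if at least one of its items has a non-excluded type. The per-bib item-type representation is an assumption for illustration; the type list is the one from this thread:

```ruby
# Illustrative rule only; the real exclusion was done in the Sirsi dump options.
EXCLUDED_TYPES = %w[
  CARRELKEY EBOOKREADR EQUIP14DAY EQUIP24FEE EQUIP24HR EQUIP3DAY
  EQUIP4HR EQUIP5DAY EQUIP7DAY ILL LAPTOP PALCI
].freeze

# item_types: the item types of every item attached to one bib record.
# Keep the bib only when some item has a type outside the excluded set.
def keep_bib?(item_types)
  item_types.any? { |t| !EXCLUDED_TYPES.include?(t) }
end
```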

banukutlu commented 5 years ago

@mkutch does the new extract you are running use the 999 for holdings? Just trying to figure out if that work is completed. If so, let's mark the sub-task as completed.

mkutch commented 5 years ago

The 949 will be used for the holdings. When you do a dump in Sirsi and specify that the holdings go to the 949 field, it removes any legacy 949s in the bib and puts the current holdings in the 949.

By excluding junk tags in the dump options, they will be removed there as well.

banukutlu commented 5 years ago

We are closing this issue since we have the full extract working. The other required issues will be handled within their own issues.