sul-dlss / libsys-airflow

Airflow DAGS for migrating and managing ILS data into FOLIO along with other LibSys workflows
Apache License 2.0
5 stars 0 forks source link

cleanup orphaned SRS and/or re-link instances to SRS #838

Open shelleydoljack opened 9 months ago

shelleydoljack commented 9 months ago

Probably add to the monthly "run these tasks" DAG (to be coded).

On single record imports, something happens and the instance is source=MARC but the SRS is missing or not connected to the instance anymore. Current solution is to change source=FOLIO for instance and have user overlay source bib to get it right. Document of missing SRS instances: https://docs.google.com/document/d/1_O1_Hx44q7o9Qmm_HvF7DNugAcDzUMZ8GYIEuT2v0zI/edit?usp=sharing

Scripts 2. Scripts for identifying invalid InstanceIds in SRS and 10. Scripts to remove SRS MARC Bib records with no InstanceIds from the wiki Scripts for Inventory, Source Record Storage, and Data Import Cleanup have some solutions that we might want to replicate in a DAG.

ryanmax commented 9 months ago

Below is a MetaDB query to pull instance hrids lacking SRS links. Current count: 448

select
    it.hrid
from
    folio_inventory.instance__t__ it
where
    it."__current"
    and it."source" = 'MARC'
except
select
    rl.external_hrid as hrid
from
    folio_source_record.records_lb__ rl
where
    rl."__current"
    and rl.record_type = 'MARC_BIB'
    and rl.state = 'ACTUAL'
order by
    hrid asc;
shelleydoljack commented 6 months ago

The current issue we are finding is not that the SRS in marc_records_lb or raw_records_lb is missing the 999$i but rather that the records_lb table does not have state=ACTUAL given a distinct external_id or external_hrid (they all are state=OLD and this seems to be the root of the problem with "View Source" greyed out or DI to overlay not working).

shelleydoljack commented 6 months ago
select count(distinct external_hrid) from sul_mod_source_record_storage.records_lb where state = 'ACTUAL';
  count   
----------
 11311539

select count(distinct external_id) from sul_mod_source_record_storage.records_lb where state = 'ACTUAL';
  count   
----------
 11311539

select count(distinct external_hrid) from sul_mod_source_record_storage.records_lb where state = 'OLD';
 count  
--------
 123600

FOLIO inventory facet source=MARC shows 10,349,398. Not sure why we have 1 million more with state = ACTUAL. Maybe the elasticsearch index is off?

shelleydoljack commented 6 months ago
select count(*) from sul_mod_inventory_storage.instance where jsonb ->> 'source' = 'MARC';
  count   
----------
 10349812

So we are potentially missing 414 records?

shelleydoljack commented 6 months ago

Trying a query that would give us instance records with source = MARC but there is no SRS yields 62 results. @ahafele should these have SRS (I see some are on-order but then some look wrong maybe, like 1e5926c8-a110-417f-8c59-c5c1cb393880):

             instance_id              | instance_source | srs_state 
--------------------------------------+-----------------+-----------
 ea09b1e1-b549-43a0-8f65-975d2eec6c74 | MARC            | 
 92090026-18e8-4d04-872f-d97b7b425cd3 | MARC            | 
 b0ab8998-90bc-4353-835b-4376507e0500 | MARC            | 
 ccc7da59-17ac-4934-8b63-4ca8917c5906 | MARC            | 
 992857ff-3776-4fd6-a08e-3fe48ab770bf | MARC            | 
 a274bb3a-62e9-4ee4-9562-1e27943012d8 | MARC            | 
 4bf0cb67-ff8b-4c4b-94d9-f2db96181ae5 | MARC            | 
 01ee2d3d-564e-45a7-a725-ae9e7c49c493 | MARC            | 
 bb9d9610-829b-4cdb-b3fb-f8c252dd3c19 | MARC            | 
 751ca890-5fac-416f-9465-41067f6ed9fb | MARC            | 
 76b7eebd-2b8c-4192-aa6e-1ff9a084ed20 | MARC            | 
 ff8720de-3e30-4871-a6eb-64bab5da24be | MARC            | 
 5e7aad1c-0438-4819-aee1-a053ee66fb7f | MARC            | 
 678316b9-ccb6-40f1-a09f-7235589d4dc6 | MARC            | 
 81ec3137-2a07-4d49-9e06-cb26d0a5f712 | MARC            | 
 8d132390-0b26-4760-955d-6d3865cf2c35 | MARC            | 
 1e5926c8-a110-417f-8c59-c5c1cb393880 | MARC            | 
 c678a358-417f-4914-8ca4-500429778e08 | MARC            | 
 3d32fe56-69bb-4feb-9608-11972bea0961 | MARC            | 
 63996a33-ab64-4357-acb7-8d0a421bcb30 | MARC            | 
 da983b6e-464c-4320-a295-c8133384ba14 | MARC            | 
 f6ee75af-924f-4b4a-b844-8bfdcdd2fd7b | MARC            | 
 8a6b3dbf-8b5a-43ba-8ca3-dc89b42c0796 | MARC            | 
 2e467c2c-860d-4975-812b-b41661c3667b | MARC            | 
 bfeb3491-d969-4f1e-aad9-2ff5e8ca6cbb | MARC            | 
 7a650ce0-79f4-4121-9167-3439852c67eb | MARC            | 
 5cb35b5f-bc4d-45e8-8378-c47840126ac8 | MARC            | 
 22a7e3a9-24a2-43cb-8a9b-32849b6348f2 | MARC            | 
 5924fda2-8340-4d36-b10f-db4e7082f4d9 | MARC            | 
 ba16ca75-01c7-4538-bc64-6f655a34b4e0 | MARC            | 
 ac862dd6-bd2c-4e78-969f-64daace1cfa0 | MARC            | 
 9988eb9d-cf0a-4142-b049-a50bbe4d3b5a | MARC            | 
 550cc36b-d4fd-4b58-8f42-4649b8822f85 | MARC            | 
 5e672256-8553-467d-9905-62fc0a66128f | MARC            | 
 5093c672-d681-4215-9240-04f23ae6285d | MARC            | 
 80baf139-2d33-444a-b954-07e23f75223d | MARC            | 
 ec30ef0f-e75f-423d-a6d6-b1e9f5be70ad | MARC            | 
 410eae34-13e0-4545-98fa-810faf83dffa | MARC            | 
 5723a170-b33e-4944-b483-c74d14e04664 | MARC            | 
 4d46f1bc-0bc5-4600-b3d5-ec8809ddb2ff | MARC            | 
 2dce8a79-7f55-40e9-962c-eaa7bf56cb06 | MARC            | 
 a3e044c5-3508-4a40-8f4b-c08c5595547a | MARC            | 
 f779e02d-01d5-4bc4-b273-cd6c0089f1e2 | MARC            | 
 703bcaff-e8a4-48f9-8cf0-269bdc1c6a4c | MARC            | 
 7cf3a7df-9f48-4979-b7dc-2a628aa4a822 | MARC            | 
 06eccb3d-e7c8-46ae-9abf-ebb0af350881 | MARC            | 
 3dbe71c4-a842-47a4-b0f8-da6d54aad418 | MARC            | 
 29a46e65-98f4-4934-bb8f-863f24a3251d | MARC            | 
 2903e1ef-117c-43b6-9537-0aa2fbdbd822 | MARC            | 
 3f3668fa-cf06-4e68-95c2-6981f1e19db4 | MARC            | 
 9c23ec76-9ee2-4165-bea8-a36ce47885ce | MARC            | 
 ded0b8b3-3ccb-4eda-841f-2dd7f2a07ae0 | MARC            | 
 979a6690-69cd-4a1e-908d-47a31fe25337 | MARC            | 
 1166da9c-77a9-4500-badf-d1cadafb64e1 | MARC            | 
 409ec1b5-f1c0-4d83-a66b-aded85dd3353 | MARC            | 
 11bf23c4-82a8-4466-96cf-993a81310a3c | MARC            | 
 993bd7c3-ac46-49da-bfed-0413ec18f09e | MARC            | 
 4dfcc767-5006-4ca4-9805-c0f46beacfce | MARC            | 
 d437ff7e-f9e9-4bd2-be5d-c23feaedbc66 | MARC            | 
 510fa7ff-24a0-4e0a-b16b-04d65b3f376a | MARC            | 
 7f7cf99b-26b9-45d6-a160-9647d6ab4f42 | MARC            | 
 6cc416ea-1d4a-4bf0-b504-ab9afd34de68 | MARC            | 
shelleydoljack commented 6 months ago

This is the query:

select instance.id as instance_id,
jsonb_extract_path_text(instance.jsonb, 'source') as instance_source,
srs.state as srs_state
from sul_mod_inventory_storage.instance as instance
left join sul_mod_source_record_storage.records_lb as srs on srs.external_id = instance.id
where jsonb_extract_path_text(instance.jsonb, 'source') = 'MARC'
and srs.state is null;
ahafele commented 6 months ago

I spot checked some of these and many are Springer vendor records e.g. the example you give above (1e5926c8-a110-417f-8c59-c5c1cb393880) - see this ticket for related information. We could have someone check all of these but my guess is that they should all have SRS records.

shelleydoljack commented 6 months ago

instances_missing_SRS.json I grabbed the instance json for each of these in case it is helpful.