sul-dlss / FOLIO-Project-Stanford

Task management for Stanford’s analysis of FOLIO.

POD data QA #637

Closed ahafele closed 1 month ago

ahafele commented 2 months ago

I ran the select_pod_records DAG with the following results

ahafele commented 2 months ago

The DAG is failing on the transform step: https://sul-libsys-airflow-dev.stanford.edu/dags/select_pod_records/grid?run_id=manual__2024-05-10T15%3A34%3A09%2B00%3A00&execution_date=2024-05-10+15%3A34%3A09%2B00%3A00&tab=graph&dag_run_id=manual__2024-05-10T15%3A34%3A09%2B00%3A00&task_id=transform_folio_marc_record

I noticed that the log for this error says:

INFO - Writing 20 modified MARC records to /opt/airflow/data-export-files/pod/marc-files/updates/202405081947.mrc

This filename is for 05/08, but I ran this today, 05/10.
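
A quick way to confirm this kind of mismatch is to parse the timestamp out of the file name and compare it with the run date. This is a minimal sketch, assuming the file stem is a `YYYYMMDDHHMM` timestamp (inferred from the file names in this thread). `filename_date` is a hypothetical helper, not part of libsys-airflow.

```python
from datetime import date, datetime
from pathlib import PurePosixPath


def filename_date(path: str) -> date:
    """Return the date encoded in a file name such as 202405081947.mrc.

    Assumes the stem is a YYYYMMDDHHMM timestamp (an inference, not a spec).
    """
    stem = PurePosixPath(path).stem
    return datetime.strptime(stem, "%Y%m%d%H%M").date()


# Path copied from the task log quoted above.
log_path = "/opt/airflow/data-export-files/pod/marc-files/updates/202405081947.mrc"
run_date = date(2024, 5, 10)  # date of the manual run

written_for = filename_date(log_path)
is_stale = written_for != run_date
print(written_for.isoformat(), "stale" if is_stale else "current")
```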

jgreben commented 2 months ago

Fixed by https://github.com/sul-dlss/libsys-airflow/pull/977

ahafele commented 2 months ago

@jgreben @jermnelson I am seeing 999 data now, but the records all seem to be written to the same file. When I run the select_pod DAG, no new file is generated for download, but this file grows each time: 202405071925.mrc. The log for the MARC transform task references this filename.

999 - All subfields are not present but the 999s seem to be duplicated in each record, maybe because the files aren't being written correctly.

example: This record should have 3 999s (1 folio and 2 with holdings/item data) but instead has

=999 ff$i5881758a-0a68-57f0-8a63-f591de7159dc$sb0b6df22-0574-5519-9c3d-1f60f59706a9
=999 \$aG6071 .P2 1951 .J6 NORTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1
=999 \$aG6071 .P2 1951 .J6 SOUTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1
=999 \$aG6071 .P2 1951 .J6 NORTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1
=999 \$aG6071 .P2 1951 .J6 SOUTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1
=999 \$aG6071 .P2 1951 .J6 NORTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1
=999 \$aG6071 .P2 1951 .J6 SOUTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1
=999 \$aG6071 .P2 1951 .J6 NORTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1
=999 \$aG6071 .P2 1951 .J6 SOUTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1
=999 \$aG6071 .P2 1951 .J6 NORTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1
=999 \$aG6071 .P2 1951 .J6 SOUTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1
=999 \$aG6071 .P2 1951 .J6 NORTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1
=999 \$aG6071 .P2 1951 .J6 SOUTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1
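
For QA on larger files, duplicated 999s like these can be counted rather than spotted by eye. The following is a rough sketch that works on the mnemonic text form shown above, one field per line. `duplicated_fields` is a hypothetical helper, and the sample record is rebuilt from the fields quoted in this comment.

```python
from collections import Counter


def duplicated_fields(mnemonic_record: str, tag: str = "999") -> dict[str, int]:
    """Map each repeated field line for `tag` to the number of times it occurs."""
    prefix = f"={tag} "
    lines = [ln.strip() for ln in mnemonic_record.splitlines()]
    counts = Counter(ln for ln in lines if ln.startswith(prefix))
    return {line: n for line, n in counts.items() if n > 1}


# Field text copied from the example record above.
folio = "=999 ff$i5881758a-0a68-57f0-8a63-f591de7159dc$sb0b6df22-0574-5519-9c3d-1f60f59706a9"
north = r"=999 \$aG6071 .P2 1951 .J6 NORTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1"
south = r"=999 \$aG6071 .P2 1951 .J6 SOUTHERN SHEET$hMap$lEAR-MAP-CASES$wLibrary of Congress classification$tbook$eEAR-MAP-CASES$j1"
record = "\n".join([folio] + [north, south] * 6)

for line, n in duplicated_fields(record).items():
    print(n, line[:45])
```

An empty result means every 999 in the record is unique, which is what the expected output (1 FOLIO 999 plus 2 holdings/item 999s) would produce.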

jgreben commented 2 months ago

@ahafele I think what is going on is that all of the marc files in the various vendor folders are getting re-transformed for each new dag run because the transmission tasks are not running and the files are not being archived from the vendor/marc-files directories. I just looked at a couple of files from dags that I triggered just now: west/updates/202405131852.mrc and pod/updates/202405131910.mrc and it looks like there are no duplicated 999s. Only the previous files have duplicates. Can you verify?

ahafele commented 2 months ago

@jgreben I just triggered a POD dag run and had the same result (records written to the 0507 file). Looking at the file you reference - pod/updates/202405131910.mrc - the 999s are looking better but the records are duplicated x3 in the file.

ahafele commented 2 months ago

Also, 999s for records with just a holdings record and no item record are not being generated correctly. The requirement is:

1 new 999 for each holdings/item combo
If no item records, 999 with holdings data only
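
The requirement can be sketched as a small function. This is an illustration only, not the libsys-airflow implementation: `build_999_rows` and the dictionary keys (`call_number`, `location`, `items`, `type`, `copy`) are hypothetical stand-ins for whatever the transform actually reads.

```python
def build_999_rows(holdings: list[dict]) -> list[dict]:
    """Return one row per holdings/item combination.

    A holdings record without items yields a single row carrying
    holdings data only.
    """
    rows = []
    for holding in holdings:
        base = {
            "call_number": holding["call_number"],
            "location": holding["location"],
        }
        items = holding.get("items", [])
        if not items:
            # No item records: emit holdings data only.
            rows.append(base)
            continue
        for item in items:
            # One row for each holdings/item combination.
            rows.append({**base, "item_type": item["type"], "copy": item.get("copy")})
    return rows


# Hypothetical input: one holdings record with two items, one with none.
sample = [
    {
        "call_number": "G6071 .P2 1951 .J6",
        "location": "EAR-MAP-CASES",
        "items": [{"type": "book", "copy": "1"}, {"type": "book", "copy": "1"}],
    },
    {"call_number": "HC285 .C27 2011", "location": "GRE-STACKS", "items": []},
]
rows = build_999_rows(sample)
print(len(rows))
```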

ahafele commented 1 month ago

Ran a test for POD selection and I would have expected to find the following in the update file -

a3929970 - updated date for today - nothing suppressed
a4798645 - updated date for today - has print holdings and no item
a6743952 - holdings suppressed but not item

All three include symphony 999s

jgreben commented 1 month ago

We figured out that the SQL was still not correct after the last change, which looked for the updatedDate from the instance record. We needed to stop looking for a "marc generation of greater than 0" because in these cases the MARC record was not updated at all.

jgreben commented 1 month ago

Now fixed by #1006

ahafele commented 1 month ago

Testing results:

a3929970 - updated date for today - nothing suppressed: now found in file.
a4798645 - updated date for today - has print holdings and no item: still not in file. Electronic records with holdings and no items are included, so I'm not sure why this one is not.
a6743952 - holdings suppressed but not item: still not in file, but I think that is good.

ahafele commented 1 month ago

I did another test of a record with print holdings and no item, but this time with a bound-with relationship, and it works (a6829669), presumably because of the associated principal item record. I'd still like to understand how this is working, but I think it might be OK.

jgreben commented 1 month ago

I am still seeing the updatedDate for those HRIDs as 5/20/2024, and the parameters for the latest POD run were {'from_date': '2024-05-22', 'to_date': '2024-05-23'}. Maybe what you did to update the record somehow didn't take, or did not affect the updatedDate for the instance record?
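
To make the date reasoning concrete, here is a small sketch of the window check. It assumes a half-open window (`from_date` inclusive, `to_date` exclusive); the thread does not say which bounds the DAG's query actually uses, and `in_window` is a hypothetical helper.

```python
from datetime import date, datetime


def in_window(updated: datetime, from_date: str, to_date: str) -> bool:
    """True if `updated` falls on or after from_date and before to_date.

    Half-open bounds are an assumption made for this sketch.
    """
    start = date.fromisoformat(from_date)
    end = date.fromisoformat(to_date)
    return start <= updated.date() < end


params = {"from_date": "2024-05-22", "to_date": "2024-05-23"}

# An instance last updated on 5/20 falls outside the 5/22 to 5/23 window.
stale_update = datetime(2024, 5, 20, 12, 0)
fresh_update = datetime(2024, 5, 22, 9, 30)
print(in_window(stale_update, **params), in_window(fresh_update, **params))
```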

ahafele commented 1 month ago

I changed the instance updatedDate for a4798645 and a6743952 to today, 5/23, and reran the POD selection DAG, and the generated update file is empty.

jgreben commented 1 month ago

This should be fixed now by https://github.com/sul-dlss/libsys-airflow/pull/1016

ahafele commented 1 month ago

Fixes

I changed the instance updatedDate for a4798645 and a6743952 to today, 5/23, and reran the POD selection DAG, and the generated update file is empty.

New problem

When an item is suppressed, the 999 with item info is still included. When an item and its holding are both suppressed, the 999 is still included. This was previously working as expected.

Example a9615278 includes

=999 \$aHC285 .C27 2011 test$hBook$lGRE-STACKS$wLibrary of Congress classification$tportable device 1$eGRE-STACKS
=999 \$aHC285 .C27 2011$hBook$lGRE-STACKS$wLibrary of Congress classification$tbook$eGRE-STACKS$j1

even though the first one is suppressed from discovery.

Example a7946988 includes

=999 \$aHV6773 .C475 2009$hBook$lGRE-STACKS$wLibrary of Congress classification$tbook$eGRE-STACKS$j1

even though both item and holding are suppressed from discovery.
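
The expected behavior, as I read it from the tests in this thread, is that suppressed holdings and suppressed items produce no 999. Here is a hedged sketch of that filter. `visible_999_pairs` is hypothetical, the record shapes are simplified, and the `discoverySuppress` key is assumed to match the flag on the holdings and item records. What should happen when a visible holding has only suppressed items is not settled here; this sketch emits nothing in that case.

```python
from typing import Optional


def visible_999_pairs(holdings: list[dict]) -> list[tuple[str, Optional[str]]]:
    """Return (holdings_id, item_id) pairs that should each produce a 999.

    item_id is None for a holdings record that has no items at all.
    """
    pairs = []
    for holding in holdings:
        if holding.get("discoverySuppress"):
            # Suppressed holdings contribute no 999s, whatever their items say.
            continue
        items = holding.get("items", [])
        if not items:
            pairs.append((holding["id"], None))
            continue
        for item in items:
            if not item.get("discoverySuppress"):
                pairs.append((holding["id"], item["id"]))
    return pairs


# Hypothetical records covering the cases described above.
sample = [
    {"id": "h1", "items": [{"id": "i1", "discoverySuppress": True}, {"id": "i2"}]},
    {"id": "h2", "discoverySuppress": True, "items": [{"id": "i3", "discoverySuppress": True}]},
    {"id": "h3", "items": []},
]
pairs = visible_999_pairs(sample)
print(pairs)
```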

ahafele commented 1 month ago

@jgreben most 999s are now gone from the generated files. I can show after standup this morning.