shelleydoljack opened 3 weeks ago
In the logs for the number_of_records task, there is info we could pull out into an email:
[2024-08-17, 02:04:30 UTC] {full_dump_marc.py:45} INFO - Record count: 9785445
[2024-08-17, 02:04:30 UTC] {python.py:202} INFO - Done. Returned value was: 9785445
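Since the count is also the task's return value, it is already pushed to XCom and an email task could read it from there. As a fallback, here is a sketch of scraping it out of the log line itself (the regex and function name are assumptions based on the sample above, not anything in the DAG):

```python
import re

# Hypothetical helper: pull the record count out of a log line like the
# one above. The pattern is inferred from the sample log format.
COUNT_RE = re.compile(r"Record count: (\d+)")

def extract_record_count(line):
    """Return the record count from a full_dump_marc log line, or None."""
    m = COUNT_RE.search(line)
    return int(m.group(1)) if m else None

line = "[2024-08-17, 02:04:30 UTC] {full_dump_marc.py:45} INFO - Record count: 9785445"
print(extract_record_count(line))  # 9785445
```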
In the calculate_start_stop task, the logs for the mapped task instances show:
[2024-08-17, 02:09:41 UTC] {full_dump_retrieval.py:103} INFO - Output in calculate_start_stop {'start': 0, 'stop': 2000000}
[2024-08-17, 02:09:41 UTC] {python.py:202} INFO - Done. Returned value was: {'start': 0, 'stop': 2000000}
[2024-08-17, 02:04:40 UTC] {full_dump_retrieval.py:103} INFO - Output in calculate_start_stop {'start': 2000000, 'stop': 4000000}
[2024-08-17, 02:04:40 UTC] {python.py:202} INFO - Done. Returned value was: {'start': 2000000, 'stop': 4000000}
☝️ Could this be the cause of some of the duplication? Shouldn't the next mapped task have start: 2000001?
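Whether those contiguous `{start, stop}` pairs duplicate records depends on how the retrieval query uses `stop` — the SQL isn't visible here, but a quick sketch of the two interpretations (the `batches` function is illustrative, mimicking the calculate_start_stop output above):

```python
# Illustrative: reproduce the {'start': ..., 'stop': ...} pairs logged above.
def batches(total, size):
    return [{"start": s, "stop": min(s + size, total)} for s in range(0, total, size)]

pairs = batches(9_785_445, 2_000_000)

# Half-open semantics (id >= start AND id < stop): each stop equals the
# next start, so the batches tile the id space with no overlap.
assert all(a["stop"] == b["start"] for a, b in zip(pairs, pairs[1:]))

# Inclusive semantics (id <= stop) would fetch every boundary id twice;
# in that case the next batch really should begin at start + 1 (2000001).
```

So `start: 2000000` is fine if the query is exclusive on `stop`; if it's inclusive, that would explain duplicates appearing exactly at the batch boundaries.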
In the logs for the transform_marc_records_add_holdings task, I see output like this that might be good to pull out for a reporting email:
[2024-08-17, 04:11:40 UTC] {transformer.py:111} INFO - Writing 4,537 modified MARC records to /folio-data-export-prod/data-export-files/full-dump/marc-files/5000_10000.mrc
[2024-08-17, 04:11:57 UTC] {transformer.py:111} INFO - Writing 4,519 modified MARC records to /folio-data-export-prod/data-export-files/full-dump/marc-files/10000_15000.mrc
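Those lines are easy to roll up into a per-file table for the report. A sketch, with the regex and names assumed from the samples above:

```python
import re

# Pattern inferred from the transformer.py log lines shown above.
WRITE_RE = re.compile(r"Writing ([\d,]+) modified MARC records to (\S+)")

def per_file_counts(log_lines):
    """Map output file path -> modified-record count from transformer logs."""
    counts = {}
    for line in log_lines:
        if m := WRITE_RE.search(line):
            counts[m.group(2)] = int(m.group(1).replace(",", ""))
    return counts
```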
For the transform_marc_records_clean_serialize task, this stuff in the logs might also be good to add to an email report. For the serializing and removing-fields logging, maybe add the filename, and skip pulling the smart_open calls out of the log.
[2024-08-17, 05:27:29 UTC] {s3.py:1054} INFO - smart_open.s3.MultipartWriter('folio-data-export-prod', 'data-export-files/full-dump/marc-files/5000_10000.mrc'): uploading part_num: 1, 8711374 bytes (total 0.008GB)
[2024-08-17, 05:27:30 UTC] {transforms.py:131} INFO - Serializing 4537 MARC records as xml
[2024-08-17, 05:27:35 UTC] {s3.py:1054} INFO - smart_open.s3.MultipartWriter('folio-data-export-prod', 'data-export-files/full-dump/marc-files/5000_10000.xml'): uploading part_num: 1, 22124796 bytes (total 0.021GB)
[2024-08-17, 05:27:38 UTC] {transforms.py:102} INFO - Removing MARC fields using AWS S3 with path: /folio-data-export-prod/data-export-files/full-dump/marc-files/10000_15000.mrc
[2024-08-17, 05:27:40 UTC] {transforms.py:107} INFO - Removing MARC fields for 4,519 records
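Skipping the smart_open/s3.py noise could be as simple as matching on the module name in the log line. A sketch, with the pattern assumed from the samples above (adding the filename to the "Serializing ..." message itself would need a logging change in transforms.py):

```python
import re

# Keep only transforms.py messages; the s3.py/smart_open lines won't match.
TRANSFORMS_RE = re.compile(r"\{transforms\.py:\d+\} INFO - (.+)")

def report_lines(log_lines):
    """Extract the transforms.py messages for the email report."""
    return [m.group(1) for line in log_lines if (m := TRANSFORMS_RE.search(line))]
```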
In the meantime, I have been running this script I wrote around marcli:
#!/opt/homebrew/bin/bash
# First run `brew install marcli`
# Usage: ./marc_record_counts.sh [development|test|stage|prod] [xml.gz|mrc]
for F in $(aws s3 ls "folio-data-export-${1}/data-export-files/full-dump/marc-files/" | awk '{print $3" "$4}' | grep -Ev '^0|^102|^112' | awk '{print $2}'); do
  if [[ $F == *.$2 ]]; then
    aws s3 cp --quiet "s3://folio-data-export-${1}/data-export-files/full-dump/marc-files/${F}" "/tmp/${F}"
    if [[ $2 == xml.gz ]]; then
      gunzip "/tmp/${F}"
      F="${F%.gz}"  # gunzip strips the .gz suffix, so track the new name
    fi
    echo -n "Num HRIDs "; marcli -file "/tmp/${F}" -fields 001 | grep -v '^[[:space:]]*$' | awk '{print $2}' | wc -l
    echo -n "Uniq HRIDs "; marcli -file "/tmp/${F}" -fields 001 | grep -v '^[[:space:]]*$' | awk '{print $2}' | sort -u | wc -l
    rm "/tmp/${F}"
  fi
done
We need better reporting on how many unique records are in the full dump: how many per file, how many files make up the full dump, how many (if any) duplicates are in the materialized view, etc. This could help us troubleshoot issues such as POD only getting 5.4 million unique records and duplicates occurring across files.
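Given per-file HRID lists (e.g. from marcli -fields 001, as in the script above), those numbers could be rolled up like this (a sketch; the function and key names are illustrative, not existing code):

```python
from collections import Counter

def full_dump_report(hrids_by_file):
    """Summarize total, unique, and cross-file duplicate record counts."""
    counts = Counter(h for hrids in hrids_by_file.values() for h in hrids)
    return {
        "files": len(hrids_by_file),
        "total_records": sum(counts.values()),
        "unique_records": len(counts),
        "duplicate_hrids": sorted(h for h, n in counts.items() if n > 1),
    }
```

The returned dict is exactly the kind of summary that could be formatted into the reporting email.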
With the previous full dump from Symphony, we'd get an email to the sul-unicorn-devs list with counts. Maybe consider this for when the full-dump selection DAG finishes.