ucldc / rikolti

calisphere harvester 2.0

Reharvest collections to pick up date/decade enrichments: non-Nuxeo sources + Nuxeo sources (check w/ campuses) #1107

Open christinklez opened 2 weeks ago

christinklez commented 2 weeks ago
amywieliczka commented 1 week ago

| | rikolti-prd | rikolti-stg |
| --- | --- | --- |
| total | 2,137,124 records | 2,146,265 records |
| ETLed w/ date data | 1,603,517 records (75%) | 1,610,558 records (75%) |
| ETLed w/out date data | 533,607 records (25%) | 535,707 records (25%) |
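
The percentages in this first breakdown can be rechecked with a few lines of Python. This is only a sketch: the counts are copied from the figures above, and the helper name `percent_share` and the dictionary layout are made up for illustration, not taken from rikolti.

```python
def percent_share(part: int, total: int) -> int:
    """Return part as a whole-number percentage of total."""
    return round(100 * part / total)


# Record counts copied from the breakdown above.
totals = {"rikolti-prd": 2_137_124, "rikolti-stg": 2_146_265}
with_date = {"rikolti-prd": 1_603_517, "rikolti-stg": 1_610_558}
without_date = {"rikolti-prd": 533_607, "rikolti-stg": 535_707}

for target, total in totals.items():
    # The two groups should account for every record.
    assert with_date[target] + without_date[target] == total
    print(
        target,
        percent_share(with_date[target], total),
        percent_share(without_date[target], total),
    )
```

Both targets come out at 75% with date data and 25% without, matching the reported figures.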

Homing in on the 25% of records without date data:

| | rikolti-prd | rikolti-stg |
| --- | --- | --- |
| w/out date data | 533,607 records | 535,707 records |
| w/ version_path | 194,201 records (36%) | 160,437 records (30%) |
| w/out version_path | 339,406 records (64%) | 375,270 records (70%) |

Described in collections:

| | rikolti-prd | rikolti-stg |
| --- | --- | --- |
| w/out date data | 1168 collections | 1172 collections |
| w/ version_path | 439 collections (37%) | 412 collections (35%) |
| w/out version_path | 729 collections (62%) | 760 collections (65%) |

So we know which vernacular version was run through the pipeline and published for 37% of published collections missing date data and 35% of staged collections missing date data.

Homing in on the 62-65% of collections without version paths:

| | rikolti-prd | rikolti-stg |
| --- | --- | --- |
| w/out version_path | 729 collections | 760 collections |
| w/ one vernacular version in s3 | 3 collections | 30 collections |
| w/ many vernacular versions in s3 | 726 collections | 720 collections |

So we can infer which vernacular version was run through the pipeline and published for 3 more published collections and 30 more staged collections because there is only one vernacular version stored in s3.
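
The inference rule described here could be sketched roughly as follows. This is an illustration under assumptions, not rikolti code: the function name `infer_version_path`, its arguments, and the example paths are all hypothetical, and listing the vernacular versions in s3 is assumed to have happened already.

```python
from typing import Optional, Sequence


def infer_version_path(
    recorded_version_path: Optional[str],
    vernacular_versions_in_s3: Sequence[str],
) -> Optional[str]:
    """Pick the vernacular version a collection was published from, if knowable.

    - A recorded version_path is trusted as-is.
    - With no recorded path and exactly one vernacular version in s3,
      that single version must be the one that was published.
    - With several (or zero) versions, nothing can be inferred.
    """
    if recorded_version_path:
        return recorded_version_path
    if len(vernacular_versions_in_s3) == 1:
        return vernacular_versions_in_s3[0]
    return None


# Hypothetical paths, for illustration only.
print(infer_version_path(None, ["example/vernacular_v1"]))
print(infer_version_path(None, ["example/vernacular_v1", "example/vernacular_v2"]))
```

The first call returns the single stored version; the second returns `None`, which corresponds to the collections with many vernacular versions in s3.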

Nuxeo Analysis

| | rikolti-prd | rikolti-stg |
| --- | --- | --- |
| total | 2,137,124 records | 2,146,265 records |
| Nuxeo w/out date data | 124,089 records (6%) | 124,213 records (6%) |

Homing in on the 6% of records that come from Nuxeo and lack date data:

| | rikolti-prd | rikolti-stg |
| --- | --- | --- |
| Nuxeo w/out date data | 124,089 records | 124,213 records |
| Nuxeo w/ version_path | 2,335 records | 1,921 records |
| Nuxeo w/out version_path | 121,754 records | 122,292 records |

Described in collections:

| | rikolti-prd | rikolti-stg |
| --- | --- | --- |
| Nuxeo w/out date data | 303 collections | 306 collections |
| Nuxeo w/ version_path | 13 collections | 14 collections |
| Nuxeo w/out version_path | 290 collections | 292 collections |

Homing in on those 290 Nuxeo collections without version paths:

| | rikolti-prd | rikolti-stg |
| --- | --- | --- |
| Nuxeo w/out version_path | 290 collections | 292 collections |
| Nuxeo w/ one vernacular version in s3 | 0 collections | |
| Nuxeo w/ many vernacular versions in s3 | 290 collections | |

So we can't infer a version path for any of the 290 collections without version paths.