pulibrary / static-tables

Searchable index of Marquand Auction Catalogs
https://library.princeton.edu/marquand_catalogs/
MIT License

[Princeton Undergraduate Alumni Index, 1921-2012] Investigate record number discrepancy #171

Closed: maxkadel closed this issue 1 month ago

maxkadel commented 2 months ago

What maintenance needs to be done?

Investigate record number discrepancy. From stakeholder:

The new db reports having 88,313 records; the old db reports having 93,010.

Level of urgency

Why is this maintenance needed?

Acceptance criteria

Implementation notes, if any

There was a ticket to remove duplicate rows from this index, described in https://github.com/pulibrary/static-tables/issues/142

We noticed a discrepancy between the number of rows removed in https://github.com/pulibrary/static-tables/issues/142 and the difference that @LynnDurgin references. Check the number of rows in the source table from mudd-dbs and figure out why we're seeing about 5,000 fewer records instead of 9,000 fewer.
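For reference, the gap between the two reported totals can be worked out directly. This is a minimal sketch that uses only the figures quoted above; the variable names are illustrative.

```python
# Totals as reported by the stakeholder.
old_db_records = 93_010
new_db_records = 88_313

# Observed gap between the old and new databases.
observed_gap = old_db_records - new_db_records
print(observed_gap)  # 4697, i.e. roughly 5,000 rather than 9,000
```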

maxkadel commented 1 month ago

I think the 9,000-row change reported in the previous ticket was a typo.

The original data contained a lot of duplicates. When I repeated the deduplication on last name, first name, year, pubfile, and academicfile, I got exactly the same result as the previous deduplication. It seems that some duplicates were added to the old database over the years. No unique data has been lost.
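A minimal sketch of what deduplicating on those five fields could look like. This is not the script used for this ticket: the column names (`last_name`, `first_name`, `year`, `pubfile`, `academicfile`) and the sample rows are assumptions made up for illustration, and the real headers in the mudd-dbs export may differ.

```python
import csv
import io

# Assumed column names; the real headers in the source table may differ.
KEY_FIELDS = ("last_name", "first_name", "year", "pubfile", "academicfile")


def deduplicate(rows, key_fields=KEY_FIELDS):
    """Keep the first row seen for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        # Strip surrounding whitespace so trivially different copies match.
        key = tuple((row.get(field) or "").strip() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Tiny made-up sample, not real index data.
SAMPLE = """last_name,first_name,year,pubfile,academicfile
Doe,Jane,1950,yes,no
Doe,Jane,1950,yes,no
Doe,John,1950,yes,no
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
unique_rows = deduplicate(rows)

# Report total rows, unique rows, and how many duplicates were dropped.
print(len(rows), len(unique_rows), len(rows) - len(unique_rows))
```

Comparing the unique-row count from a run like this against the new db total is one way to check that only duplicates, and no unique records, were dropped.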