sul-cidr / patent_data_extractor

Wrong PKs in grant config? #13

Closed by simonwiles 4 years ago

simonwiles commented 4 years ago

I've just completed a full run of all the grant documents from 2002-2019. It spat out a sqlite file which is 53.8 GB (without indexes of any kind -- basic PK and FK constraints add ~20% to the file size) and contains a total of 288,841,106 records in 8 tables, for a total of 4,616,172 documents processed.

In doing some basic analysis to check all's well, however, I seem to have found a problem. In all the config files pertaining to grants, the primary key for the patent table is sourced from the application-reference section of the XML record (SDOBI/B200/ for the pre-2005 files). Unfortunately, it seems that this value does not uniquely identify a grant document, as multiple documents may reference the same application. Rather, I think, the primary key should be taken from the publication-reference (SDOBI/B100/ for the pre-2005 files), as effected by this PR.
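To illustrate the collision, here is a minimal Python sketch. It is not the extractor's own code. The `<grants>` wrapper, the document numbers, and the `document-id/doc-number` layout under each reference section are assumptions made for the example (roughly the shape of the 2005+ files), and `key_counts` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two made-up grant records that cite the same application.
# All numbers are invented; the <grants> wrapper exists only so the
# sample parses as a single XML document.
SAMPLE = """
<grants>
  <us-patent-grant>
    <us-bibliographic-data-grant>
      <publication-reference>
        <document-id><doc-number>0000001</doc-number></document-id>
      </publication-reference>
      <application-reference>
        <document-id><doc-number>99999999</doc-number></document-id>
      </application-reference>
    </us-bibliographic-data-grant>
  </us-patent-grant>
  <us-patent-grant>
    <us-bibliographic-data-grant>
      <publication-reference>
        <document-id><doc-number>0000002</doc-number></document-id>
      </publication-reference>
      <application-reference>
        <document-id><doc-number>99999999</doc-number></document-id>
      </application-reference>
    </us-bibliographic-data-grant>
  </us-patent-grant>
</grants>
"""


def key_counts(xml_text, section):
    """Count how often each doc-number appears under the given reference section."""
    root = ET.fromstring(xml_text)
    path = f".//{section}/document-id/doc-number"
    return Counter(el.text for el in root.findall(path))


app_counts = key_counts(SAMPLE, "application-reference")
pub_counts = key_counts(SAMPLE, "publication-reference")

# Any value seen more than once cannot serve as a primary key.
print("application-reference repeats:", {k: n for k, n in app_counts.items() if n > 1})
print("publication-reference repeats:", {k: n for k, n in pub_counts.items() if n > 1})
```

In this sample the application number repeats across the two records while each publication number appears once, which is the pattern described above.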

@gmoore016 -- could you take a look at this please and give me your opinion?

gmoore016 commented 4 years ago

Sorry, just saw this! I'll take a look first thing tomorrow.

gmoore016 commented 4 years ago

@simonwiles This is all coming along great! It's strange we're running into duplicates. Definitely go ahead and make the change for now; do you have any examples of observations with duplicate application-references? I may want to look into those and ensure they aren't errors in the data.
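One way to pull such examples from the sqlite output is a `GROUP BY` query, sketched below in Python. The table name `patent`, the column names `application_number` and `publication_number`, and the helper `duplicate_keys` are hypothetical; the real names depend on the config files. The in-memory rows are invented for the demonstration.

```python
import sqlite3


def duplicate_keys(conn, table, column, limit=10):
    """Return (value, count) pairs for values that occur more than once in a column.

    Table and column names are interpolated into the SQL, so only pass
    trusted identifiers.
    """
    query = (
        f'SELECT "{column}", COUNT(*) FROM "{table}" '
        f'GROUP BY "{column}" HAVING COUNT(*) > 1 '
        f"ORDER BY COUNT(*) DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


# Tiny in-memory stand-in for the real database (invented rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patent (application_number TEXT, publication_number TEXT)")
conn.executemany(
    "INSERT INTO patent VALUES (?, ?)",
    [("A1", "P1"), ("A1", "P2"), ("A2", "P3")],
)

dupes = duplicate_keys(conn, "patent", "application_number")
print(dupes)
```

Running the same helper against the publication column should return an empty list if that value is a valid primary key.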

simonwiles commented 4 years ago

@gmoore016 I'll email you with a full breakdown.