Closed by simonwiles 4 years ago
Sorry, just saw this! I'll take a look first thing tomorrow.
@simonwiles This is all coming along great! It's strange we're running into duplicates. Definitely go ahead and make the change for now; do you have any examples of observations with duplicate `application-reference`s? I may want to look into those and ensure they aren't errors in the data.
@gmoore016 I'll email you with a full breakdown.
I've just completed a full run of all the grants documents from 2002-2019. It spat out a sqlite file which is 53.8 GB (without indexes of any kind -- basic PK and FK constraints add ~20% to the file size) and contains a total of 288,841,106 records in 8 tables, for a total of 4,616,172 documents processed.
In doing some basic analysis to check all's well, however, I seem to have found a problem. In all the config files pertaining to grants, the primary key for the `patent` table is sourced from the `application-reference` section of the XML record (`SDOBI/B200/` for the pre-2005 files). Unfortunately, it seems that this value does not uniquely identify a grant document, as multiple documents may reference the same application. Rather, I think, the primary key should be taken from the `publication-reference` (`SDOBI/B100/` for the pre-2005 files), as effected by this PR.
@gmoore016 -- could you take a look at this please and give me your opinion?
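For anyone wanting to reproduce the check, here is a minimal sketch of the kind of duplicate-key query that would surface the problem described above. It is not the analysis actually run for this PR: the column names `application_number` and `publication_number`, the helper `find_duplicate_keys`, and the toy rows are all assumptions made for illustration, and the real schema is whatever the config files produce.

```python
import sqlite3


def find_duplicate_keys(conn, table, column):
    """Return (value, count) pairs for values of `column` that occur more than once.

    `table` and `column` are interpolated into the SQL, so pass trusted names only.
    """
    query = (
        f'SELECT "{column}", COUNT(*) FROM "{table}" '
        f'GROUP BY "{column}" HAVING COUNT(*) > 1 '
        f'ORDER BY COUNT(*) DESC, "{column}"'
    )
    return conn.execute(query).fetchall()


def demo():
    # Toy in-memory database; the schema and values are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patent (application_number TEXT, publication_number TEXT)"
    )
    conn.executemany(
        "INSERT INTO patent VALUES (?, ?)",
        [
            ("A1", "P1"),  # two grant documents referencing the same application
            ("A1", "P2"),
            ("A2", "P3"),
        ],
    )
    dup_app = find_duplicate_keys(conn, "patent", "application_number")
    dup_pub = find_duplicate_keys(conn, "patent", "publication_number")
    conn.close()
    return dup_app, dup_pub


if __name__ == "__main__":
    dup_app, dup_pub = demo()
    print("duplicate application refs:", dup_app)
    print("duplicate publication refs:", dup_pub)
```

On the full output file the same query can be pointed at the real database path. Without an index on the column it will need a full table scan, so it may take a while on a file of this size.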