Open Juan-Mateos opened 5 years ago
Issues found during data exploration:
- `ipc`, `nace2`, `tech_field` are almost completely missing (~100% of values are `np.nan`)
  - Which dataframe?
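
One way to put a number on this is the per-column share of missing values. This is a minimal sketch assuming the data is loaded into a pandas DataFrame; the name `df`, the helper `missing_share`, and the toy values are hypothetical, since which dataframe actually holds these columns is still the open question here.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame, columns: list[str]) -> pd.Series:
    """Return the fraction of missing (NaN/None) values for each requested column."""
    return df[columns].isna().mean()


# Toy data standing in for the real table (hypothetical values).
df = pd.DataFrame(
    {
        "appln_id": [1, 2, 3, 4],
        "ipc": [np.nan, np.nan, np.nan, "A61K"],
        "nace2": [np.nan, np.nan, np.nan, np.nan],
        "tech_field": [np.nan, np.nan, np.nan, np.nan],
    }
)

shares = missing_share(df, ["ipc", "nace2", "tech_field"])
print(shares)
```

Running the same helper on each candidate dataframe would show where the columns are actually empty and where they are populated.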
Observations pulled from notebooks in https://github.com/nestauk/patent_analysis/tree/90818d26a083af91fa5a37b355deb5ab546d6ee1/notebooks :
{'person_ctry_code': 'GB',
'earliest_filing_year': 2010,
'database': 'patstat_2019_05_13'}
`person_appln` table
- `han_name` appears to be missing universities
  - E.g. `psn_name` is 'UNIVERSITY OF CAMBRIDGE' but `han_name` is 'CAMBRIDGE ENTERPRISE LTD'.
- ~54% (297K entries) of `person_address` missing
  - `person_address` entries for the same `psn_name` but a different application
  - `psn_name`'s with missing addresses
- Often `person_address` is just a city/town
- Many names have multiple addresses
- The link between `person_id` and `person_name` is unclear
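
The name and address observations above could be reproduced along these lines. This is a sketch only: the DataFrame `persons` and its rows are hypothetical (apart from the Cambridge example quoted above), and the last step reflects just one possible reading of the address notes, namely that a missing address might be recoverable from another row with the same `psn_name`.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the person-level data (hypothetical values).
persons = pd.DataFrame(
    {
        "person_id": [1, 2, 3, 4, 5],
        "psn_name": [
            "UNIVERSITY OF CAMBRIDGE",
            "UNIVERSITY OF CAMBRIDGE",
            "ACME LTD",
            "ACME LTD",
            "ACME LTD",
        ],
        "han_name": [
            "CAMBRIDGE ENTERPRISE LTD",
            "CAMBRIDGE ENTERPRISE LTD",
            "ACME LTD",
            "ACME LTD",
            "ACME LTD",
        ],
        "person_address": [np.nan, "CAMBRIDGE", "LONDON", np.nan, "1 HIGH ST, LEEDS"],
    }
)

# Rows where the two name fields disagree.
name_mismatch = persons[persons["psn_name"] != persons["han_name"]]

# Share of rows with no address.
address_missing_share = persons["person_address"].isna().mean()

# Number of distinct non-missing addresses recorded per name
# (nunique ignores NaN by default).
addresses_per_name = persons.groupby("psn_name")["person_address"].nunique()

# Names that have at least one missing address but also at least one
# known address on another row.
has_missing = persons["person_address"].isna().groupby(persons["psn_name"]).any()
recoverable = has_missing & (addresses_per_name > 0)

print(name_mismatch[["psn_name", "han_name"]].drop_duplicates())
print(address_missing_share)
print(addresses_per_name)
print(recoverable)
```

Names with more than one distinct address would still need a rule for choosing between them, so this only flags candidates rather than filling anything in.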
`appln` table

- `earliest_filing_year` <= `earliest_publn_year`
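
The year ordering above can be checked by listing any rows that break it. A minimal sketch, assuming the table is available as a pandas DataFrame; the name `appln` for the DataFrame, the `appln_id` column, and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows standing in for the real table (hypothetical values).
appln = pd.DataFrame(
    {
        "appln_id": [10, 11, 12],
        "earliest_filing_year": [2010, 2012, 2015],
        "earliest_publn_year": [2011, 2012, 2014],
    }
)

# Rows that violate earliest_filing_year <= earliest_publn_year.
violations = appln[appln["earliest_filing_year"] > appln["earliest_publn_year"]]
print(violations)
```

An empty `violations` frame on the real data would confirm that the ordering holds for every application.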
`appln_abstract` table
`appln_ipc` table

- `ipc_version` has different dates: what does it mean for IPC data to be available at different points?
- `ipc_class_level` almost exclusively "A" (full IPC codes)
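
Both observations can be summarised with simple frequency counts. A sketch under the assumption that the table is loaded as a pandas DataFrame; the name `appln_ipc` for the DataFrame, the date format, and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows standing in for the real table (hypothetical values).
appln_ipc = pd.DataFrame(
    {
        "appln_id": [1, 1, 2, 3],
        "ipc_class_level": ["A", "A", "A", "C"],
        "ipc_version": ["2006-01-01", "2010-01-01", "2006-01-01", "2006-01-01"],
    }
)

# Share of rows at each classification level.
level_share = appln_ipc["ipc_class_level"].value_counts(normalize=True)

# Number of rows per IPC version date, in date order.
version_counts = appln_ipc["ipc_version"].value_counts().sort_index()

print(level_share)
print(version_counts)
```

Looking at `version_counts` per application could also show whether a single application carries codes from several IPC versions.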
This might include some questions and comments for @jaklinger and @russwinch