google / patents-public-data

Patent analysis using the Google Patents Public Datasets on BigQuery
https://bigquery.cloud.google.com/dataset/patents-public-data:patents
Apache License 2.0

Large decrease in number of OCID annotations #90

Open ravwojdyla opened 6 months ago

ravwojdyla commented 6 months ago

We have observed a large decrease in the number of OCID annotations available in the recent Google Patents public data. We specifically consume the OCIDs associated with patents, so I will focus on that here. It appears that a large number of patents that used to be annotated with the OCIDs of specific entities (in our case genes) are no longer annotated with those OCIDs.

To give one specific example, take STAT1/ENSG00000115415 (OCID:102100019657) and application US-201816499393-A. The previous release had 32 OCIDs associated with this application:

OCIDs

102100004941 102100002816 102100017157 102100004159 102100016295 102100020485 102100017509 102100019667 102100019388 102100008658 102100018913 102100015895 102100017329 102100019517 102100009637 102100000197 102100008614 102100005617 102100016662 102100009641 102100017996 102100019657 102100003514 102100017933 102100009664 102100019816 102100015722 102100017932 102100019099 102100012464 102100010255 102100002212
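A minimal sketch of how this per-application comparison could be reproduced once the OCIDs for one application have been exported from two releases. The helper name `dropped_ocids` and the plain-list layout are assumptions made for illustration, not part of the dataset or any Google API. The previous-release list is the one above; the recent-release list is empty because the most recent release returns no OCIDs for this application.

```python
# Hypothetical helper: the name and the list-based layout are assumptions
# for illustration, not part of the patents-public-data tooling.
def dropped_ocids(previous, recent):
    """Return OCIDs present in `previous` but absent from `recent`, sorted."""
    return sorted(set(previous) - set(recent))


# OCIDs attached to US-201816499393-A in the previous release (from this issue).
previous_release = """
102100004941 102100002816 102100017157 102100004159 102100016295 102100020485
102100017509 102100019667 102100019388 102100008658 102100018913 102100015895
102100017329 102100019517 102100009637 102100000197 102100008614 102100005617
102100016662 102100009641 102100017996 102100019657 102100003514 102100017933
102100009664 102100019816 102100015722 102100017932 102100019099 102100012464
102100010255 102100002212
""".split()

# The most recent release has no OCIDs for this application.
recent_release = []

STAT1_OCID = "102100019657"

missing = dropped_ocids(previous_release, recent_release)
print(len(missing))           # 32
print(STAT1_OCID in missing)  # True
```

Using a set difference keeps the check independent of row order and of repeated annotations for the same OCID.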

The most recent release doesn't have any, so the STAT1 annotation is missing completely even though STAT1 is clearly in the text of the patent. Further, if we count the unique patents annotated with the STAT1 OCID over time:

[image: number of unique patents annotated with the STAT1 OCID over time]

In the most recent public data there appear to be half as many publications with STAT1 annotations. Is there any specific reason for this?
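For anyone who wants to reproduce the count, here is a rough sketch of the aggregation, assuming annotation rows have been exported as `(release, publication_number, ocid)` tuples. That layout, the function name, and the toy rows are all hypothetical placeholders, not real data from the dataset.

```python
from collections import defaultdict


# Hypothetical helper: the tuple layout (release, publication_number, ocid)
# is an assumed export format, not the dataset's actual schema.
def unique_publications_per_release(rows, ocid):
    """Count distinct publications annotated with `ocid` in each release."""
    publications = defaultdict(set)
    for release, publication_number, row_ocid in rows:
        if row_ocid == ocid:
            publications[release].add(publication_number)
    return {release: len(pubs) for release, pubs in sorted(publications.items())}


# Toy placeholder rows (not real data). A publication can be annotated with
# the same OCID more than once, so duplicates must not inflate the count.
toy_rows = [
    ("previous", "PUB-A", "102100019657"),
    ("previous", "PUB-A", "102100019657"),
    ("previous", "PUB-B", "102100019657"),
    ("previous", "PUB-B", "102100004941"),
    ("recent", "PUB-B", "102100019657"),
]

counts = unique_publications_per_release(toy_rows, "102100019657")
print(counts)  # {'previous': 2, 'recent': 1}
```

Collecting publication numbers into a set per release is what makes this a count of unique patents rather than a count of annotation rows.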

This could be related to https://github.com/google/patents-public-data/issues/88

ravwojdyla commented 6 months ago

👋 @wetherbeei, your help in https://github.com/google/patents-public-data/issues/54#issuecomment-937980664 was invaluable in the past. Do you have any immediate thoughts or recommendations on this issue? Thank you in advance.