google / patents-public-data

Patent analysis using the Google Patents Public Datasets on BigQuery
https://bigquery.cloud.google.com/dataset/patents-public-data:patents
Apache License 2.0
539 stars 163 forks source link

Dataset lacking cited_by data even though its available on the website. #69

Closed alekssadowski95 closed 2 years ago

alekssadowski95 commented 2 years ago

There are a lot of patents that dont have any cited_by data in patents-public-data.google_patents_research.publications even though the data is available on the website. Has anyone also tried to get that data and has a solution to this?

Here is an example: Google Patents website Document with publication number 'US-9824690-B2' does have 80 cited by entries on the website https://patents.google.com/patent/US9824690B2/en?q=makerbot&assignee=Makerbot+Industries%2c+Llc

grafik

Google Cloud BiqQuery But there are none in the patents-public-data.google_patents_research.publications dataset.

SELECT publication_number, cited_by
FROM `patents-public-data.google_patents_research.publications` 
WHERE publication_number = 'US-9824690-B2'
LIMIT 50

grafik

Can anyone explain this?

wetherbeei commented 2 years ago

The citations on patents.google.com for this document all come from "family" citations (citations to another publication in the same family as US-9824690-B2). US-9824690-B2 itself is not cited by any other patent publication.

The family_id field will let you aggregate all citations to a family with a few JOINs.

See for info on families: https://www.epo.org/searching-for-patents/helpful-resources/first-time-here/patent-families.html

alekssadowski95 commented 2 years ago

@wetherbeei Thank you for your quick answer. I have not known about how these family citations work and still dont fully understand them. Do those family publications actually cite this publication or is this "citation" just generated for this database?

wetherbeei commented 2 years ago

For example the third family citation listed on US-9824690-B2 is US20140074272A1. If you look at the citations of US20140074272A1 it cites US9108450B2 which is a family member of US9824690B2 (see the "Worldwide applications" list of country codes for all of the family members).

SELECT publication_number FROM ( SELECT family_id FROM patents-public-data.patents.publications WHERE publication_number = "US-9824690-B2" ) pubs LEFT JOIN patents-public-data.patents.publications family_pubs ON family_pubs.family_id = pubs.family_id

Row publication_number  
1 US-9824690-B2  
2 US-9349374-B2  
3 US-9108450-B2  
4 US-2013292881-A1  
5 US-2015348557-A1  
6 US-2013297320-A1  
7 US-9404200-B2  
8 US-2016260432-A1

This will select 126 citing publication numbers. On patents.google.com we de-duplicate by the citing publication's family_id when two publications (e.g. the US and EP version) cite the same publication.

SELECT pubs.family_id, family_pubs.publication_number AS family_publication, citations.citing_pub_number FROM ( SELECT family_id FROM patents-public-data.patents.publications WHERE publication_number = "US-9824690-B2" ) pubs LEFT JOIN patents-public-data.patents.publications family_pubs ON family_pubs.family_id = pubs.family_id LEFT JOIN ( SELECT pubs.publication_number AS citing_pub_number, cite.publication_number AS cited_pub_number FROM patents-public-data.patents.publications pubs, UNNEST(citation) AS cite ) AS citations ON family_pubs.publication_number = citations.cited_pub_number

Row family_id family_publication citing_pub_number  
20 49511931 US-2013297320-A1 US-10803501-B1  
21 49511931 US-2013297320-A1 CN-110570847-A  
22 49511931 US-2013297320-A1 CN-108656558-A  
23 49511931 US-2013297320-A1 US-11004126-B1  
24 49511931 US-2013297320-A1 CN-107521088-A  
25 49511931 US-2013297320-A1 US-10979581-B2  
26 49511931 US-2013297320-A1 US-9349374-B2  
27 49511931 US-2013297320-A1 US-2016176118-A1  
28 49511931 US-2013297320-A1 US-2022124208-A1  
29 49511931 US-2013297320-A1 US-10079016-B2  
30 49511931 US-2013297320-A1 US-2015205544-A1  
31 49511931 US-2013297320-A1 US-10229679-B1  
32 49511931 US-2013297320-A1 US-11176935-B2  
33 49511931 US-2013297320-A1 US-2020133594-A1  
34 49511931 US-2013297320-A1 US-10460342-B1  
35 49511931 US-2013297320-A1 US-10929904-B1  
36 49511931 US-2013297320-A1 US-10545481-B2  
37 49511931 US-2013297320-A1 US-10836110-B2  
38 49511931 US-2013297320-A1 US-2019068809-A1  
39 49511931 US-2013297320-A1 US-2015197064-A1  
40 49511931 US-2013297320-A1 US-11283943-B2  
41 49511931 US-2013297320-A1 US-2018288248-A1  
42 49511931 US-2013297320-A1 US-11392396-B1  
43 49511931 US-2013297320-A1 US-2019156825-A1  
44 49511931 US-2013297320-A1 US-11252286-B2  
45 49511931 US-2013297320-A1 US-2019061262-A1  
46 49511931 US-2013297320-A1 US-2015100149-A1  
47 49511931 US-2013297320-A1 US-11276095-B1  
48 49511931 US-2013297320-A1 US-11072122-B2  
49 49511931 US-2013297320-A1 US-10553210-B2  
50 49511931 US-2013297320-A1 US-10373183-B1
alekssadowski95 commented 2 years ago

@wetherbeei Thank you very much. I also noticed that I had a bug in my implementation something to do with case sensitivity of strings which reduced the number of results. e.g. "MakerBot" vs "Makerbot"