r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

Patent Data #9

blester125 closed 4 months ago

blester125 commented 11 months ago

Domain: Patents

Can we use the Google Patents data for this?

It might be possible to use C4/Common Crawl data for this, as patents.google.com is one of the most represented domains in C4.
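A minimal sketch of what the C4 route could look like, assuming each record is a dict with `url` and `text` keys (as in the published C4 JSON files). The helper names `is_google_patents` and `filter_patent_records` are made up for illustration, and the sample URLs are placeholders rather than real patents.

```python
from urllib.parse import urlparse

# Domain named in the comment above.
PATENTS_HOST = "patents.google.com"


def is_google_patents(url: str) -> bool:
    """Return True when the URL's hostname is exactly patents.google.com."""
    host = urlparse(url).hostname or ""
    return host.lower() == PATENTS_HOST


def filter_patent_records(records):
    """Yield only the records whose `url` field points at patents.google.com."""
    for record in records:
        if is_google_patents(record.get("url", "")):
            yield record


# Placeholder records, not real C4 data.
sample = [
    {"url": "https://patents.google.com/patent/US0000000A/en", "text": "A widget ..."},
    {"url": "https://example.com/patents.google.com", "text": "not a patent page"},
]
kept = list(filter_patent_records(sample))
print(len(kept))
```

Matching on the parsed hostname rather than a substring keeps pages that merely mention the domain in their path from slipping through.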

craffel commented 11 months ago

For the record, from https://www.dol.gov/general/aboutdol/copyright

As part of the terms of granting the patent to the inventor, patents are published into the public domain.

chris-ha458 commented 9 months ago

Would this be relevant in this context?

https://www.uspto.gov/learning-and-resources/bulk-data-products

chris-ha458 commented 9 months ago

https://bulkdata.uspto.gov/ I'll take a look into some of those sets.

chris-ha458 commented 9 months ago

If this line of inquiry is fruitful, the following might be useful as well.

It ostensibly combines multiple countries' datasets and several other patent datasets as well. However, I do not have a proper GCP account (which is needed to run queries, and the queries themselves cost money), so I'd appreciate input from somebody familiar with GCP / GCP datasets: https://console.cloud.google.com/marketplace/product/google_patents_public_datasets/google-patents-public-data?pli=1&project=api-project-904060009868

StellaAthena commented 9 months ago

@sunnydigital @chris-ha458 any updates on this?

sunnydigital commented 9 months ago

> @sunnydigital @chris-ha458 any updates on this?

Hi Stella, I'm no longer working on this project. Let me unassign myself.

baberabb commented 9 months ago

> If this line of inquiry is fruitful, the following might be useful as well.
>
> It ostensibly combines multiple countries datasets and multiple other patent datasets as well. However, I do not have a proper GCP account (which is necessary for the queries and even the queries cost money) so I'd appreciate input from somebody familiar with GCP / GCP datasets https://console.cloud.google.com/marketplace/product/google_patents_public_datasets/google-patents-public-data?pli=1&project=api-project-904060009868

Had a look and they only have full text available for US publications. Other countries just have (very short) abstracts, from what I could tell. I can take a look at the sets available from the USPTO if no one else is working on this.

chris-ha458 commented 9 months ago

@baberabb can you share how you accessed it? Did it require GCP credits?

@StellaAthena I do think this is a plausible pathway, but I am not able to spearhead it at the moment. I will try to assist any efforts though.

baberabb commented 9 months ago

> @baberabb can you share how you accessed it? Did it require GCP credits?

It's available through BigQuery, which is Google's SQL-like database system. And yes, it charged me $20 after I made just a few requests. I think if you still have free GCP credits then you can use those.
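Since a few requests already cost $20, it may help to estimate the bytes a query would scan before running it. The sketch below is not what was actually run in this thread. It assumes the public table is `patents-public-data.patents.publications` with columns named `publication_number`, `country_code`, `title_localized`, and `abstract_localized` (check the schema before relying on these), that the `google-cloud-bigquery` package and GCP credentials are available, and that the price per TiB passed to `estimate_cost_usd` is a placeholder to be replaced with the current on-demand rate.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def build_us_query(limit: int = 10) -> str:
    """Build a query for US rows; table and column names are assumptions."""
    return f"""
SELECT publication_number, title_localized, abstract_localized
FROM `patents-public-data.patents.publications`
WHERE country_code = 'US'
LIMIT {int(limit)}
"""


def estimate_cost_usd(bytes_processed: int, usd_per_tib: float) -> float:
    """Convert scanned bytes into dollars at a caller-supplied rate."""
    return bytes_processed / TIB * usd_per_tib


def dry_run_bytes(sql: str) -> int:
    """Ask BigQuery how many bytes the query would scan, without running it."""
    # Imported here so the rest of the sketch works without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


# Offline example: half a TiB at a placeholder rate of 6.25 USD per TiB.
example_cost = estimate_cost_usd(TIB // 2, usd_per_tib=6.25)
print(example_cost)
```

With credentials in place, `estimate_cost_usd(dry_run_bytes(build_us_query()), usd_per_tib=...)` gives a rough figure up front. Note that a `LIMIT` clause generally does not reduce the bytes scanned under on-demand pricing; selecting fewer columns does.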

baberabb commented 9 months ago

Ok got trial access and did some more experimenting and we can just use the Google dataset IMO. They provide full-text for all US patent publications (not applications) and titles/abstracts for all others. All in plain-text as well so will be easy to format. Total 150m rows and seems to have the full US record till Oct 27, 2023.

sample extract here.
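To illustrate the "easy to format" point, here is a rough sketch of turning one row into a plain-text record. The row layout is an assumption: it supposes the `*_localized` fields come back as lists of `{"text": ..., "language": ...}` entries, and the output keys `id` and `text` are picked for illustration only. The sample row is invented, not taken from the dataset.

```python
import json

# Field order for the assembled document; field names are assumed, not confirmed.
TEXT_FIELDS = (
    "title_localized",
    "abstract_localized",
    "description_localized",
    "claims_localized",
)


def pick_text(localized, language="en"):
    """Return the text in the requested language, else the first entry, else ''."""
    entries = localized or []
    for entry in entries:
        if entry.get("language") == language:
            return entry.get("text", "")
    return entries[0].get("text", "") if entries else ""


def row_to_record(row):
    """Join the non-empty text fields of one row into a single document."""
    parts = [pick_text(row.get(field)).strip() for field in TEXT_FIELDS]
    text = "\n\n".join(part for part in parts if part)
    return {"id": row["publication_number"], "text": text}


# Invented row for demonstration.
row = {
    "publication_number": "US-0000000-A1",
    "title_localized": [{"text": "Widget", "language": "en"}],
    "abstract_localized": [{"text": " A widget that spins. ", "language": "en"}],
    "description_localized": [],
    "claims_localized": [{"text": "1. A widget.", "language": "en"}],
}
record = row_to_record(row)
print(json.dumps(record))
```

Rows that only carry a title and abstract (the non-US case described above) pass through the same function and simply produce a shorter `text`.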

StellaAthena commented 9 months ago

> Ok got trial access and did some more experimenting and we can just use the Google dataset IMO. They provide full-text for all US patent publications (not applications) and titles/abstracts for all others. All in plain-text as well so will be easy to format. Total 150m rows and seems to have the full US record till Oct 27, 2023.
>
> sample extract here.

Amazing!