piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

Patent data from patentsview.org #29

Open piskvorky opened 6 years ago

piskvorky commented 6 years ago

The site http://www.patentsview.org/download/ offers USPTO patents in several formats, including a version with patent descriptions: 6,260,847 rows (patents) in 39.40 GB.

Check whether this data is better than the patent dataset we have in gensim-data now (which is somewhat hard to parse and understand), find out what the license is, and use it to enhance our patent dataset offering.
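
For a first look at parseability, a file of this size would have to be streamed rather than loaded into memory. Below is a minimal sketch of that, assuming the download is a tab-separated file with a header row. The delimiter and the column names `patent_id` and `text` are assumptions for illustration, not taken from the PatentsView documentation, and the function name `iter_patents` is hypothetical.

```python
import csv
import io
from typing import Iterable, Iterator, Tuple

# Patent descriptions can be much longer than the csv module's default
# per-field limit (131072 characters), so raise it before reading.
csv.field_size_limit(2**31 - 1)


def iter_patents(
    lines: Iterable[str],
    id_field: str = "patent_id",  # assumed column name
    text_field: str = "text",     # assumed column name
) -> Iterator[Tuple[str, str]]:
    """Yield (patent id, description) pairs one row at a time.

    `lines` is any iterable of text lines, e.g. an open file object.
    Rows with an empty description are skipped.
    """
    # Assumption: tab-separated with a header row and no quoting.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        text = (row.get(text_field) or "").strip()
        if text:
            yield row[id_field], text


# Tiny in-memory sample standing in for the real 39.40 GB file.
sample = "patent_id\ttext\n1\tA widget.\n2\t\n3\tA gadget.\n"
records = list(iter_patents(io.StringIO(sample)))
```

With a real file, the same generator could be fed `open(path, encoding="utf-8", newline="")`, so that only one row is held in memory at a time.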