piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

Add patents from Google #8

Closed: piskvorky closed this issue 6 years ago

piskvorky commented 6 years ago

2,437,000 documents, 248 million sentences, 7.7 billion words: patents from http://deepdive.stanford.edu/opendata/#patent-google-patents (428 GB). This is an already-preprocessed dataset: tokenized, with each token tagged by the Stanford parser.

I downloaded this patent dataset (in both SQL and CoNLL formats) to h2.
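
For anyone poking at the CoNLL files, here's a minimal reader sketch (the column layout and filename are assumptions; Stanford CoreNLP's CoNLL output is typically tab-separated with the token in the second column):

```python
def read_conll_sentences(path):
    """Yield sentences as lists of column lists from a CoNLL-style file."""
    sentence = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:  # blank line marks a sentence boundary
                if sentence:
                    yield sentence
                sentence = []
            else:
                sentence.append(line.split("\t"))
    if sentence:  # flush a trailing sentence with no final blank line
        yield sentence

# Example: pull out just the tokens (assumed to sit in the second column).
for sent in read_conll_sentences("patents.conll"):  # hypothetical filename
    tokens = [cols[1] for cols in sent]
    print(tokens)
    break
```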

There's also the original Google Patents dataset, where the documents are raw scanned images of the patent pages (multi-page TIF). Also on h2, but probably less interesting for gensim-data.

piskvorky commented 6 years ago

Another link: the same data, but with different preprocessing, and it seems to be of higher quality (XML format, patents split by claim): http://patents.reedtech.com/pgrbft.php

VanL commented 6 years ago

The highest-quality patent data right now is probably from PatentsView, supported by the USPTO itself. They have individual TSV files (http://www.patentsview.org/download/) as well as a MySQL dump of the entire portfolio (latest at http://www.patentsview.org/data/patent_20170808.zip, data dictionary at http://www.patentsview.org/data/Patents_DB_dictionary_bulk_downloads.xlsx).

License is CC-BY 4.0. The advantage is that the files have already been normalized from the raw data dumps.
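
A quick way to peek at one of those TSV files with Python's csv module (the filename and column names below are assumptions; check the data dictionary for the real schema):

```python
import csv

# Stream one PatentsView TSV file record by record.
with open("patent.tsv", encoding="utf-8", newline="") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        print(row.get("id"), row.get("title"))
        break  # just peek at the first record
```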

I'm already doing some work with this dataset; let me know if there's a way I can help.

piskvorky commented 6 years ago

@VanL thanks for the link!

What would be most helpful is summarizing the basic metadata for this dataset and writing a clear, concise description (why this dataset? potential use cases? links to related research / applications using this dataset, etc.). See the existing datasets for an example.

VanL commented 6 years ago

There are a lot of possible datasets from this. I have seen people focus on:

1. Bibliographic/front-page data (good for examining statistics and creating graphs showing connections)
2. Abstracts (short texts summarizing the advance)
3. Claims (what describes the legally protected advance; also serves as a description)
4. Full text of the specification (the longest part of the patent; describes the invention and its subparts in context)
5. All of the above

What would make the most sense? Or, if I provided a JSON format for each doc, would people index into it as they wanted?

piskvorky commented 6 years ago

Yeah, I think that may be the easiest option -- an iterable over the full dataset that yields one dict per item (one patent).
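
Something like this minimal sketch (the filename, field names, and JSON-lines layout are all hypothetical):

```python
import json

def iter_patents(path):
    """Stream the corpus, yielding one dict per patent (JSON-lines assumed)."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

# Callers keep only the fields they care about (field names hypothetical).
for patent in iter_patents("patents.jsonl"):
    print(patent.get("title"), patent.get("claims"))
    break
```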

Then people can filter this dict for whatever values they need. @menshikh-iv WDYT?

menshikh-iv commented 6 years ago

@piskvorky I agree, but there's a problem with the dataset size (it's very big).

menshikh-iv commented 6 years ago

I created a special label for really big datasets; we can't add this one right now (we'd need special storage for it, like Amazon S3 or something similar). That's a plan for the future.

menshikh-iv commented 6 years ago

I'll use the files from reedtech for 2017.

kamalgupta0808 commented 6 years ago

Can I use this corpus to find similar patents using gensim or any other library? I am working on the Google dataset and want to find similar patents that have the same claims (same context). FastText (Facebook) and gensim (Doc2Vec) are possible approaches I am looking to apply, but I am new to NLP, so I don't know how to go forward. Any help will be much appreciated. Thank you in advance.

menshikh-iv commented 6 years ago

Yes, you can use one of the submitted datasets; see https://github.com/RaRe-Technologies/gensim-data/releases/tag/patent-2017 (you can also use the links suggested in issues about other datasets).
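
For example, a rough Doc2Vec sketch using gensim's downloader API (what each corpus record looks like is an assumption here; inspect one item first):

```python
from itertools import islice

import gensim.downloader as api
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Fetch the patent-2017 corpus via gensim's downloader API.
corpus = api.load("patent-2017")

# NOTE: treating each record as plain text is an assumption -- check what the
# corpus actually yields. islice keeps the sample small; the full corpus is large.
documents = [
    TaggedDocument(words=str(doc).split(), tags=[i])
    for i, doc in enumerate(islice(corpus, 10000))
]

model = Doc2Vec(documents, vector_size=100, window=5, min_count=5, workers=4)

# Patents most similar to the first one.
print(model.dv.most_similar(0, topn=5))
```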

For further advice, please ask your question on the mailing list.