Development/Use-case - Githubissues

Hi @parkerhancock, this is clearly the best and most up to date resource to access the sometimes cumbersome USPTO data. We spoke about trademarks a while ago. I am now getting quite interested in the future development of this project in an attempt to study corporate innovation and its relationship with company success.

A few things to discuss:

It seems that the current scope of your project is to tease out individual patents, like to get a corpus of patents applied for and received both nationally and internationally for say "microsoft".
A research question that I have is how can one create three rectangular datasets for patent applications, grants, and assignments - needed for up to date machine learning/big data analyses.
It seems that it would be best to work with weekly XML files for applications and grants, and then the daily XML file for assignments and to keep the data up to date by parsing and uploading the parsed data to some database.
I also see weekly action files, which makes me wonder how they mediate the relationships between application and grant data, is it some sort of intermediary dataset that keeps you up to date before an application is granted? If so I guess this is also an interesting dataset to keep tabs on.
I have seen you comment on the @patentpy repository that your solution could replace theirs potentially, the developer here has a solution for grants but not yet for applications.
Is this something that your library provides, and if not where can one possibly find software that has streamlined thie above rectangular dataset creation process.
Given your expertise in this area, do you think it is worth to also learn more about the USPTO API, or should ones focus be on the bulk files?
I have noted that there are some additional datasets out there like the maintenance fee and classification datasets, do you think there are some smart ways to merge this info into these three (app, grant, assignmt) dataset.
Furthermore, I have seen some APIs somewhat separate from the original APIs like the patent examination system, would you know what that is all about, I also think patentsview can do some of the above, but is the problem not that it is too slow to update versus the fast weekly files above?
Lastely do you think I have missed anything out, is there other datasets that I should consider/find, like citation data?

Its great to see someone as interested in this field as I am looking forward to any response or guidence you might give.

Edit: @mustberuss Hi Russ, it would be great to have you on this exchange to given your expertise.

Hi @firmai,

You've hit most of what I would have suggested:

Patentsview would be nice for the raw xml processing but they only update their bulk files 3 or 4 times a year
You could process the bulk assignments too, patentsview doesn't even try

There are a couple of problems/gaps using the bulk files:

The uspto's bulk cpc file only has assignments for utility patents- no reissues or plant patents (my site has a page about plant cpcs here https://historicip.com/plant-patent-cpc-statistics/) Similar thing on the uspc/ccl's, the uspto hasn't updated their bulk file since 2018 even though plant patents, design patents and reissues of either continue to receive uspc classifications (see https://bulkdata.uspto.gov/data/patent/classification/). You could stand up a database of processed xml files but the classifications just mentioned would at issue, not necessarily current ones.
There is also the misconception that all applications from 2001 on are available in the bulk xml files. Applications that are only made in the US can be suppressed, they won't be in the application xml even if the patent is granted (it will be in the grant xml but the uspto doesn't go back and publish the application xml). See my reply here https://patentsview.org/forum/7/topic/286

I haven't looked into the additional datasets and I haven't tried using peds' massive export, even the delta file scares me!

Depending on what you are trying to do, processing the bulk xml files may be the way to go if you can live with the gaps.

There is also the misconception that all applications from 2001 on are available in the bulk xml files. Applications that are only made in the US can be suppressed, they won't be in the application xml even if the patent is granted (it will be in the grant xml but the uspto doesn't go back and publish the application xml). See my reply here https://patentsview.org/forum/7/topic/286

That is a great heads-up, and I will take note of that going forward. Thanks for your answer, like always I find it extremely helpful.

Hey! Sorry to be late to the party! responses below:

It seems that the current scope of your project is to tease out individual patents, like to get a corpus of patents applied for and received both nationally and internationally for say "microsoft".

That's not entirely correct, and actually this isn't the right tool for that. Nearly everything in this library gets rate-limited if you go over 100 or so records - or is at least very cumbersome to deal with. I use this library nearly daily to help in my professional work to get things like case status updates, or the text of a few patents / prior art to support drafting office action responses. If you want something large enough to call a "corpus" (i.e. for NLP / ML tasks), I'd recommend Google BigQuery's Patent Publication data set. You can either export the whole thing (probably way overkill), or run a query to export a slice of it (date ranges / cpc codes / etc.). And BigQuery is good about providing exports in common data science formats.

A research question that I have is how can one create three rectangular datasets for patent applications, grants, and assignments - needed for up to date machine learning/big data analyses.

It seems that it would be best to work with weekly XML files for applications and grants, and then the daily XML file for assignments and to keep the data up to date by parsing and uploading the parsed data to some database.

XML bulk data is the way to go here. I've actually got a private repo where I've built out all the tooling for this. Throws it in a private MongoDB or PostgreSQL database, and indexes it. Or you can use the BigQuery solution above. It won't give you complete assignment history, but it integrates IFI's assignment information that tries to guess the current assignee based on the assignment history

I also see weekly action files, which makes me wonder how they mediate the relationships between application and grant data, is it some sort of intermediary dataset that keeps you up to date before an application is granted? If so I guess this is also an interesting dataset to keep tabs on.

Yup. This is the Patent Examination Data Set. Among the API's this library supports, my support for PEDS is the most robust, most tested, and most frequently updated, since I use it a ton as a practicing lawyer.

Also, unlike @mustberuss, I have had the courage to download the bulk file and work with it. The private repo I mentioned above is able to parse out the XML to databases at around 150-250 records/core/sec (ones with longer histories take longer). Last time I did it, it was about 20M records, so you're looking at about 100,000 core seconds. My 12 core home PC (w/ PCIe Gen 4 SSD) can do it in ~3-4 hours. The speed is also almost entirely bound by the computer's ability to evaluate XPath queries, which means you can go faster if you omit fields.

I have seen you comment on the @patentpy repository that your solution could replace theirs potentially, the developer here has a solution for grants but not yet for applications.

Is this something that your library provides, and if not where can one possibly find software that has streamlined thie above rectangular dataset creation process.

This library doesn't provide that functionality, but the aforementioned private one does. If you want to take a go at it, I'd recommend using Yankee, which is the core XML parsing logic extracted from that project and patent-client.

Given your expertise in this area, do you think it is worth to also learn more about the USPTO API, or should ones focus be on the bulk files?

This strongly depends on your application. If you're doing something that doesn't need to be up-to-the-minute, like training/fine tuning an LLM on patent data for patent drafting applications, or something more mundane like prior art / invalidity searches, then the text of the patents itself is all you really need. If you are advising clients and need realtime data, you have to look at the API's. And note that there is not API support for everything.

I have noted that there are some additional datasets out there like the maintenance fee and classification datasets, do you think there are some smart ways to merge this info into these three (app, grant, assignmt) dataset.

Yes. Some of that (classification info) is in the BigQuery database. Otherwise, just remember that an application number (which is not a patent number or a published application number) is the closest thing you get to a globally unique ID in the the USPTO world. As long as you have that for all the various data bits, just JOIN on it.

Furthermore, I have seen some APIs somewhat separate from the original APIs like the patent examination system, would you know what that is all about, I also think patentsview can do some of the above, but is the problem not that it is too slow to update versus the fast weekly files above?

Yeah, those systems are too slow. It also helps to keep in mind update schedules:

Application data (PEDS) updates daily
Assignment data (API / Bulk XML) updates daily
Patent Grant data (Bulk XML) updates weekly (on Tuesdays)
Published Patent Application data (Bulk XML) updates weekly (on Thursdays)

Lastely do you think I have missed anything out, is there other datasets that I should consider/find, like citation data?

Can't recommend the Google BigQuery's Patent Publication data set highly enough!

Finally, I'm going to close this issue since I don't want it to show up as an unresolved issue with the library, but please continue to comment if you have any other questions!

Hi @parkerhancock, I barely have a follow-up question, this thread is great and you can see it captures decades of knowledge on the subject. I have just peeked at the GBQ datasets at /patents and saw the latest file is publications_202212. So I guess that is also why you venture towards the bulk files, to close the gap between these unrecorded months?

You also hit it on the head, I am developing an LLM model. If you are interested, I would be happy to share the model weights with you if you can speed up my process in parsing the bulk files (to close the gap between GBQ) as I want to update the model regularly. I lean towards Postgres. Is there any reason why you are also using Mongo too, is that for some unstructured data?

Edit: forgot to say thanks for taking the time, including for open sourcing your software, it's very thoughtful of you.

@firmai I think you have the right idea! So, the publications data set (without the date) is the latest one, and it follows the update cycle for Google Patents - which I believe is quarterly (i.e. the last dated data set would have been 2022Q4, and the current one is 2023Q1). GBQ is usually just easier to use, because it abstracts away the maintenance tasks of trying to keep up my own database.

On the LLM - I would be very interested in what you're working on! As I carry an active prosecution docket - and I love this tech - I would be fascinated to see what you build! At least in my own mind, I think of this data as being maximally useful for patent drafting as fine-tuning input, and for prior art searching as prompt augmentation / search. Of course, lots of other potential applications too!

And as for MongoDB over Postgres, there are basically four reasons:

Data Compression: MongoDB by default applies table-wide compression (well, collection wide). When dealing with datasets this size, and that include a lot of repeated data, that is a huge advantage. I typically can get compression ratios for some datasets in excess of 85%.
Schemaless: I love that I don't have to care about schema. JSON is the lingua franca of the internet, and everything I do is in python (which as you know dicts ~= JSON). And while patent data is highly structured, there are tons of many-to-one and many-to-many relationships (not to mention graph-like tree traversal problems) that make hand-coding SQL schemas a huge pain. (Even in my own codebase, the PostgreSQL version is a conversion layer. I use a version of Yankee schemas that auto-generate JSONSchema files, that then use jsonschema2ddl to generate Postgres tables.)
Data Science: I also think MongoDB is generally better for data science applications than Postgres, because it tends to shine when its processing few - but complex - queries, or doing sequential data access of an entire data set (this is a personal - and potentially controversial - opinion). Postgres, and SQL Databases generally, are usually much better in production environments where they're receiving very large query volumes. MongoDB also has good connector support for big data tools, like Spark, Dask, and PowerBI.
Familiarity: I'm also really familiar with the MongoDB toolset. It's a bit of a learning curve coming from SQL world, where the dialect differences between DB's aren't very large. But once you've onboarded, it's pretty straightforward.

And as for open source - I'm always happy to help! The biggest reason I did it was to spark exactly these kinds of conversations and connections. The world of people doing data science in patents is actually pretty small, but it's such an interesting group of people!

parkerhancock / patent_client

Development/Use-case #100