parkerhancock / patent_client

A collection of ORM-style clients to public patent data

Query limit #116

Closed federiconuta closed 11 months ago

federiconuta commented 11 months ago

Hi, and thanks for the package. I am trying to query the API via the multiprocessing package, and I noticed that at a certain point (i.e. after a certain number of searches) the processes stop making progress. Is there a query limit? If so, how many patents can be searched? Is it possible to search more patents at once?

Thank you

parkerhancock commented 11 months ago

If multiprocessing is involved, you're probably misusing this package. It is not intended for bulk downloads - only for reasonably limited queries (i.e. tens to hundreds of records).

What API are you getting rate limited on? Chances are, there's a better solution that involves USPTO Bulk Data Products and not PatentClient. Especially if you're trying to pull from Patents Public Search, which is the flakiest of the bunch.

federiconuta commented 11 months ago

Hi @parkerhancock. Thank you for the reply. To give a detailed answer, I would like to share the core of my code where I am using PatentClient. The patents I am referring to in the code are 1.2 million patents retrieved via PATSTAT:

import numpy as np
from patent_client import Assignment

def fetching_data_cached(self, patent):
    try:
        # Query the USPTO Assignment API for this patent number
        assignments = Assignment.objects.filter(patent_number=patent)
        assignments_df = assignments.to_pandas()

        rows = []
        for _, row in assignments_df.iterrows():
            trans_date = row.get('transaction_date', np.nan)
            trans_id = row.get('id', np.nan)
            # Guard against a missing or empty assignees list
            assignees = row.get('assignees')
            if isinstance(assignees, list) and assignees and 'name' in assignees[0]:
                assignee = assignees[0]['name']
            else:
                assignee = np.nan
            rows.append((patent, trans_date, trans_id, assignee))

        return rows
    except Exception as e:
        print(f"Error processing patent {patent}: {e}")
        return []

In essence, the API is called only in assignments = Assignment.objects.filter(patent_number=patent). After a few trials, I noticed that there may be an API limit that kicks in after a certain number of patents have been downloaded. I was wondering whether you are aware of this limit, and whether I can put a sleep somewhere so the download can continue after the limit is hit - something like the sketch below - or, better, whether PatentClient provides a solution for this.
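
A minimal sketch of what I mean, assuming a fixed per-query delay (THROTTLE_SECONDS is a made-up value, since the actual limit is unknown):

import time

THROTTLE_SECONDS = 1.0  # guessed delay; the real rate limit is undocumented

def fetching_data_throttled(self, patent):
    # Same query as above, but pause after each call to stay under
    # whatever per-client rate limit the API may enforce
    rows = self.fetching_data_cached(patent)
    time.sleep(THROTTLE_SECONDS)
    return rows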

Thank you

parkerhancock commented 11 months ago

There very likely is a limit, but it isn't documented in the official USPTO API, and I don't know specifically what it is.

If you want to do that kind of bulk work with assignments, I'd suggest building your own database from the XML bulk files. The entire assignment database isn't that large, and the schema isn't that complex. You can use yankee to do the XML parsing relatively simply - I've actually built a parser for that exact purpose, though it isn't open source. The entirety of the data supporting the USPTO Assignment API is available here:

https://bulkdata.uspto.gov/data/patent/assignment/
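
To give a rough idea of the approach, here is a minimal streaming loader using only the Python standard library (rather than yankee); the element names are placeholders and need to be checked against the actual assignment XML schema:

import sqlite3
import xml.etree.ElementTree as ET

def load_assignments(xml_path, db_path="assignments.db"):
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS assignment "
        "(patent_number TEXT, recorded_date TEXT, assignee TEXT)"
    )
    # Stream the file element-by-element so the whole XML never sits in memory
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag != "patent-assignment":  # placeholder record tag
            continue
        con.execute(
            "INSERT INTO assignment VALUES (?, ?, ?)",
            (
                elem.findtext(".//document-id/doc-number"),
                elem.findtext(".//recorded-date/date"),
                elem.findtext(".//patent-assignee/name"),
            ),
        )
        elem.clear()  # release the parsed element to keep memory flat
    con.commit()
    con.close()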

parkerhancock commented 11 months ago

Also, I just updated the library with support for the BDSS API, so you could in theory use those new features to build that out.

from patent_client.uspto.bulk_data.model import File

# "PASYR" and "PASDL" are the short names of the assignment backfile
# and frontfile bulk data products, respectively
backfile = list(File.objects.filter_by_short_name("PASYR"))
frontfile = list(File.objects.filter_by_short_name("PASDL"))
for file in backfile + frontfile:
    file.download()

See also: BDSS Docs

federiconuta commented 11 months ago

@parkerhancock thank you very much for the updates and the kind reply.