parkerhancock / patent_client

A collection of ORM-style clients to public patent data

BUG: Public Search Returns a Length of 500 without raising an appropriate warning. #134

Closed kepiej closed 5 months ago

kepiej commented 9 months ago

Thanks for developing this nice and super useful library! :) I'm trying to use the Patent Public Search Basic to retrieve results and find the total number of results as follows:

from patent_client import PublicSearchBiblio
result = PublicSearchBiblio.objects.filter(query="lignin.TTL.")
print(len(result))

This prints 500. However, when I try to collect all results using

df = result.values("publication_number", "publication_date", "app_filing_date", "patent_title", "applicant_names", "type", "assignee_names").to_pandas()

then df contains 2249 rows! The same happens for different search terms.

Is this a bug or am I doing something wrong here?

parkerhancock commented 9 months ago

Hi @kepiej! Thanks for the question, and thanks for your interest in the library!

So, the USPTO's Public Search system is based on Apache Solr, which only returns accurate total result numbers if the quantity is less than 500. What that means is that if the len function returns 500, it should be interpreted as "There are 500 or more results," rather than "There are 500 results." It should, however, give accurate counts for any query that returns fewer than 500 results.

I'll keep this open, and in a future version, have it raise a warning if you use the Public Search API and the result is >= 500.

Thanks!
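
A minimal sketch of the kind of warning described above, written as a user-side wrapper and not as the library's actual implementation. `checked_len` and `RESULT_CAP` are hypothetical names, and the value 500 is simply the cap reported in this thread. The helper works on any object that supports `len`.

```python
import warnings

# Cap taken from the discussion above; treating it as a fixed constant is an assumption.
RESULT_CAP = 500


def checked_len(results, cap=RESULT_CAP):
    """Return len(results), warning when the value has reached the cap.

    Hypothetical helper for illustration; not part of patent_client.
    """
    n = len(results)
    if n >= cap:
        warnings.warn(
            f"Reported length is {n}; read this as '{cap} or more results'.",
            UserWarning,
            stacklevel=2,
        )
    return n
```

Calling `checked_len(result)` on the query from the first comment would then emit a `UserWarning` instead of silently returning 500.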

kepiej commented 8 months ago

Thanks @parkerhancock for the clarification!

I also noticed that the len function is inaccurate even for results < 500. Consider this example:

from patent_client import PublicSearchBiblio
result = PublicSearchBiblio.objects.filter(query="(solvent NEAR1 recovery).TTL.").order_by("-app_filing_date")
print(len(result))

This yields 291. However, when actually fetching the data as a pandas dataframe the length is much larger:

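from patent_client import PublicSearchBiblio

# Parentheses let the method chain span multiple lines.
df = (
    PublicSearchBiblio.objects.filter(query="(solvent NEAR1 recovery).TTL.")
    .order_by("-app_filing_date")
    .values(
        "publication_number",
        "patent_title",
        "applicant_names",
        "assignee_names",
        "publication_date",
        "app_filing_date",
        "type",
    )
    .to_pandas()
)
print(df.shape[0])  # number of rows actually fetched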

This returns a pandas dataframe with 445 rows!

Any idea what's going on here?
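
For anyone who wants to measure the gap without pandas, here is a rough sketch that compares the reported length with the number of items actually yielded by iteration. It assumes the result object supports both `len` and iteration, which this thread does not confirm for `patent_client`. `compare_counts` and `UnderReportingResults` are hypothetical names, and the stand-in class only mimics the symptom using the numbers quoted above; it says nothing about the cause.

```python
def compare_counts(results):
    """Return (reported, fetched): len(results) and the number of items iterated."""
    reported = len(results)
    fetched = sum(1 for _ in results)
    return reported, fetched


class UnderReportingResults:
    """Stand-in result set whose __len__ disagrees with iteration (illustration only)."""

    def __init__(self, rows, reported):
        self._rows = list(rows)
        self._reported = reported

    def __len__(self):
        # Pretend the backend reports this number regardless of the real row count.
        return self._reported

    def __iter__(self):
        return iter(self._rows)


# 291 and 445 are the figures quoted in this thread, reused here as dummy data.
reported, fetched = compare_counts(UnderReportingResults(range(445), reported=291))
print(reported, fetched)
```

If the two numbers differ for a real query, the iterated count is the one that matches what ends up in the dataframe.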

Akshit-Carboledger commented 6 months ago

Hi, when I run the same code, it throws this error:

An error occurred: Client error '401 Unauthorized' for url 'https://ppubs.uspto.gov/dirsearch-public/searches/searchWithBeFamily'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/401

Is there any additional configuration required? @parkerhancock

parkerhancock commented 5 months ago

This should be fixed in v5. If you keep having issues, let me know!