ip-tools / uspto-opendata-python

A client library for accessing the USPTO Open Data APIs, written in Python.
https://docs.ip-tools.org/uspto-opendata-python/
MIT License

Result returns too many docs #10

Closed: bioSandMan closed this issue 5 years ago

bioSandMan commented 5 years ago

Hello, I have been using the API for about a month now and noticed something different today. The query below returns 451438 records, but it should return only the 269 records associated with the given examiner.

# Peds basic query to check if PEDS is online
from uspto.peds.client import UsptoPatentExaminationDataSystemClient
import pandas as pd
name = 'WILSON, NICHOLAS R'
client = UsptoPatentExaminationDataSystemClient()
expression = "appExamName:{0}".format(name)
result = client.search(expression)
{'numFound': 451438,
 'start': 0,
 'docs': [{'corrAddrCountryName': 'UNITED STATES',
   'applId': '03429712',
   'totalPtoDays': '0',
   'appFilingDate': '1954-05-13T00:00:00Z',
   'appExamName': 'MATZ, DANIEL R',
   'appExamNameFacet': 'MATZ, DANIEL R',
...

I also emailed PEDS. They recently throttled the number of requests they handle, but we were able to get them to increase the limit again. I don't think the problem I'm experiencing is related to their changes, though. Any thoughts?

bioSandMan commented 5 years ago

After some basic testing, it looks like this started happening when line 42 in uspto/util/client.py was commented out.

amotl commented 5 years ago

Dear Christopher,

thanks for writing in. We can confirm the erratic behavior you are observing:

uspto-peds search 'appExamName:WILSON, NICHOLAS R' | jq '.numFound'
451438

Introduction

The change f07ebef you are referring to, which removes the mm parameter from the HTTP request, came from #7 and was introduced just recently. From the behavior you are observing, it seems to have an unfortunate side effect.

Investigation

As the change was made in a hurry in order to support @rahul-gj, I recognize that I should have checked the meaning of this parameter first. Having now looked into the documentation for the Lucene/Solr DisMax query parser, we should take these details about the mm (Minimum Should Match) parameter into consideration:

  1. The documentation on the DisMax query parser says that

    The default value of mm is 100% (meaning that all clauses must match).

  2. However, an answer on Stack Overflow (solr-mm-parameter-of-dismax-parser) says that

    If no mm parameter is specified in the query, or as a default in solrconfig.xml, the effective value of the q.op parameter (either in the query, as a default in solrconfig.xml, or from the 'defaultOperator' option in schema.xml) is used to influence the behavior. So, the default behavior of the mm is determined by q.op parameter. If q.op is effectively AND, then mm=100%; if q.op is OR, then mm=1.
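To make the rule from the quoted answer concrete, here is a minimal sketch in Python. The function name `effective_mm` and its arguments are hypothetical and are not part of uspto-opendata-python or Solr. The sketch only restates the quoted fallback logic and says nothing about how the PEDS backend is actually configured.

```python
from typing import Optional


def effective_mm(mm: Optional[str] = None, q_op: str = "OR") -> str:
    """Return the mm value that applies under the rule quoted above.

    Hypothetical helper for illustration only: an explicit mm wins,
    otherwise the effective q.op decides the fallback.
    """
    if mm is not None:
        # An mm value sent with the request takes precedence.
        return mm
    # Fallback as described in the quoted answer:
    # q.op=AND behaves like mm=100%, q.op=OR behaves like mm=1.
    return "100%" if q_op.strip().upper() == "AND" else "1"
```

Under this reading, a client that stops sending mm hands control over to whatever q.op the server happens to use, which would fit the sudden change in result counts.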

Conclusion

So, we definitely should send the mm parameter from our end in order to control how the Lucene/Solr query parser on the remote search backend processes the query. The appropriate value depends on the character of the query and on the intention of the researcher, and should be chosen so that the client adheres to "do what I mean" principles.

So, when implementing that, querying for numberlists with a search command like

uspto-peds search 'patentNumber:(6583088 6875727 8697602)'

should probably be handled a bit differently.

Thoughts

Let's see whether we can a) determine the mm value heuristically from the query expression, b) add another command line parameter for designating it, or c) just amend the documentation to propose an expression like

uspto-peds search 'patentNumber:(6583088 OR 6875727 OR 8697602)'

for querying numberlists - if this actually would be the right thing to do here.
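Option c) could be sketched as a small helper that builds the explicit OR expression from a list of numbers. The name `numberlist_expression` is hypothetical and not part of the library, and whether explicit OR is the right approach for the PEDS backend remains an open question.

```python
from typing import Iterable


def numberlist_expression(field: str, numbers: Iterable[str]) -> str:
    """Build an expression like 'patentNumber:(A OR B OR C)'.

    Hypothetical helper for illustration only.
    """
    # Normalize to strings and drop empty entries.
    cleaned = [str(n).strip() for n in numbers if str(n).strip()]
    if not cleaned:
        raise ValueError("at least one number is required")
    # Join with an explicit OR so the result should not depend on mm or q.op.
    return "{0}:({1})".format(field, " OR ".join(cleaned))


expression = numberlist_expression("patentNumber", ["6583088", "6875727", "8697602"])
```

The resulting string is the same expression as in the command above and could be passed to `uspto-peds search` unchanged.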

If c) fits the bill, we might even be able to set mm back to its former value of 100% in order to solve your issue while still keeping @rahul-gj happy.

Thanks again for reporting this to us.

With kind regards, Andreas.

amotl commented 5 years ago

After some more investigation, we want to share that a query like this will always return the correct number of results, regardless of the mm value:

uspto-peds search 'appExamName:"WILSON, NICHOLAS R"' | jq '.numFound'
269

The user interface at https://ped.uspto.gov/peds/#/search behaves the same way: it adds quotes to the search string "WILSON, NICHOLAS R" to match it verbatim. By the way, the user interface currently always sets mm=0%.

bioSandMan commented 5 years ago

Based on your observations, I found that this works in my case, where I am using the internal classes directly:

name = "WILSON, NICHOLAS R"
client = UsptoPatentExaminationDataSystemClient()
expression = 'appExamName:"{0}"'.format(name)
result = client.search(expression)
result

Note the double quotes around the placeholder in the format string.
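If the name itself could ever contain a double quote or a backslash, the quoting can be made a bit more robust. The following is a minimal sketch, assuming standard Lucene-style backslash escaping inside a quoted phrase. The helpers `quote_phrase` and `field_phrase` are hypothetical and not part of uspto-opendata-python.

```python
def quote_phrase(value: str) -> str:
    """Wrap a value in double quotes so it is searched as one phrase."""
    # Escape backslashes first, then double quotes, so the escapes
    # added for quotes are not escaped a second time.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"{0}"'.format(escaped)


def field_phrase(field: str, value: str) -> str:
    """Build a field-scoped phrase query such as appExamName:"..."."""
    return "{0}:{1}".format(field, quote_phrase(value))


expression = field_phrase("appExamName", "WILSON, NICHOLAS R")
# The expression can then be passed to client.search(expression) as above.
```

For the examiner name used in this thread, this produces the same expression as the format string above.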

A code change may not be necessary for the sake of @rahul-gj, but perhaps a warning or a note in the user documentation would help?

I appreciate you looking into this.

bioSandMan commented 5 years ago

The usage of quotes around the search expression has been working. Thanks again!

amotl commented 5 years ago

> The usage of quotes around the search expression has been working. Thanks again!

Thanks for letting me know. I've split off #11 and #12 from this issue. Thanks likewise!