Closed bioSandMan closed 5 years ago
Some basic testing suggests this started happening when you commented out line 42 in uspto/util/client.py.
Dear Christopher,
thanks for writing in. We can confirm the erratic behavior you are observing:
uspto-peds search 'appExamName:WILSON, NICHOLAS R' | jq '.numFound'
451438
The change f07ebef you are referring to, which removes the mm parameter from the HTTP request, came from #7 and was introduced just recently. From the behavior you are observing, I recognize that it seems to have an unfortunate side effect.
As the change was made in a hurry in order to support @rahul-gj, I now recognize that I should have checked the meaning of this parameter first. After looking into the documentation about the Lucene/Solr DisMax query parser, we should take these details about the mm (Minimum Should Match) parameter into consideration:
While the documentation on DisMax query parser says that
The default value of mm is 100% (meaning that all clauses must match).
An answer on stackoverflow (solr-mm-parameter-of-dismax-parser) says that
If no mm parameter is specified in the query, or as a default in solrconfig.xml, the effective value of the q.op parameter (either in the query, as a default in solrconfig.xml, or from the 'defaultOperator' option in schema.xml) is used to influence the behavior. So, the default behavior of the mm is determined by q.op parameter. If q.op is effectively AND, then mm=100%; if q.op is OR, then mm=1.
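To make the quoted rule concrete, here is a minimal Python sketch of how the effective mm value would be resolved. It only restates the Stack Overflow answer quoted above and is not taken from Solr or from this library; the function name `effective_mm` is hypothetical.

```python
def effective_mm(mm=None, q_op="OR"):
    """Resolve the effective mm value as described in the quoted answer.

    Hypothetical helper for illustration only; it mirrors the quoted
    rule and is not part of Solr or of the uspto client library.
    """
    # An explicitly supplied mm always wins.
    if mm is not None:
        return mm
    # Otherwise the effective q.op decides: AND -> 100%, OR -> 1.
    return "100%" if q_op.upper() == "AND" else "1"
```

Under this reading, a request without mm depends entirely on the q.op configured on the remote backend, which the client cannot see. Sending mm explicitly removes that dependency.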
So, we definitely should send the mm parameter from our end in order to control the query processing behavior of the Lucene/Solr query parser on the remote search backend. The appropriate value should be determined by the character of the query, or rather by the intention of the researcher, and should be populated in a way that adheres to "do what I mean" principles.
So, when implementing that, querying for numberlists with a search command like
uspto-peds search 'patentNumber:(6583088 6875727 8697602)'
should probably be handled a bit differently.
Let's see if we can a) determine the mm value heuristically from the query expression, b) add another command line parameter for designating it, or even c) just amend the documentation to propose an expression like
uspto-peds search 'patentNumber:(6583088 OR 6875727 OR 8697602)'
for querying numberlists - if this actually would be the right thing to do here.
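For callers who build such expressions programmatically, option c) could look roughly like the Python sketch below. The helper name `numberlist_expression` is hypothetical and not part of the library. Whether explicit OR operators are the right approach for the PEDS backend is still the open question raised above.

```python
def numberlist_expression(field, numbers):
    """Build a query matching any of the given numbers in one field.

    Hypothetical helper for illustration; joins the values with explicit
    OR operators so the result does not depend on the mm setting.
    """
    terms = [str(number).strip() for number in numbers]
    if not terms:
        raise ValueError("at least one number is required")
    return "{0}:({1})".format(field, " OR ".join(terms))


expression = numberlist_expression("patentNumber", [6583088, 6875727, 8697602])
```

The resulting string is the same expression as in the command line example and can be passed to the search command or to the client's search method.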
If c) fits the bill, we might even be able to set mm back to its former value of 100% in order to solve your issue while still keeping @rahul-gj happy.
Thanks again for reporting this to us.
With kind regards, Andreas.
After some more investigation, we want to share that a query like this will always return the correct number of results, regardless of the mm value.
uspto-peds search 'appExamName:"WILSON, NICHOLAS R"' | jq '.numFound'
269
The user interface at https://ped.uspto.gov/peds/#/search behaves the same way: it adds quotes to the search string "WILSON, NICHOLAS R" to make it verbatim. By the way, the user interface currently always sets mm=0%.
Based on your observations I found this works in my case where I am taking advantage of the internal classes:
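from uspto.peds.client import UsptoPatentExaminationDataSystemClient

name = "WILSON, NICHOLAS R"
client = UsptoPatentExaminationDataSystemClient()
# Wrap the value in double quotes so it is matched as a verbatim phrase.
expression = 'appExamName:"{0}"'.format(name)
result = client.search(expression)
result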
Note the double quotes around the variable for the format expression.
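If the value itself might contain a double quote or a backslash, the quoting can be wrapped in a small helper. This is only a sketch: `phrase_expression` is a hypothetical name, and it assumes that the backend accepts backslash escapes inside a quoted phrase, as the Lucene/Solr query syntax does.

```python
def phrase_expression(field, value):
    """Build a field query that matches the value as a verbatim phrase.

    Hypothetical helper for illustration; escapes backslashes and double
    quotes before wrapping the value in double quotes.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '{0}:"{1}"'.format(field, escaped)


expression = phrase_expression("appExamName", "WILSON, NICHOLAS R")
```

For the examiner name used in this thread the helper produces the same expression as the manual format call.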
A change request to the code may not be necessary for the sake of @rahul-gj but perhaps a warning or something in the user doc?
I appreciate you looking into this.
The usage of quotes around the search expression has been working. Thanks again!
Thanks for letting me know. I've diverted #11 and #12 from here. Thanks likewise!
Hello, I have been using the API for about a month now and I noticed something different today. Using the below query returns 451438 records and should only be returning 269 records associated to the given examiner.
I also emailed PEDS. They recently throttled the number of requests they could handle but we were able to get them to increase it again. I don't think the problem I'm experiencing is associated to their changes tho. Any thoughts?