prior-art-archive / priorartarchive.org

Prior Art Archive Site
https://priorartarchive.org
GNU General Public License v2.0

Inability to search by cpc codes #13

Open NupurBharadwaj opened 5 years ago

NupurBharadwaj commented 5 years ago

The search syntax provides a way to filter by CPC code, but it fails with a significant error under some circumstances, both on production and (we assume) on dev, although dev does not have CPC codes (see #10). We believe this is a problem with the query parser (the jar referenced in #32).

slifty commented 5 years ago

There are two issues at play here:

Issue 1 (now captured in issue #10)

We don't have CPC code data in the dev database. We need to set up the URL route that Google pings every night to ask for a site map. They would then post their results to a place that we would check.

We will provide a single URL to Google that serves the site map. Google will re-parse / index the entire corpus through their classifier and then host the results at a URL they tell us. We then use that Google-provided URL to populate our own CPC index.
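A minimal sketch of what the site map route could return, assuming Google expects a standard sitemaps.org XML document (the thread does not confirm the format). The function name `build_sitemap` and the example document URLs are hypothetical.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(document_urls):
    """Return a sitemaps.org-style XML string listing every document URL.

    ASSUMPTION: one <url><loc> entry per document is enough for the crawler;
    the format Google actually expects is not stated in this issue.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in document_urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical document URLs, for illustration only.
    print(build_sitemap(["https://example.org/doc/1", "https://example.org/doc/2"]))
```

The route handler would only need to collect the URLs of the whole corpus and return this string with an XML content type.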

We are not currently generating the site map on dev / v2. We are also not ingesting anything from Google on dev / v2 (since we aren't giving them anything, they can't give us anything back).

v1 is generating the site map somehow / somewhere. The code is in master, but we have to find what is driving it. @joeltg is going to follow up.

v1 is also ingesting the URL from Google (see Slack for the sample JSON). This is being done by Cisco, and needs to be implemented in v2. It sounds like the JSON includes our primary keys already, so this mapping should be straightforward.

We would want to ingest the JSON on a (daily) cadence and always update to the latest. No need to keep track of changes / historic data.
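The ingestion described above could look roughly like the sketch below. The JSON shape is an assumption (the real sample is in Slack and is not reproduced here), and the names `parse_cpc_payload`, `refresh_cpc_index`, and `store` are hypothetical. The index is replaced wholesale on each run, matching the "no historic data" requirement.

```python
import json
from urllib.request import urlopen


def parse_cpc_payload(payload_text):
    """Map each document's primary key to its CPC codes.

    ASSUMED shape (hypothetical, not taken from the real sample):
    [{"id": "<our primary key>", "cpc": ["H04L29/12339", ...]}, ...]
    """
    index = {}
    for record in json.loads(payload_text):
        # Deduplicate and sort so repeated runs produce identical output.
        index[record["id"]] = sorted(set(record.get("cpc", [])))
    return index


def refresh_cpc_index(results_url, store):
    """Replace `store` (a dict) with the latest results; no history is kept."""
    with urlopen(results_url) as response:
        latest = parse_cpc_payload(response.read().decode("utf-8"))
    store.clear()
    store.update(latest)
    return store
```

A daily scheduled job would call `refresh_cpc_index` with the results URL Google provides. In v2 the in-memory dict would be swapped for whatever backs the CPC index.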

Issue 2

Cisco has developed their own syntax for PrArAr. Part of that syntax is a way to filter by CPC code. That's documented at the bottom of the help page (the very very very bottom).

This has a significant error under some circumstances even on production; there may be a bug in the parser itself (the jar referenced in #32). It seems to be working right now (might have been fixed). This part of the issue may be part of #32.

Given all this detail, it is likely that this issue should be split into multiple issues.

joeltg commented 5 years ago

Some notes:

This is probably going to require a fix in the (Java) query parser. For reference, here are the parsed Query DSL objects, respectively:

{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              {
                "term": {
                  "text": "atmel"
                }
              },
              {
                "term": {
                  "cpc": "H04L29/12339"
                }
              }
            ]
          }
        },
        {
          "span_near": {
            "clauses": [
              {
                "term": {
                  "text": "atmel"
                }
              },
              {
                "term": {
                  "cpc": "H04L29/12339"
                }
              }
            ],
            "slop": 2147483647,
            "in_order": false
          }
        }
      ]
    }
  }
}

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "cpc": "H04L29/12339"
          }
        },
        {
          "term": {
            "cpc": "H04L29/12339"
          }
        }
      ]
    }
  }
}
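Two things stand out in the objects above: the first wraps `text` and `cpc` clauses in one `span_near`, and the second repeats the same `cpc` term twice. Lucene's span queries expect all clauses to target a single field, so the mixed-field `span_near` is a plausible suspect, though the thread does not confirm it as the cause. The sketch below is not the parser fix (that would live in the Java jar); it is a small hypothetical checker, `find_parser_oddities`, that flags both patterns in a parsed Query DSL dict.

```python
BOOL_LIST_KEYS = ("must", "should", "filter", "must_not")


def find_parser_oddities(node, path=""):
    """Walk a Query DSL dict and report two suspicious patterns:
    span_near clauses on different fields, and duplicate bool clauses."""
    problems = []
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else key
            if key == "span_near" and isinstance(value, dict):
                # Collect the field name used by every clause.
                fields = {
                    field
                    for clause in value.get("clauses", [])
                    for inner in clause.values()
                    if isinstance(inner, dict)
                    for field in inner
                }
                if len(fields) > 1:
                    problems.append(
                        f"{child_path}: clauses span fields {sorted(fields)}"
                    )
            if key in BOOL_LIST_KEYS and isinstance(value, list):
                seen = []
                for clause in value:
                    if clause in seen:
                        problems.append(f"{child_path}: duplicate clause {clause}")
                    seen.append(clause)
            problems.extend(find_parser_oddities(value, child_path))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            problems.extend(find_parser_oddities(item, f"{path}[{i}]"))
    return problems
```

Run against the two objects above, this reports the mixed-field `span_near` in the first and the repeated `cpc` term in the second.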