Search overview: summarize search issues from the update

metasj commented 5 years ago

The current query engine uses a domain-specific parser, written by Cisco [in Java]. It parses query strings into Elasticsearch JSON queries, applying filters, priorities, and other sorting functions.

This should be documented cleanly, along with tests for preserving functionality.

Joel has a running gist of config issues. Take over + update this. In future: make this repeatable (config + other tweaks).

Current setup: We have a jar built from their code, deployed as a lambda trigger whenever anyone searches. That implementation doesn't work atm for all queries.

We've updated to a newer version of Elastic; some things they rely on no longer work the same way (special character + filter processing).
The previous Elastic server config was updated manually through the web UI; there is no config file. We may need to have a conf call w/ Cisco to extract what config details they used / we need.
A few related aspects of the v1 PAA architecture are still running on their machines, including Kafka queues.
One aspect involves round-trips to Google: they query the latest documents, run them all through a CPC-code generator, and spit out a single large file hosted on their server; we read this + add the codes to our database.

Prioritize search fixes (#13, #16, &c)

metasj commented 5 years ago

Related: Make CPC searches work. #13

reefdog commented 5 years ago

@metasj Did CPC searches work before migration?

metasj commented 5 years ago

Yes.

metasj commented 5 years ago

Joel -- can you summarize this + break out remaining issues?

slifty commented 5 years ago

@joeltg wanted to bump this issue on your radar! I think either we should close this item or we should break out an explicit list in a comment (I would be glad to make the actual issues).

joeltg commented 5 years ago

The search/query "language" is outlined at https://www.priorartarchive.org/help and we just have to make sure that all of the syntax features mentioned there work as expected. I think the simplest place to test these is as test events on the AWS Lambda management page; there are four simple ones there already and they're easy to create.

I think the parser works on everything - it's just that a lot of the data that should be searchable (CPC codes, dates, descriptions) isn't populated in the elasticsearch index.

For CPC codes, the output of the query parser on the search string H04L29/06.cpc. looks like

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "cpc": "H04L29/12339"
          }
        },
        {
          "term": {
            "cpc": "H04L29/12339"
          }
        }
      ]
    }
  }
}

which means that it's looking for a property cpc in the elasticsearch index.
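To make the shape of that output easier to reason about, here is a minimal sketch that builds the same kind of query in JavaScript. It is not the real parser (which is the Java jar); buildCpcQuery and its field parameter are hypothetical names used only for illustration.

```javascript
// Hypothetical helper, NOT the real (Java) query-parser. It only reproduces
// the bool/must/term shape shown above for a list of CPC codes.
function buildCpcQuery(codes, field = "cpc") {
  return {
    query: {
      bool: {
        // One exact-match term clause per code; every clause must match.
        must: codes.map((code) => ({ term: { [field]: code } })),
      },
    },
  };
}

// The field name is a parameter so the same shape can be tried against
// either "cpc" or "cpcCodes".
console.log(JSON.stringify(buildCpcQuery(["H04L29/06"], "cpcCodes"), null, 2));
```

A term clause only matches if the named field actually exists on the indexed documents, so a query like this returns nothing when the field is missing or has a different name.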

Elasticsearch is written from src/process.js in file-parser. Right now it writes the properties

const elasticIndex = {
  title,
  text,
  fileUrl,
  organizationId,
  uploadDate: generatedAtTime,
  contentLength: ContentLength,
  contentType: ContentType,
}

and an additional publicationDate and language if Tika finds one. I'm not sure what the expected behavior around searching with dates is (publication date? upload date?).
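As a rough sketch of how optional properties could be attached to that document, assuming plain objects: buildElasticDocument and the extras argument are hypothetical, and the real logic in src/process.js may look quite different.

```javascript
// Sketch only: the real code lives in src/process.js in file-parser.
// `buildElasticDocument` and `extras` are hypothetical names.
function buildElasticDocument(base, extras = {}) {
  const doc = { ...base };
  for (const [key, value] of Object.entries(extras)) {
    // Only copy values that were actually found, so the index never
    // stores null or undefined for an optional property.
    if (value !== undefined && value !== null) {
      doc[key] = value;
    }
  }
  return doc;
}

// Example values are made up for illustration.
const doc = buildElasticDocument(
  { title: "Example", text: "...", fileUrl: "https://example.org/a.pdf" },
  { publicationDate: "2018-03-01", language: undefined, cpc: ["H04L29/06"] }
);
console.log(Object.keys(doc));
```

The same pattern would cover a description or CPC codes once those are available, provided the index schema knows about the fields.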

Elasticsearch documents have typed fields, so adding a property to a document means updating the schema. The schema for the current index is here.
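A small sketch of what "updating the schema" could look like as data, assuming the mapping is a plain properties object. addFields is a hypothetical helper, and "keyword" for cpc is an assumption (chosen because term queries match exact, unanalyzed values). How the result is sent to the cluster depends on the Elasticsearch version in use.

```javascript
// Hypothetical helper: merges new field definitions into an existing
// "properties" object and returns a mapping body.
function addFields(existingProperties, newFields) {
  const merged = { ...existingProperties };
  for (const [name, definition] of Object.entries(newFields)) {
    const current = merged[name];
    // Elasticsearch does not allow changing the type of an existing field,
    // so fail early instead of producing a mapping the server would reject.
    if (current && current.type !== definition.type) {
      throw new Error(
        `Field "${name}" is already mapped as ${current.type}; changing it requires a reindex`
      );
    }
    merged[name] = definition;
  }
  return { properties: merged };
}

// Existing fields here are a made-up subset, not the real schema.
const mapping = addFields(
  { title: { type: "text" }, text: { type: "text" } },
  { description: { type: "text" }, cpc: { type: "keyword" } }
);
console.log(JSON.stringify(mapping, null, 2));
```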

So I think the remaining issues are

[ ] Update the elasticsearch config schema to index "description": {"type": "text"}
[ ] Either update the elasticsearch schema to rename cpcCodes to cpc, or update and re-deploy the query-parser to parse CPC filters into a cpcCodes query term. The first one is probably easier. This was my mistake; I'm sorry.
[ ] Write a description to elasticsearch whenever a user updates it.
[ ] Write a cpc to elasticsearch whenever we get new ones from Google.
[ ] Figure out what expectations around searching/filtering for dates are, and then implement it 🙃

I think the only actual question is whether dates are upload dates or publication dates, and I suspect it should be publication date. This would mean changing the elastic schema and file-parser to use a date field instead of publicationDate.
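If date filtering ends up targeting a single date field, the filter itself would probably be an Elasticsearch range query. A minimal sketch, assuming ISO date strings; buildDateRangeFilter is a hypothetical name and the field is left as a parameter because the final field name is still undecided.

```javascript
// Hypothetical helper: builds an Elasticsearch range clause for a date field.
// Either bound may be omitted to get an open-ended range.
function buildDateRangeFilter(field, from, to) {
  const range = {};
  if (from) range.gte = from; // inclusive lower bound
  if (to) range.lte = to; // inclusive upper bound
  return { range: { [field]: range } };
}

// Example: everything published in 2018, using the current field name.
console.log(
  JSON.stringify(buildDateRangeFilter("publicationDate", "2018-01-01", "2018-12-31"))
);
```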

Longer term, there is one more item:

[ ] Rewrite the query parser

prior-art-archive / priorartarchive.org

Search overview: summarize search issues from the update #32