prior-art-archive / priorartarchive.org

Prior Art Archive Site
https://priorartarchive.org
GNU General Public License v2.0

Search overview: summarize search issues from the update #32

Open metasj opened 5 years ago

metasj commented 5 years ago

The current query engine uses a domain-specific parser, written by Cisco [in Java]. It parses query strings into Elasticsearch JSON queries, applying filters, priorities, and other sorting functions.

This should be documented cleanly, along with tests for preserving functionality.

Current setup: we have a jar built from their code, deployed as an AWS Lambda function that is triggered whenever anyone searches. That implementation doesn't currently work for all queries.

Prioritize search fixes (#13, #16, &c)

metasj commented 5 years ago

Related: Make CPC searches work. #13

reefdog commented 5 years ago

@metasj Did CPC searches work before migration?

metasj commented 5 years ago

Yes.

metasj commented 5 years ago

Joel -- can you summarize this + break out remaining issues?

slifty commented 5 years ago

@joeltg wanted to bump this issue on your radar! I think either we should close this item or we should break out an explicit list in a comment (I would be glad to make the actual issues).

joeltg commented 5 years ago

The search/query "language" is outlined at https://www.priorartarchive.org/help and we just have to make sure that all of the syntax features mentioned there work as expected. I think the simplest place to test these is as test events on the AWS Lambda management page; there are four simple ones there already and they're easy to create.

I think the parser works on everything; it's just that a lot of the data that should be searchable (CPC codes, dates, descriptions) isn't populated in the Elasticsearch index.

For CPC codes, the output of the query parser on the search string H04L29/06.cpc. looks like

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "cpc": "H04L29/12339"
          }
        },
        {
          "term": {
            "cpc": "H04L29/12339"
          }
        }
      ]
    }
  }
}

which means that it's looking for a cpc property on documents in the Elasticsearch index.
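
To make the consequence concrete, here is a minimal sketch of why such a query returns nothing when documents carry no cpc property. Both function names are hypothetical and are not part of the Cisco parser or of file-parser; the matching check only approximates exact-match `term` semantics and is not how Elasticsearch evaluates queries internally.

```javascript
// Hypothetical helper: builds a bool/must query with one exact-match
// `term` clause per CPC code. This is NOT the Cisco parser's actual logic.
function buildCpcQuery(cpcCodes) {
  return {
    query: {
      bool: {
        must: cpcCodes.map((code) => ({ term: { cpc: code } })),
      },
    },
  };
}

// Hypothetical check approximating `term` semantics: a document matches only
// if every requested code appears verbatim among its `cpc` value(s).
function documentMatches(doc, query) {
  const values = doc.cpc == null ? [] : [].concat(doc.cpc);
  return query.query.bool.must.every((clause) =>
    values.includes(clause.term.cpc)
  );
}

// A document shaped like the ones currently indexed has no `cpc` property,
// so it can never satisfy the clause.
const query = buildCpcQuery(["H04L29/06"]);
console.log(documentMatches({ title: "Example" }, query));
console.log(documentMatches({ title: "Example", cpc: ["H04L29/06"] }, query));
```

The first call reports no match and the second reports a match, which is consistent with the parser being fine and the index being the missing piece.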

The Elasticsearch index is written from src/process.js in file-parser. Right now it writes the properties

const elasticIndex = {
  title,
  text,
  fileUrl,
  organizationId,
  uploadDate: generatedAtTime,
  contentLength: ContentLength,
  contentType: ContentType,
}

and an additional publicationDate and language if Tika finds one. I'm not sure what the expected behavior around searching with dates is (publication date? upload date?).
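
As a rough sketch of that behavior, the document could be assembled like this. The function name and the `meta` and `tika` input shapes are assumptions made for illustration; the real code in src/process.js may be structured differently.

```javascript
// Hypothetical sketch of assembling the indexed document. `meta` and `tika`
// are assumed input shapes, not the actual variables in src/process.js.
function buildElasticDocument(meta, tika = {}) {
  const doc = {
    title: meta.title,
    text: meta.text,
    fileUrl: meta.fileUrl,
    organizationId: meta.organizationId,
    uploadDate: meta.generatedAtTime,
    contentLength: meta.ContentLength,
    contentType: meta.ContentType,
  };
  // The optional fields are only added when a value was extracted,
  // so many documents will have no publicationDate at all.
  if (tika.publicationDate) doc.publicationDate = tika.publicationDate;
  if (tika.language) doc.language = tika.language;
  return doc;
}
```

One thing this makes visible: uploadDate is always present, while publicationDate is not, so a date search based on publicationDate would silently skip any document where extraction found no date.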

Elasticsearch documents have typed fields, so adding a property to a document means updating the schema. The schema for the current index is here.
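
For illustration, a mapping update that adds the missing fields might have a body like the one built below. This is a sketch under assumptions: the helper name is hypothetical, the field types are guesses (`keyword` for exact-match `term` queries on cpc, `date` for publicationDate), and it assumes an Elasticsearch version that supports the `keyword` type. It has not been checked against the current schema.

```javascript
// Hypothetical helper: turns a { fieldName: type } map into the
// { properties: { fieldName: { type } } } body used for a mapping update.
function buildMappingUpdate(fieldTypes) {
  const properties = {};
  for (const [name, type] of Object.entries(fieldTypes)) {
    properties[name] = { type };
  }
  return { properties };
}

// Assumed field types, not taken from the existing index schema.
const mappingBody = buildMappingUpdate({
  cpc: "keyword",
  publicationDate: "date",
});
console.log(JSON.stringify(mappingBody, null, 2));
```

The mapping only makes the fields searchable going forward; documents already in the index would still need to be re-indexed with the new properties populated.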

So I think the remaining issues are

I think the only actual question is whether dates are upload dates or publication dates, and I suspect it should be publication date. This would mean changing the elastic schema and file-parser to use a date field instead of publicationDate.

with a long-term goal of