Open metasj opened 5 years ago
Related: Make CPC searches work. #13
@metasj Did CPC searches work before migration?
Yes.
Joel -- can you summarize this + break out remaining issues?
@joeltg wanted to bump this issue on your radar! I think either we should close this item or we should break out an explicit list in a comment (I would be glad to make the actual issues).
The search/query "language" is outlined at https://www.priorartarchive.org/help and we just have to make sure that all of the syntax features mentioned there work as expected. I think the simplest place to test these is as test events on the AWS Lambda management page; there are four simple ones there already and they're easy to create.
I think the parser works on everything - it's just that a lot of the data that should be searchable (CPC codes, dates, descriptions) isn't populated in the elasticsearch index.
For CPC codes, the output of the query parser on the search string H04L29/06.cpc. looks like
    {
      "query": {
        "bool": {
          "must": [
            {
              "term": {
                "cpc": "H04L29/12339"
              }
            },
            {
              "term": {
                "cpc": "H04L29/12339"
              }
            }
          ]
        }
      }
    }
which means that it's looking for a property cpc in the elasticsearch index.
The elasticsearch index is written from src/process.js in file-parser. Right now it writes the properties
    const elasticIndex = {
      title,
      text,
      fileUrl,
      organizationId,
      uploadDate: generatedAtTime,
      contentLength: ContentLength,
      contentType: ContentType,
    }
and an additional publicationDate and language if Tika finds one. I'm not sure what the expected behavior around searching with dates is (publication date? upload date?).
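To make the "if Tika finds one" behavior concrete, here is a minimal sketch of how the optional fields could be attached. The helper name buildElasticDocument and the shape of the Tika metadata object are assumptions for illustration; the real logic lives in src/process.js and may look different.

```javascript
// Hypothetical helper, not the actual code in src/process.js.
// `base` holds the always-present properties; `tikaMetadata` is an assumed
// shape for whatever Tika extracted from the file.
function buildElasticDocument(base, tikaMetadata = {}) {
  const doc = { ...base };
  // Only add the optional fields when Tika actually reported a value.
  if (tikaMetadata.publicationDate) {
    doc.publicationDate = tikaMetadata.publicationDate;
  }
  if (tikaMetadata.language) {
    doc.language = tikaMetadata.language;
  }
  return doc;
}

// Example with made-up values: Tika found a language but no publication date.
const exampleDoc = buildElasticDocument(
  {
    title: "Example",
    text: "...",
    fileUrl: "https://example.org/a.pdf",
    organizationId: "org-1",
    uploadDate: "2019-01-01T00:00:00Z",
    contentLength: 1024,
    contentType: "application/pdf",
  },
  { language: "en" }
);
console.log(Object.keys(exampleDoc));
```

Documents built this way simply lack publicationDate when Tika finds nothing, so a date filter on that field would silently skip them.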
Elasticsearch documents have typed fields, so adding a property to a document means updating the schema. The schema for the current index is here.
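As a rough illustration of what "typed fields" implies, the sketch below pairs an assumed mapping with a small check for document properties the mapping does not declare. The field types here are guesses based on the properties listed above, not a copy of the real schema, and unmappedFields is a hypothetical helper.

```javascript
// Assumed mapping; the types are guesses, not taken from the real index schema.
const assumedMapping = {
  properties: {
    title: { type: "text" },
    text: { type: "text" },
    fileUrl: { type: "keyword" },
    organizationId: { type: "keyword" },
    uploadDate: { type: "date" },
    publicationDate: { type: "date" },
    language: { type: "keyword" },
    contentLength: { type: "long" },
    contentType: { type: "keyword" },
  },
};

// Hypothetical check: list document properties that the mapping does not declare.
function unmappedFields(doc, mapping) {
  return Object.keys(doc).filter(
    (key) => !Object.prototype.hasOwnProperty.call(mapping.properties, key)
  );
}

console.log(unmappedFields({ title: "Example", description: "..." }, assumedMapping));
// logs [ 'description' ]
```

A check like this could run before indexing to flag properties that still need a schema update.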
So I think the remaining issues are

- Add "description": {"type": "text"} to the elasticsearch schema.
- Either rename cpcCodes to cpc, or update and re-deploy the query-parser to parse CPC filters into a cpcCodes query term. The first one is probably easier. This was my mistake; I'm sorry.
- Write description to elasticsearch whenever a user updates it.
- Write cpc to elasticsearch whenever we get new ones from Google.

I think the only actual question is whether dates are upload dates or publication dates, and I suspect it should be publication date. This would mean changing the elastic schema and file-parser to use a date field instead of publicationDate.
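The cpcCodes versus cpc mismatch can be shown with a small in-memory sketch. matchesTerm is a simplified stand-in for how an Elasticsearch term query treats a single field (exact match, any element of an array), and renameCpcField is a hypothetical version of the "rename" option; neither is code from the project.

```javascript
// Simplified stand-in for a term query: exact match against one field,
// where an array field matches if any of its values is equal.
function matchesTerm(doc, field, value) {
  const stored = doc[field];
  if (stored === undefined) return false;
  return Array.isArray(stored) ? stored.includes(value) : stored === value;
}

// Hypothetical indexing-side fix: store the codes under `cpc` instead of `cpcCodes`.
function renameCpcField({ cpcCodes, ...rest }) {
  return cpcCodes === undefined ? rest : { ...rest, cpc: cpcCodes };
}

// Made-up document that stores its codes under the old name.
const indexed = { title: "Example", cpcCodes: ["H04L29/06"] };

console.log(matchesTerm(indexed, "cpc", "H04L29/06")); // false
console.log(matchesTerm(renameCpcField(indexed), "cpc", "H04L29/06")); // true
```

The other option, changing the query-parser to emit cpcCodes, would leave the documents alone and change the field name on the query side instead.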
with a long-term goal of
The current query engine uses a domain-specific parser, written by Cisco [in Java]. It parses query strings into JSON elastic queries, applying filters, priorities, and other sorting functions.
This should be documented cleanly, along with tests for preserving functionality.
Current setup: We have a jar built from their code, deployed as a lambda trigger whenever anyone searches. That implementation doesn't work atm for all queries.
Prioritize search fixes (#13, #16, &c)