Closed by mietcls 1 year ago
It's possible, at least in Elasticsearch, to construct queries that return records in which a given field is missing.
I ran a quick test and was able to retrieve records without a department and records without a publication year. That is, these records did not have those fields at all. So, it is possible. The challenge here is sorting out the semantics, e.g. what happens when someone doesn't fill out a form field and how that gets indexed, or how this will apply to older data.
@netsensei thanks! Do we need "research" issues for this or should we try implementing it?
I'd go for "research" and I would also like some input from @nicolasfranck on this one. :-)
Luckily, because of how Go encodes JSON, fields with an empty value are not serialized to JSON (if they are marked with omitempty, which is almost always the case).
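To illustrate the omitempty behaviour mentioned above, here is a minimal sketch. The Publication type and its field names are assumptions made up for the example, not the project's actual model. Empty fields tagged with omitempty are left out of the output, so the indexed document simply has no such field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Publication is a hypothetical record type; the field names are assumptions.
type Publication struct {
	Title      string   `json:"title"`
	Year       string   `json:"year,omitempty"`
	Department []string `json:"department,omitempty"`
}

// encode marshals a publication and returns the JSON as a string.
func encode(p Publication) string {
	b, err := json.Marshal(p)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Year and Department are empty, so they are omitted entirely.
	fmt.Println(encode(Publication{Title: "Untitled"}))
	// Both fields are set, so both are serialized.
	fmt.Println(encode(Publication{Title: "Dated", Year: "2020", Department: []string{"LW01"}}))
}
```

Because the key is absent rather than set to an empty string, an exists query or a missing aggregation on that field should behave as expected.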
Note: if we are adding a facet with several checkboxes (like we did before), but every checkbox is the result of a query (instead of a token), then ES expects us to add a filter aggregation PER possible value:
{
  "query" : {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "no_department": {
      "missing": { "field": "department" }
    },
    "has_department": {
      "filter": {
        "exists": { "field": "department" }
      }
    },
    "type": {
      "terms": { "field": "type" }
    }
  }
}
Response:
{
  "took" : 177,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 361225,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "no_department" : {
      "doc_count" : 361225
    },
    "type" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : "journal_article", "doc_count" : 191579 },
        { "key" : "conference", "doc_count" : 78453 },
        { "key" : "miscellaneous", "doc_count" : 35131 },
        { "key" : "book_chapter", "doc_count" : 30086 },
        { "key" : "dissertation", "doc_count" : 13284 },
        { "key" : "book", "doc_count" : 7447 },
        { "key" : "book_editor", "doc_count" : 5018 },
        { "key" : "issue_editor", "doc_count" : 227 }
      ]
    },
    "has_department" : {
      "doc_count" : 345655
    }
  }
}
See? The terms aggregation for "type" returned all possible values nicely, while we had to add two separate aggregations, "no_department" and "has_department", which means we have to collect from two aggregations while constructing the HTML filter. This also means we have to "invent" a key value (for filling the input value) when someone selects one of them:
e.g. filter "has department":
[ ] has department [x] has no department
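Collecting the two counts into one facet could look roughly like the sketch below. The FacetOption type, the "true"/"false" values and the f[has_department] parameter name are invented for illustration; only the aggregation names come from the query above. The sample response uses made-up counts.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FacetOption is a hypothetical view model for one checkbox in the filter.
type FacetOption struct {
	Value string // invented value submitted by the form
	Label string
	Count int
}

// singleBucket matches aggregations that return only a doc_count,
// such as "filter" and "missing".
type singleBucket struct {
	DocCount int `json:"doc_count"`
}

// departmentFacet merges the two aggregations into one list of options.
func departmentFacet(body []byte) ([]FacetOption, error) {
	var res struct {
		Aggregations struct {
			HasDepartment singleBucket `json:"has_department"`
			NoDepartment  singleBucket `json:"no_department"`
		} `json:"aggregations"`
	}
	if err := json.Unmarshal(body, &res); err != nil {
		return nil, err
	}
	return []FacetOption{
		{Value: "true", Label: "has department", Count: res.Aggregations.HasDepartment.DocCount},
		{Value: "false", Label: "has no department", Count: res.Aggregations.NoDepartment.DocCount},
	}, nil
}

func main() {
	// Made-up sample response, trimmed to the relevant aggregations.
	sample := []byte(`{"aggregations":{"no_department":{"doc_count":3},"has_department":{"doc_count":7}}}`)
	opts, err := departmentFacet(sample)
	if err != nil {
		panic(err)
	}
	for _, o := range opts {
		fmt.Printf("f[has_department]=%s  %s (%d)\n", o.Value, o.Label, o.Count)
	}
}
```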
While previously we could use the returned "key" from the bucket (see the "type" facet) and add f[<field>]=<value> (e.g. f[type]=journal_article), we now have to "invent" values like "true" and "false", map them server side back to the appropriate filter, and put an OR between them.
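A minimal sketch of that server-side mapping, assuming the invented values are the strings "true" and "false". The function names are hypothetical. Each value becomes an exists clause or its negation, and the clauses are OR-ed with a bool/should.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type obj = map[string]interface{}

// presenceFilter maps the invented values back to Elasticsearch clauses.
// Unknown values are ignored; nil is returned when nothing usable is selected.
func presenceFilter(field string, values []string) obj {
	var should []interface{}
	for _, v := range values {
		switch v {
		case "true":
			should = append(should, obj{"exists": obj{"field": field}})
		case "false":
			notExists := obj{"must_not": []interface{}{obj{"exists": obj{"field": field}}}}
			should = append(should, obj{"bool": notExists})
		}
	}
	if len(should) == 0 {
		return nil
	}
	// "should" with minimum_should_match 1 acts as an OR between the clauses.
	return obj{"bool": obj{"should": should, "minimum_should_match": 1}}
}

// presenceFilterJSON renders the filter as JSON, or "" when there is no filter.
func presenceFilterJSON(field string, values []string) string {
	f := presenceFilter(field, values)
	if f == nil {
		return ""
	}
	b, err := json.Marshal(f)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(presenceFilterJSON("department", []string{"false"}))
}
```

Selecting both checkboxes matches every record, so a caller could also just drop the filter in that case.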
Of course we can add index-only fields like has_department, which may be faster (but which will require reindexing).
The question is also where we are supposed to put filter values like that.
e.g. filter "department" is based on existing tokens in the index. A filter "no department" is likely a checkbox to be added on top of it, but technically that means mixing filter aggregations and term aggregations.
Also, if it is only "missing values" you care about, we can solve this by adding a fake facet value (e.g. "none"), which is translated server side into a filter on the missing field.
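The fake facet value idea can be sketched as follows, assuming "none" is the sentinel and that it never occurs as a real token in the index (both are assumptions). Real values go into a terms query, and the sentinel becomes a must_not/exists clause. The function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type obj = map[string]interface{}

// missingValue is the hypothetical sentinel shown as an extra facet value.
const missingValue = "none"

// termsOrMissing builds a filter for the selected facet values, translating
// the sentinel into "field is missing". It returns nil for an empty selection.
func termsOrMissing(field string, selected []string) obj {
	var real []string
	wantMissing := false
	for _, v := range selected {
		if v == missingValue {
			wantMissing = true
		} else {
			real = append(real, v)
		}
	}
	var should []interface{}
	if len(real) > 0 {
		should = append(should, obj{"terms": obj{field: real}})
	}
	if wantMissing {
		notExists := obj{"must_not": []interface{}{obj{"exists": obj{"field": field}}}}
		should = append(should, obj{"bool": notExists})
	}
	if len(should) == 0 {
		return nil
	}
	return obj{"bool": obj{"should": should, "minimum_should_match": 1}}
}

// termsOrMissingJSON renders the filter as JSON, or "" when there is no filter.
func termsOrMissingJSON(field string, selected []string) string {
	f := termsOrMissing(field, selected)
	if f == nil {
		return ""
	}
	b, err := json.Marshal(f)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// "LW01" is a made-up department token.
	fmt.Println(termsOrMissingJSON("department", []string{"LW01", "none"}))
}
```

With this approach the form keeps using f[<field>]=<value> for every checkbox, and only the server needs to know that one value is special.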
Proposal: facets whose values do not originate from controlled vocabularies (like type, vabbb type) should limit their values to the visible scope of the logged-in user. E.g. if a user is not allowed to see withdrawn publications, why should they see values for "year" that originate from withdrawn publications? They cannot fix those, so indeed, why show them?
This cannot be done directly in Elasticsearch:
The filter attribute limits the documents on which facets are calculated; it does NOT limit the returned facet values. E.g. year "3000" is returned as a facet value, even though it is filtered out.
Setting min_doc_count to 1 means only facet values with matching documents are returned. But then the other facet values (those with fewer than 1 match) are thrown out. Because there is no controlled vocabulary, there is no way to add these missing facet values yourself, so there is no way to select the other facet values anymore.
So I propose to prefetch these values using an extra call to Elasticsearch, and then add them to the include statement.
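The prefetch proposal could be sketched like this. It assumes the extra call is a size-0 search restricted to the user's visible scope with a terms aggregation named "values" (a hypothetical name), and that the facet field is a keyword field so bucket keys are strings. Setting min_doc_count to 0 in the main aggregation is also an assumption about how the two settings would be combined; an empty include list is not handled here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bucketKeys extracts the facet values from the prefetch response.
func bucketKeys(body []byte) ([]string, error) {
	var res struct {
		Aggregations struct {
			Values struct {
				Buckets []struct {
					Key string `json:"key"`
				} `json:"buckets"`
			} `json:"values"`
		} `json:"aggregations"`
	}
	if err := json.Unmarshal(body, &res); err != nil {
		return nil, err
	}
	keys := make([]string, 0, len(res.Aggregations.Values.Buckets))
	for _, b := range res.Aggregations.Values.Buckets {
		keys = append(keys, b.Key)
	}
	return keys, nil
}

// facetAggJSON builds the terms aggregation for the main search: only the
// prefetched values are allowed, and they are kept even with zero matches.
func facetAggJSON(field string, include []string) string {
	agg := map[string]interface{}{
		"terms": map[string]interface{}{
			"field":         field,
			"include":       include,
			"min_doc_count": 0,
		},
	}
	b, err := json.Marshal(agg)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Made-up prefetch response for the values visible to the current user.
	prefetch := []byte(`{"aggregations":{"values":{"buckets":[{"key":"2019","doc_count":4},{"key":"2020","doc_count":9}]}}}`)
	keys, err := bucketKeys(prefetch)
	if err != nil {
		panic(err)
	}
	fmt.Println(facetAggJSON("year", keys))
}
```

The cost is one extra round trip per search (or per cached scope), which is the trade-off against showing values the user can never reach.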
Okay great research!
I tracked stories to visualise what the expectation is for filtering on "no year" and on "no faculty" here:
I will double check with the reviewers and curators first if this is what they need.
Can someone create separate issues that outline the technical steps needed to solve these problems?
If this one is fixed, can we also implement #601 and perhaps #914 ?
@mietcls fixed where? Because this is just a question?
@nicolasfranck see comment above my previous comment.
Stories created, comments can be used later.
Question
Is it possible to show "records without department" and "records without publication year" in the filter sets?
Right now people have to make exports, or stumble upon these records by accident, to find them.