ugent-library / biblio-backoffice


Is it possible to show "records without department" and "records without publication year" in the filters? #874

Closed mietcls closed 1 year ago

mietcls commented 1 year ago

Question

Is it possible to show "records without department" and "records without publication year" in the filter sets?

Right now people have to make exports, or stumble upon such records by accident, to find them.

netsensei commented 1 year ago

It's possible, at least in Elasticsearch, to construct queries that return:

I ran a quick test and I was able to retrieve records without a department or records without a publication year. That is, these were records that didn't have those fields. So, it is possible. The challenge here is sorting out the semantics e.g. when someone doesn't fill out a form field and how that gets indexed, or how this will apply to older data.
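The exact queries from that test are not shown in this thread. As a rough sketch of one way to ask Elasticsearch for documents that lack a field, a `bool` query with a `must_not` / `exists` clause can be built like this. The helper name `missingFieldQuery` and the field name `year` are assumptions for illustration; `department` is the field name used later in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// missingFieldQuery builds an Elasticsearch request body that matches
// documents in which the given field has no indexed value.
// (Hypothetical helper, not part of biblio-backoffice.)
func missingFieldQuery(field string) map[string]any {
	return map[string]any{
		"query": map[string]any{
			"bool": map[string]any{
				"must_not": map[string]any{
					"exists": map[string]any{"field": field},
				},
			},
		},
	}
}

func main() {
	// "year" is an assumed field name; the real mapping may differ.
	for _, field := range []string{"department", "year"} {
		body, err := json.Marshal(missingFieldQuery(field))
		if err != nil {
			panic(err)
		}
		fmt.Println(string(body))
	}
}
```

Whether an empty form field ends up as "no value" in the index still depends on how records are serialized, which is the semantics question raised above.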

mietcls commented 1 year ago

@netsensei thanks! Do we need "research" issues for this or should we try implementing it?

netsensei commented 1 year ago

I'd go for "research" and I would also like some input from @nicolasfranck on this one. :-)

nicolasfranck commented 1 year ago

Luckily, because of how Go encodes JSON, fields with an empty value are not serialized (if they are marked with omitempty, which is almost always the case).
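A small sketch of that behaviour with the standard `encoding/json` package. The `Record` struct, its field names and the department code `LW01` are made up for illustration; the real structs in biblio-backoffice will look different.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Record is a hypothetical, trimmed-down publication record.
type Record struct {
	ID         string   `json:"id"`
	Year       string   `json:"year,omitempty"`
	Department []string `json:"department,omitempty"`
}

// encode serializes a record to JSON, as it might be sent to the index.
func encode(r Record) string {
	b, err := json.Marshal(r)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Empty string and nil slice are dropped because of omitempty.
	fmt.Println(encode(Record{ID: "1"}))
	// prints {"id":"1"}

	fmt.Println(encode(Record{ID: "2", Year: "2021", Department: []string{"LW01"}}))
	// prints {"id":"2","year":"2021","department":["LW01"]}
}
```

Since the key is absent from the JSON, the indexed document has no such field, so an `exists` or `missing` check can tell the two records apart.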

nicolasfranck commented 1 year ago

Note: if we are adding a facet with several checkboxes (like we did before), but every checkbox is the result of a query (instead of a token), then ES expects us to add a filter aggregation PER possible value:

{
   "query" : {
      "match_all": {}
   },
   "size": 0,
   "aggs": {
      "no_department": {
        "missing": { "field": "department" }
      },
      "has_department": {
        "filter": {
          "exists": { "field": "department" }
        }
      },
      "type": {
        "terms": { "field": "type" }
      }
   }
}

Response:

{
  "took" : 177,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 361225,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "no_department" : {
      "doc_count" : 361225
    },
    "type" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "journal_article",
          "doc_count" : 191579
        },
        {
          "key" : "conference",
          "doc_count" : 78453
        },
        {
          "key" : "miscellaneous",
          "doc_count" : 35131
        },
        {
          "key" : "book_chapter",
          "doc_count" : 30086
        },
        {
          "key" : "dissertation",
          "doc_count" : 13284
        },
        {
          "key" : "book",
          "doc_count" : 7447
        },
        {
          "key" : "book_editor",
          "doc_count" : 5018
        },
        {
          "key" : "issue_editor",
          "doc_count" : 227
        }
      ]
    },
    "has_department" : {
      "doc_count" : 345655
    }
  }
}

See? The terms aggregation for type returned all possible values nicely, while we had to add two extra aggregations, no_department and has_department. This means we have to collect from two aggregations while constructing the HTML filter. It also means we have to "invent" a key value (for filling the input value) when someone selects one of them:

e.g. filter "has department":

[ ] has department [x] has no department
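To make the "collect from two aggregations" point concrete, here is a sketch that parses a response shaped like the one above and flattens it into facet values. The types `FacetValue` and `aggResult`, the function `parseFacets`, and the keys "true" and "false" are all assumptions for illustration, not existing code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FacetValue is one checkbox in the rendered filter (hypothetical type).
type FacetValue struct {
	Key   string
	Count int
}

// aggResult covers both shapes in the response: a terms aggregation fills
// Buckets, a filter or missing aggregation only fills DocCount.
type aggResult struct {
	DocCount int `json:"doc_count"`
	Buckets  []struct {
		Key      string `json:"key"`
		DocCount int    `json:"doc_count"`
	} `json:"buckets"`
}

// parseFacets flattens the aggregations into facet values. The terms
// aggregation yields one value per bucket; the two single-bucket
// aggregations are merged into one facet under invented keys.
func parseFacets(raw []byte) (map[string][]FacetValue, error) {
	var res struct {
		Aggregations map[string]aggResult `json:"aggregations"`
	}
	if err := json.Unmarshal(raw, &res); err != nil {
		return nil, err
	}
	facets := map[string][]FacetValue{}
	for _, b := range res.Aggregations["type"].Buckets {
		facets["type"] = append(facets["type"], FacetValue{Key: b.Key, Count: b.DocCount})
	}
	facets["has_department"] = []FacetValue{
		{Key: "true", Count: res.Aggregations["has_department"].DocCount},
		{Key: "false", Count: res.Aggregations["no_department"].DocCount},
	}
	return facets, nil
}

func main() {
	// Trimmed copy of the response above.
	raw := []byte(`{"aggregations":{
		"no_department":{"doc_count":361225},
		"has_department":{"doc_count":345655},
		"type":{"buckets":[{"key":"journal_article","doc_count":191579}]}}}`)
	facets, err := parseFacets(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(facets["type"], facets["has_department"])
}
```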

While previously we could use the returned "key" from the bucket (see facet "type") and add f[<field>]=<value> (e.g. f[type]=journal_article), we now have to "invent" a value like "true" and "false", map them server-side back to the appropriate filter, and put an OR between them.

Of course we can add index-only fields like has_department, which will probably work faster (but will require reindexing).
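A sketch of that server-side mapping, assuming the invented checkbox values are the strings "true" and "false". The function names are hypothetical; the `exists` query and the `bool` / `should` / `must_not` clauses are standard Elasticsearch query DSL.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// existsClause matches documents that have a value for field.
func existsClause(field string) map[string]any {
	return map[string]any{"exists": map[string]any{"field": field}}
}

// hasFieldFilter translates the invented checkbox values "true" and
// "false" into Elasticsearch clauses and puts an OR between them.
// Unknown values are ignored; nil means no checkbox was selected.
func hasFieldFilter(field string, selected []string) map[string]any {
	var should []any
	for _, v := range selected {
		switch v {
		case "true":
			should = append(should, existsClause(field))
		case "false":
			should = append(should, map[string]any{
				"bool": map[string]any{"must_not": existsClause(field)},
			})
		}
	}
	if len(should) == 0 {
		return nil
	}
	return map[string]any{
		"bool": map[string]any{"should": should, "minimum_should_match": 1},
	}
}

func main() {
	// Corresponds to "[ ] has department [x] has no department".
	body, err := json.Marshal(hasFieldFilter("department", []string{"false"}))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

With both boxes ticked the OR matches every record, so the caller could just as well skip the filter in that case.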

nicolasfranck commented 1 year ago

The question is also where we are supposed to put filter values like that.

e.g. the filter "department" is based on existing tokens in the index. A filter "no department" is likely a checkbox to be added on top of it, but technically that means mixing filter aggregations and terms aggregations.

nicolasfranck commented 1 year ago

Also, if it is only "missing values" you care about, we can also solve this by adding a fake facet value (e.g. "none"), which is translated server-side into a filter on the missing field.
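Roughly what that translation could look like, assuming the fake value is the literal string "none" and that regular values are matched with a `terms` query. The names `missingValue` and `facetFilter` and the department code `LW01` are made up for this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// missingValue is the fake facet value that stands for "field is missing".
const missingValue = "none"

// facetFilter turns the selected values of a token-based facet into an
// Elasticsearch clause. Real tokens go into a terms query; the fake value
// becomes a must_not/exists clause. All selected values are OR-ed.
func facetFilter(field string, selected []string) map[string]any {
	var tokens []string
	var should []any
	for _, v := range selected {
		if v == missingValue {
			should = append(should, map[string]any{
				"bool": map[string]any{
					"must_not": map[string]any{
						"exists": map[string]any{"field": field},
					},
				},
			})
			continue
		}
		tokens = append(tokens, v)
	}
	if len(tokens) > 0 {
		should = append(should, map[string]any{
			"terms": map[string]any{field: tokens},
		})
	}
	if len(should) == 0 {
		return nil
	}
	return map[string]any{
		"bool": map[string]any{"should": should, "minimum_should_match": 1},
	}
}

func main() {
	// e.g. f[department]=none&f[department]=LW01
	body, err := json.Marshal(facetFilter("department", []string{"none", "LW01"}))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

One caveat with this approach: the fake value must never collide with a real token in the index.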

nicolasfranck commented 1 year ago

Proposal: facets whose values do not originate from controlled vocabularies (like type, vabbb type) should limit their values to the visible scope of the logged-in user. e.g. if a user is not allowed to see withdrawn publications, why should they see values for "year" that originate from withdrawn publications? They cannot fix those, so indeed, why show them?

This cannot be done directly in Elasticsearch:

So I propose to prefetch these values using an extra call to Elasticsearch, and then add them to the include statement
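A sketch of the second half of that proposal, assuming "include" refers to the `include` option of the terms aggregation, which accepts a list of exact values. The prefetch call itself is left out; `scopedTermsAgg`, the field name `year` and the sample values are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scopedTermsAgg builds a terms aggregation for field that only reports
// the given values, via the "include" option with exact values.
// It returns nil when there is nothing to include.
func scopedTermsAgg(field string, visible []string) map[string]any {
	if len(visible) == 0 {
		return nil
	}
	return map[string]any{
		"terms": map[string]any{
			"field":   field,
			"include": visible,
		},
	}
}

func main() {
	// Pretend these values came back from the extra, scope-restricted call.
	visibleYears := []string{"2019", "2020", "2021"}

	body, err := json.Marshal(map[string]any{
		"size": 0,
		"aggs": map[string]any{
			"year": scopedTermsAgg("year", visibleYears),
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

A terms aggregation returns only the top 10 buckets by default, so a real implementation would probably also need to set `size`.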

mietcls commented 1 year ago

Okay great research!

I tracked stories to visualise what the expectation is for filtering on "no year" and on "no faculty" here:

Can someone create separate issues that outline the technical steps needed to solve these problems?

mietcls commented 1 year ago

If this one is fixed, can we also implement #601 and perhaps #914 ?

nicolasfranck commented 1 year ago

@mietcls fixed where? Because this is just a question?

mietcls commented 1 year ago

@nicolasfranck see comment above my previous comment.

mietcls commented 1 year ago

Stories created, comments can be used later.