hbz / lobid-resources

Transformation, web frontend, and API for the hbz catalog as LOD
http://lobid.org/resources
Eclipse Public License 2.0

Wrong count previews in owner facet #207

Open fsteeg opened 7 years ago

fsteeg commented 7 years ago

Since owners are based on exemplar aggregations, and aggregation requests have a limited size, the owner counts are wrong (we only see the owners of the X most frequent exemplars, whose counts are actually all 1). To fix this, we have to make the aggregation processing efficient enough to allow an exemplar aggregation request of unlimited size.
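
To make the effect concrete, here is a minimal simulation in plain Python (no Elasticsearch involved). The item ID layout `<resource>:<ISIL>:<call number>` and the sample records are toy assumptions loosely modelled on the IDs in this thread, and `owner_counts_from_item_buckets` is a hypothetical helper, not the lobid code.

```python
from collections import Counter

def owner_of(item_id: str) -> str:
    # Toy ID layout "<resource>:<ISIL>:<call number>" (an assumption for this sketch).
    return item_id.split(":")[1]

def owner_counts_from_item_buckets(records, size=None):
    """Mimic a terms aggregation over item IDs that is cut off after
    `size` buckets, then derive owner counts from the surviving buckets."""
    # Every item ID occurs in exactly one record, so every bucket has count 1.
    item_buckets = Counter(item_id for record in records for item_id in record)
    owners = Counter()
    for item_id, doc_count in item_buckets.most_common(size):
        owners[owner_of(item_id)] += doc_count
    return owners

records = [
    ["R1:DE-6:a", "R1:DE-38:a"],
    ["R2:DE-6:a"],
    ["R3:DE-6:a", "R3:DE-38:a"],
    ["R4:DE-38:a"],
]

full = owner_counts_from_item_buckets(records)               # no size limit
truncated = owner_counts_from_item_buckets(records, size=2)  # limited request
print("unlimited:", dict(full))
print("size=2:   ", dict(truncated))
```

With the size limit, the owner counts can never add up to more than the number of item buckets returned, no matter how many items actually match.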

fsteeg commented 7 years ago

This can be reproduced with any query returning a high result count, e.g. the owner facet for: http://lobid.org/resources/search?q=k%C3%B6ln

fsteeg commented 7 years ago

The basic problem here is that we are faceting over a field (the item owner) that's not in our data. This approach won't work for the entire catalog: if we query everything, we'd have to get all items, and create the owner facet from that.

Instead, I suggest we add an exemplar.owner field, so for example in http://lobid.org/resources/HT012213725?format=json we'd have:

"exemplar": [{
  "id": "http://lobid.org/items/HT012213725:DE-6:ZD%207381#!",
  "owner": "http://lobid.org/organisations/DE-6",
  "label": "lobid Bestandsressource"
}],

That way, we could simply facet over exemplar.owner directly, which would give us all owners (not all items, as with the current facet, which is based on exemplar.id).
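
As a rough sketch of what faceting over the proposed field could look like, the following builds an Elasticsearch request body with a terms aggregation on `exemplar.owner`. It assumes the field exists and is indexed as an exact, non-analyzed value; the `query_string` query, the aggregation name `owner`, and the `max_owners` parameter are illustrative choices, not the actual lobid implementation.

```python
import json

def owner_facet_request(q: str, max_owners: int = 1000) -> dict:
    """Build a search body that facets directly over exemplar.owner."""
    return {
        "size": 0,  # only the aggregation is of interest, not the hits
        "query": {"query_string": {"query": q}},
        "aggs": {
            "owner": {
                # The number of buckets is bounded by the number of owning
                # libraries, not by the number of items.
                "terms": {"field": "exemplar.owner", "size": max_owners}
            }
        },
    }

body = owner_facet_request("köln")
print(json.dumps(body, indent=2, ensure_ascii=False))
```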

What do you think @dr0i @acka47? If it makes no sense to expose the owner in the data (but I do think it's useful for API usage), we could also create an internal Elasticsearch field or a custom aggregation. If we do want to expose it, we should add it on the Metafacture level.

acka47 commented 7 years ago

+1 from me. I already proposed embedding item information in the instance data, see #140. We might just reopen that issue.

dr0i commented 7 years ago

A child aggregation on our data for the query "köln" seems to yield a plausible result:

"hits" : {
"total" : 569.808,
 ...
"aggregations" : {
"items" : {
  "doc_count" : 1.686.515,
  "top-isil" : {
  ...
    "buckets" : [ {
      "key" : "http://lobid.org/organisations/DE-38",
      "doc_count" : 172.288
    } ...

I can imagine that the factor of 3 in the resources/items ratio is a result of libraries holding more than one item. Is this acceptable, or do you really want a ratio of 1? I doubt, though, that moving the data from the child into the parent would give us that 1/1 ratio: we would then have e.g. three identical exemplar.owner.id values (reflecting multiple holdings of a manifestation, aka "resource"), and an aggregation over them would not result in a 1/1 ratio without tinkering with a filter or something.
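
For reference, a request that could produce a response shaped like the one quoted above might look roughly like this. The aggregation names `items` and `top-isil` are taken from that response; the child type name `item` and the owner field passed as `owner_field` are assumptions, since the actual mapping is not shown in this thread.

```python
import json

def items_per_owner_request(q: str, owner_field: str,
                            child_type: str = "item", size: int = 10) -> dict:
    """Build a search body with a children aggregation over item documents."""
    return {
        "size": 0,
        "query": {"query_string": {"query": q}},
        "aggs": {
            "items": {
                # Switch from the matching parent resources to their child items.
                "children": {"type": child_type},
                "aggs": {
                    # Count child items per owner (hypothetical field name).
                    "top-isil": {"terms": {"field": owner_field, "size": size}}
                },
            }
        },
    }

request = items_per_owner_request("köln", owner_field="owner")
print(json.dumps(request, indent=2, ensure_ascii=False))
```

Note that the buckets of such an aggregation count child items, which is why the totals can exceed the number of matching resources.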

fsteeg commented 7 years ago

Oh nice, a child aggregation, I didn't consider that. That should work, I will try it.

fsteeg commented 7 years ago

Reopening, see discussion starting in https://github.com/hbz/lobid-resources/issues/278#issuecomment-283329330.

acka47 commented 3 years ago

This came up again, see #1169, where @hagbeck wrote:

From the Aleph based index we're getting 1.334.514 records [1]. The facet "Bestand in Bibliotheken" (holdings in libraries) in the Aleph based index shows 1.471.170 records.

[1] http://lobid.org/resources/search?owner=http%3A%2F%2Flobid.org%2Forganisations%2FDE-290%23%21&aggregations=owner

I pointed out this problem in https://github.com/hbz/lobid-resources/issues/278#issuecomment-283333385:

Isn't the underlying mechanism that the facet gives the number of items while the query result lists the FRBR manifestations (or in bibframe-speak: instances)?
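
The difference between the two counts can be illustrated with a few lines of plain Python. The holdings below are invented sample data, and the two helpers are hypothetical; they only show how counting items and counting resources diverge as soon as a library holds several copies of the same resource.

```python
from collections import Counter

# Invented sample: resource ID -> owner of each of its items.
holdings = {
    "R1": ["DE-38", "DE-38", "DE-38"],  # one library, three copies
    "R2": ["DE-38", "DE-6"],
    "R3": ["DE-6"],
}

def count_items(holdings):
    """Facet-style count: every item contributes one to its owner."""
    return Counter(owner for owners in holdings.values() for owner in owners)

def count_resources(holdings):
    """Result-list-style count: every resource counts once per distinct owner."""
    return Counter(owner for owners in holdings.values() for owner in set(owners))

items = count_items(holdings)
resources = count_resources(holdings)
print("items per owner:    ", dict(items))
print("resources per owner:", dict(resources))
```

Here DE-38 owns four items but only two resources, the same kind of gap as between the facet count and the result count reported above.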

TobiasNx commented 1 year ago

This came up again in the context of the comparison of ALMA and ALEPH resources of UB Münster. Ideally this should be fixed before ALMA Fix replaces ALEPH-Morph. https://github.com/hbz/lobid-resources/issues/1601

acka47 commented 1 year ago

@blackwinter will check whether this should be added to the DigiBib milestone.

blackwinter commented 1 year ago

We would not be affected by this issue.