openeduhub / metaqs-main

Backend Service providing Information about Completeness of Metadata and Coverage of Topics

Idea to fully eliminate the background tasks #110

Open MRuecklCC opened 2 years ago

MRuecklCC commented 2 years ago

Currently, we can use the following query to get the number of materials with missing properties, broken down by collection or learning resource type:

GET /workspace/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "permissions.Read.keyword": "GROUP_EVERYONE"
          }
        },
        {
          "term": {
            "properties.cm:edu_metadataset.keyword": "mds_oeh"
          }
        },
        {
          "term": {
            "nodeRef.storeRef.protocol": "workspace"
          }
        },
        {
          "term": {
            "type": "ccm:io"
          }
        },
        # This filter restricts the materials to those in the respective collection subtree.
        {
          "bool": {
            "should": [
              {
                "term": {
                  "collections.nodeRef.id.keyword": "15fce411-54d9-467f-8f35-61ea374a298d"
                }
              },
              {
                "match": {
                  "collections.path.keyword": "15fce411-54d9-467f-8f35-61ea374a298d"
                }
              }
            ]
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "collection": {
      "terms": {
        # Here we could also group by learning resource type instead.
        "field": "collections.nodeRef.id.keyword",
        "size": 100
      },
      "aggs": {
        "feedback_comment": {
          "missing": {
            "field": "properties.feedback_comment.keyword"
          }
        },
        "ccm:metadatacontributer_creator": {
          "missing": {
            "field": "properties.ccm:metadatacontributer_creator.keyword"
          }
        },
        "ccm:metadatacontributer_provider": {
          "missing": {
            "field": "properties.ccm:metadatacontributer_provider.keyword"
          }
        },
        "ccm:metadatacontributer_validator": {
          "missing": {
            "field": "properties.ccm:metadatacontributer_validator.keyword"
          }
        },
        # ... <other missing-attribute sub-aggregations>
      }
    }
  }
}
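Since the sub-aggregations differ only in the attribute name, the request body could be generated from a list of attributes. The following is a minimal Python sketch, not the project's actual implementation: the names BASE_FILTERS, subtree_filter and build_count_query are hypothetical, and it assumes every attribute is indexed as properties.<name>.keyword, as in the query above.

```python
# Hypothetical helper names; the filter values are copied from the query above.
BASE_FILTERS = [
    {"term": {"permissions.Read.keyword": "GROUP_EVERYONE"}},
    {"term": {"properties.cm:edu_metadataset.keyword": "mds_oeh"}},
    {"term": {"nodeRef.storeRef.protocol": "workspace"}},
    {"term": {"type": "ccm:io"}},
]


def subtree_filter(collection_id: str) -> dict:
    """Match materials in the collection itself or anywhere below it."""
    return {
        "bool": {
            "should": [
                {"term": {"collections.nodeRef.id.keyword": collection_id}},
                {"match": {"collections.path.keyword": collection_id}},
            ]
        }
    }


def build_count_query(
    collection_id: str,
    attributes: list,
    group_by: str = "collections.nodeRef.id.keyword",
    max_buckets: int = 100,
) -> dict:
    """Build the count-only search body: one 'missing' sub-aggregation per attribute."""
    missing_aggs = {
        # Assumption: each attribute lives under properties.<name>.keyword.
        attribute: {"missing": {"field": f"properties.{attribute}.keyword"}}
        for attribute in attributes
    }
    return {
        "query": {"bool": {"filter": [*BASE_FILTERS, subtree_filter(collection_id)]}},
        "size": 0,  # only the aggregation result is needed, no hits
        "aggs": {
            "collection": {
                "terms": {"field": group_by, "size": max_buckets},
                "aggs": missing_aggs,
            }
        },
    }
```

Passing group_by="properties.ccm:oeh_lrt.keyword" would group by learning resource type instead of by collection.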

If we find a way to also include the material IDs in the result of the aggregation, we are good: we can then use this one query to generate the data for the quality matrices as well as the overview that links to the materials with missing attributes: (screenshot)

If we cannot build such a query, we can work around this by showing only the numbers, as in the screenshot above. Once the user clicks one of the numbers, we run a query for that specific cell to fetch the set of material IDs, which are then passed into the link to edusharing.
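A per-cell query for this workaround could look roughly like the sketch below. It relies on the fact that a must_not/exists clause selects the same documents that a missing aggregation counts. The function names, the size default, and the assumption that attributes are indexed as properties.<name>.keyword are illustrative, not taken from the existing endpoint.

```python
def build_material_ids_query(
    root_collection_id: str,
    bucket_field: str,
    bucket_key: str,
    attribute: str,
    size: int = 500,
) -> dict:
    """Search body for one matrix cell: materials in one bucket that lack one attribute."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"permissions.Read.keyword": "GROUP_EVERYONE"}},
                    {"term": {"properties.cm:edu_metadataset.keyword": "mds_oeh"}},
                    {"term": {"nodeRef.storeRef.protocol": "workspace"}},
                    {"term": {"type": "ccm:io"}},
                    # Same subtree restriction as in the count query.
                    {
                        "bool": {
                            "should": [
                                {"term": {"collections.nodeRef.id.keyword": root_collection_id}},
                                {"match": {"collections.path.keyword": root_collection_id}},
                            ]
                        }
                    },
                    # Restrict to the bucket (row) the user clicked on.
                    {"term": {bucket_field: bucket_key}},
                ],
                # Query-side counterpart of the 'missing' aggregation.
                "must_not": [{"exists": {"field": f"properties.{attribute}.keyword"}}],
            }
        },
        "_source": ["nodeRef.id"],  # only the material ID is needed for the link
        "size": size,
    }


def extract_material_ids(response: dict) -> list:
    """Collect the material IDs from a search response."""
    return [hit["_source"]["nodeRef"]["id"] for hit in response["hits"]["hits"]]
```

Note that from + size of a regular search is itself capped by index.max_result_window (10000 by default), so very large cells would need paging, for example with search_after.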

MRuecklCC commented 2 years ago

@torsten-simon fyi. I would actually prefer the simpler query. Implementation-wise, it would replace three different implementations with one. The endpoint for the materials that lack a certain attribute is already in place (for the tiles that also contain the preview), so all we would need to do is implement this on the frontend side :roll_eyes:

MRuecklCC commented 2 years ago

An alternative query that returns the material IDs per inner bucket (up to the size given in the top_hits aggregation) looks as follows:

GET /workspace/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "permissions.Read.keyword": "GROUP_EVERYONE"
          }
        },
        {
          "term": {
            "properties.cm:edu_metadataset.keyword": "mds_oeh"
          }
        },
        {
          "term": {
            "nodeRef.storeRef.protocol": "workspace"
          }
        },
        {
          "term": {
            "type": "ccm:io"
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "collections.nodeRef.id.keyword": "15fce411-54d9-467f-8f35-61ea374a298d"
                }
              },
              {
                "match": {
                  "collections.path.keyword": "15fce411-54d9-467f-8f35-61ea374a298d"
                }
              }
            ]
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "collection": {
      "terms": {
        "field": "properties.ccm:oeh_lrt.keyword",
        "size": 100
      },
      "aggs": {
        "feedback_comment": {
          "aggs": {
            "docs": {
              "top_hits": {
                "size": 10,
                "_source": [
                  "nodeRef.id"
                ]
              }
            }
          },
          "missing": {
            "field": "properties.feedback_comment.keyword"
          }
        },
        "ccm:metadatacontributer_creator": {
          "aggs": {
            "docs": {
              "top_hits": {
                "size": 10,
                "_source": [
                  "nodeRef.id"
                ]
              }
            }
          },
          "missing": {
            "field": "properties.ccm:metadatacontributer_creator.keyword"
          }
        }
      }
    }
  }
}

Thanks to @rfalke <3

MRuecklCC commented 2 years ago

Unfortunately, the above query does not work either without reconfiguring the Elasticsearch server / index :-(

RequestError(400, 'search_phase_execution_exception', "Top hits result window is too large, the top hits aggregator [material-ids]'s from + size must be less than or equal to: [100] but was [500000]. This limit can be set by changing the [index.max_inner_result_window] index level setting.")
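For reference, the limit named in the error message is a dynamic index-level setting, so it could in principle be raised through the index settings API. The sketch below only builds the request; the helper name is hypothetical, and raising the limit to values like 500000 is likely to be costly in memory, so this is shown for completeness rather than as a recommendation.

```python
def build_inner_result_window_request(index: str, limit: int) -> tuple:
    """Return (method, path, body) for raising index.max_inner_result_window."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    body = {"index": {"max_inner_result_window": limit}}
    return ("PUT", f"/{index}/_settings", body)


# Example: the request that would lift the limit for the 'workspace' index.
method, path, body = build_inner_result_window_request("workspace", 1000)
```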

I guess the best solution is to only query the number of materials in the respective buckets and then issue a separate request to get the material ids of the bucket once the user clicks on the number in the frontend.
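Turning the count-only aggregation response into the numbers for the matrix could then be done along these lines. This is a sketch under the assumption that the outer terms aggregation is named "collection" as in the queries above; the function name and the sample response values are made up for illustration.

```python
def parse_missing_counts(response: dict, agg_name: str = "collection") -> dict:
    """Map each bucket key to its total and its per-attribute missing counts."""
    matrix = {}
    for bucket in response["aggregations"][agg_name]["buckets"]:
        matrix[bucket["key"]] = {
            "total": bucket["doc_count"],
            # Every dict-valued entry with a doc_count is a 'missing' sub-aggregation.
            "missing": {
                name: value["doc_count"]
                for name, value in bucket.items()
                if isinstance(value, dict) and "doc_count" in value
            },
        }
    return matrix


# Made-up sample response in the shape Elasticsearch returns for this aggregation.
sample_response = {
    "aggregations": {
        "collection": {
            "buckets": [
                {
                    "key": "15fce411-54d9-467f-8f35-61ea374a298d",
                    "doc_count": 12,
                    "feedback_comment": {"doc_count": 12},
                    "ccm:metadatacontributer_creator": {"doc_count": 3},
                }
            ]
        }
    }
}
counts = parse_missing_counts(sample_response)
```

Each (bucket key, attribute) pair in the result corresponds to one clickable number in the frontend; the material IDs for that cell would only be fetched on demand.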