hubmapconsortium / search-api

HuBMAP search service and associated pieces to create an index
https://search.api.hubmapconsortium.org
MIT License

Enable sorting on `publication_date` field #890

Open NickAkhmetov opened 2 weeks ago

NickAkhmetov commented 2 weeks ago

We currently do not sort the results on the publications page: https://portal.hubmapconsortium.org/publications

As a result, the publications are displayed in a different order with each page load.

To address this, we would like to sort the results by the publication date; however, this is currently not supported by the data model, as the publication_date field is not marked as a data field.

Query:

{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              {
                "term": {
                  "entity_type.keyword": "Publication"
                }
              },
              {
                "term": {
                  "publication_status": "true"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must_not": [
              {
                "exists": {
                  "field": "next_revision_uuid"
                }
              },
              {
                "exists": {
                  "field": "sub_status"
                }
              }
            ]
          }
        }
      ]
    }
  },
  "sort": [
    {
      "publication_date": {
        "order": "desc"
      }
    }
  ],
  "size": 10000,
  "_source": [
    "uuid",
    "title",
    "contributors",
    "publication_status",
    "publication_venue",
    "publication_date"
  ]
}

Response:

{
  "error": {
    "caused_by": {
      "caused_by": {
        "reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [publication_date] in order to load field data by uninverting the inverted index. Note that this can use significant memory.",
        "type": "illegal_argument_exception"
      },
      "reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [publication_date] in order to load field data by uninverting the inverted index. Note that this can use significant memory.",
      "type": "illegal_argument_exception"
    },
    "failed_shards": [
      {
        "index": "hm_prod_public_portal",
        "node": "QZ3KIU_hRuKFUfQfZ-6ISA",
        "reason": {
          "reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [publication_date] in order to load field data by uninverting the inverted index. Note that this can use significant memory.",
          "type": "illegal_argument_exception"
        },
        "shard": 0
      }
    ],
    "grouped": true,
    "phase": "can_match",
    "reason": "all shards failed",
    "root_cause": [
      {
        "reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [publication_date] in order to load field data by uninverting the inverted index. Note that this can use significant memory.",
        "type": "illegal_argument_exception"
      }
    ],
    "type": "search_phase_execution_exception"
  },
  "status": 400
}

lchoy commented 2 weeks ago

Due to dynamic mapping, the publication_date field is currently mapped as a text field; it should instead be mapped as a long field so that epoch timestamps are handled properly. To sort on publication_date without changing the mapping, you can use the publication_date.keyword field. For better efficiency and accuracy, however, it is advisable to map publication_date directly as a long type, which removes the need for a separate keyword subfield.
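
The workaround described above can be sketched in Python as a small rewrite of the request body. This is only an illustration: the helper name `sort_on_keyword` is hypothetical, the query is a trimmed-down version of the one in the issue, and it assumes dynamic mapping created the usual `.keyword` subfield for `publication_date`. The last lines show why a keyword sort can be less accurate than a numeric one, assuming the values are epoch timestamps stored as strings.

```python
import copy


def sort_on_keyword(query: dict, field: str = "publication_date") -> dict:
    """Return a copy of `query` whose sort clauses on `field` target `field.keyword`.

    Hypothetical helper for illustration; it is not part of search-api.
    """
    patched = copy.deepcopy(query)
    if "sort" not in patched:
        return patched
    new_sort = []
    for clause in patched["sort"]:
        # Only rewrite dict-style clauses such as {"publication_date": {"order": "desc"}}.
        if isinstance(clause, dict) and field in clause:
            clause = {f"{field}.keyword": clause[field]}
        new_sort.append(clause)
    patched["sort"] = new_sort
    return patched


# Trimmed-down version of the query from the issue.
query = {
    "query": {"term": {"entity_type.keyword": "Publication"}},
    "sort": [{"publication_date": {"order": "desc"}}],
    "size": 10000,
}

patched = sort_on_keyword(query)
print(patched["sort"])

# Keyword sorting compares strings character by character, so numeric
# strings of different lengths can come out in the wrong order.
as_strings = sorted(["999", "1000"], reverse=True)
as_numbers = sorted([999, 1000], reverse=True)
print(as_strings, as_numbers)
```

The original query is left unmodified, so the same request body can be switched back to a plain `publication_date` sort once the mapping changes.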

NickAkhmetov commented 2 weeks ago

@lchoy Thanks for pointing out that .keyword solves this issue! Since there aren't too many publications currently in the system, that workaround should suffice for the time being, but this would definitely still be valuable in the long run.
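
For the longer-term fix, a minimal sketch of an explicit mapping is shown below. It assumes `publication_date` holds epoch timestamps, as suggested in the comment above; the function names are hypothetical and the bodies are plain Elasticsearch request payloads, not code from search-api. Elasticsearch does not allow changing the type of an existing field in place, so a mapping like this would have to be applied to a new index, followed by a reindex.

```python
import json


def long_field_mapping(field: str = "publication_date") -> dict:
    """Build an index-creation body that maps `field` explicitly as `long`.

    Hypothetical helper; the field name and type follow the discussion above.
    """
    return {"mappings": {"properties": {field: {"type": "long"}}}}


def numeric_sort(field: str = "publication_date", order: str = "desc") -> list:
    """Build a sort clause that targets the field directly, with no `.keyword` suffix."""
    return [{field: {"order": order}}]


mapping_body = long_field_mapping()
sort_clause = numeric_sort()

print(json.dumps(mapping_body, indent=2))
print(json.dumps(sort_clause))
```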