hubmapconsortium / search-api

HuBMAP search service and associated pieces to create an index
MIT License

Investigate reindex 409 errors on PROD #834

Closed yuanzhou closed 1 month ago

yuanzhou commented 1 month ago


ERROR:hubmap_translator:Unable to directly update elements of document with related_entity_target_elements=['immediate_descendants', 'descendants', 'immediate_ancestors', 'ancestors', 'source_samples', 'origin_samples', 'datasets'], related_entity_id=3280ceaaee2c24e262d46a5f83edfe98. Got status_code=409 at es_url=, endoint '_update/3280ceaaee2c24e262d46a5f83edfe98' with qdsl_update_payload_string={"script": {  "lang": "painless",  "source": "for (prop in ['immediate_descendants', 'descendants', 'immediate_ancestors', 'ancestors', 'source_samples', 'origin_samples', 'datasets']) {if (ctx._source.containsKey(prop))  {for (int i = 0; i < ctx._source[prop].length; ++i)   {if (ctx._source[prop][i]['uuid'] == params.modified_entity_uuid)    {ctx._source[prop][i] = params.revised_related_entity} } } }",  "params": {   "modified_entity_uuid": "564167adbbb2fdd64c24e7ea409c23f1",   "revised_related_entity": {"contains_human_genetic_sequences": false, "created_by_user_displayname": "HuBMAP Process", "created_by_user_email": "", "created_timestamp": 1720814265702, "creation_action": "Create Dataset Activity", "data_access_level": "consortium", "dataset_type": "Histology", "description": "H&E slides corresponding to CODEX datasets : ./B004_SB-reg002", "entity_type": "Dataset", "files": [], "group_name": "Stanford TMC", "group_uuid": "def5fd76-ed43-11e8-b56a-0e8017bdda58", "hubmap_id": "HBM458.SXBD.528", "last_modified_timestamp": 1720814409827, "provider_info": "H&E for CODEX : ./B004_SB-reg002", "status": "Submitted", "title": "Histology data from the small intestine of a 78-year-old black or african american male", "uuid": "564167adbbb2fdd64c24e7ea409c23f1"}  } } }.

The request payload from the log line above, formatted for readability:

{
  "script": {
    "lang": "painless",
    "source": "for (prop in ['immediate_descendants', 'descendants', 'immediate_ancestors', 'ancestors', 'source_samples', 'origin_samples', 'datasets']) {if (ctx._source.containsKey(prop))  {for (int i = 0; i < ctx._source[prop].length; ++i)   {if (ctx._source[prop][i]['uuid'] == params.modified_entity_uuid)    {ctx._source[prop][i] = params.revised_related_entity} } } }",
    "params": {
      "modified_entity_uuid": "564167adbbb2fdd64c24e7ea409c23f1",
      "revised_related_entity": {
        "contains_human_genetic_sequences": false,
        "created_by_user_displayname": "HuBMAP Process",
        "created_by_user_email": "",
        "created_timestamp": 1720814265702,
        "creation_action": "Create Dataset Activity",
        "data_access_level": "consortium",
        "dataset_type": "Histology",
        "description": "H&E slides corresponding to CODEX datasets : ./B004_SB-reg002",
        "entity_type": "Dataset",
        "files": [],
        "group_name": "Stanford TMC",
        "group_uuid": "def5fd76-ed43-11e8-b56a-0e8017bdda58",
        "hubmap_id": "HBM458.SXBD.528",
        "last_modified_timestamp": 1720814409827,
        "provider_info": "H&E for CODEX : ./B004_SB-reg002",
        "status": "Submitted",
        "title": "Histology data from the small intestine of a 78-year-old black or african american male",
        "uuid": "564167adbbb2fdd64c24e7ea409c23f1"
      }
    }
  }
}
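
For anyone less used to Painless, the script in the payload does roughly what the Python below does: it walks each related-entity list on the target document and swaps in the revised copy of the modified entity. This is only an illustrative sketch, not code from the repository. The function name `replace_related_entity` and the trimmed sample document are made up for the example.

```python
from copy import deepcopy

# Property names copied from the Painless script in the payload.
RELATED_PROPS = ['immediate_descendants', 'descendants', 'immediate_ancestors',
                 'ancestors', 'source_samples', 'origin_samples', 'datasets']


def replace_related_entity(source, modified_entity_uuid, revised_related_entity):
    """Return a copy of `source` with every matching embedded entity replaced.

    Mirrors the Painless loop: for each related-entity list present in the
    document, swap any element whose 'uuid' equals `modified_entity_uuid`.
    """
    updated = deepcopy(source)
    for prop in RELATED_PROPS:
        if prop not in updated:
            continue
        for i, entity in enumerate(updated[prop]):
            if entity.get('uuid') == modified_entity_uuid:
                updated[prop][i] = revised_related_entity
    return updated


# Hypothetical, heavily trimmed target document (illustration only).
doc = {
    "uuid": "3280ceaaee2c24e262d46a5f83edfe98",
    "descendants": [
        {"uuid": "564167adbbb2fdd64c24e7ea409c23f1", "status": "New"},
        {"uuid": "some-other-dataset", "status": "Published"},
    ],
}
revised = {"uuid": "564167adbbb2fdd64c24e7ea409c23f1", "status": "Submitted"}

result = replace_related_entity(doc, revised["uuid"], revised)
print(result["descendants"][0]["status"])  # Submitted
```

The point to notice is that one modified dataset fans out into a scripted update on every related document, so several datasets under the same donor end up writing to the same donor and sample documents at nearly the same time.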
yuanzhou commented 1 month ago

This issue continued on 7/16/2024. It may be related or helpful for debugging.

yuanzhou commented 1 month ago

Additional errors:

DEBUG:opensearch_helper_functions:Target url:
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib64/python3.9/logging/", line 1086, in emit
    stream.write(msg + self.terminator)
OSError: [Errno 90] Message too long
Call stack:
  File "/usr/lib64/python3.9/", line 930, in _bootstrap
  File "/usr/lib64/python3.9/", line 973, in _bootstrap_inner
  File "/usr/lib64/python3.9/", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/src/app/src/./", line 597, in translate
    self._directly_modify_related_entities( es_url=es_url
  File "/usr/src/app/src/./", line 401, in _directly_modify_related_entities
    opensearch_response = execute_opensearch_query(query_against=f"_update/{related_entity_id}"
  File "/usr/src/app/src/search-adaptor/src/", line 138, in execute_opensearch_query
Unable to print the message and arguments - possible formatting error.
Use the traceback above to help find the error.
DEBUG:urllib3.connectionpool: "POST /hm_prod_consortium_entities/_delete_by_query?q=uuid:38ae6abc06d8f34c59413c958ad89731 HTTP/1.1" 409 599
ERROR:libs.es_writer:Failed to delete doc of uuid: 38ae6abc06d8f34c59413c958ad89731 from index: hm_prod_consortium_entities
ERROR:libs.es_writer:Error Message: {"took":28875,"timed_out":false,"total":1,"deleted":0,"batches":1,"version_conflicts":1,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[{"index":"hm_prod_consortium_entities","type":"_doc","id":"38ae6abc06d8f34c59413c958ad89731","cause":{"type":"version_conflict_engine_exception","reason":"[38ae6abc06d8f34c59413c958ad89731]: version conflict, required seqNo [33766], primary term [1]. but no document was found","index_uuid":"qBHQcRrrR_-75suqjXixvQ","shard":"3","index":"hm_prod_consortium_entities"},"status":409}]}
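
The `_delete_by_query` response body already says why the delete returned 409: `deleted` is 0, `version_conflicts` is 1, and the single entry in `failures` carries the shard and reason. Below is a small sketch of pulling those fields out so the log line stays short. The helper name `summarize_conflicts` is hypothetical, and the sample body is a trimmed copy of the error message above.

```python
import json


def summarize_conflicts(response_text):
    """Extract the conflict-related fields from a _delete_by_query response body."""
    body = json.loads(response_text)
    conflicts = []
    for failure in body.get("failures", []):
        if failure.get("status") != 409:
            continue
        cause = failure.get("cause", {})
        conflicts.append({
            "id": failure.get("id"),
            "shard": cause.get("shard"),
            "reason": cause.get("reason"),
        })
    return {
        "deleted": body.get("deleted", 0),
        "version_conflicts": body.get("version_conflicts", 0),
        "conflicts": conflicts,
    }


# Trimmed copy of the response logged by libs.es_writer.
sample = """
{"total": 1, "deleted": 0, "version_conflicts": 1,
 "failures": [{"index": "hm_prod_consortium_entities",
               "id": "38ae6abc06d8f34c59413c958ad89731",
               "cause": {"type": "version_conflict_engine_exception",
                         "reason": "[38ae6abc06d8f34c59413c958ad89731]: version conflict, required seqNo [33766], primary term [1]. but no document was found",
                         "shard": "3"},
               "status": 409}]}
"""

summary = summarize_conflicts(sample)
print(summary["version_conflicts"])        # 1
print(summary["conflicts"][0]["shard"])    # 3
```

The reason "but no document was found" suggests the document was removed or rewritten by another writer between the search phase and the delete phase of the query. The `_delete_by_query` API also accepts `conflicts=proceed` to continue past version conflicts instead of aborting. Whether that is appropriate for this reindex flow has not been evaluated here.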
yuanzhou commented 1 month ago

The 82 datasets grouped by donor (7 different donors):

│Dataset                           │Donor                             │
yuanzhou commented 1 month ago

7/19/2024: With retry_on_conflict implemented, I was still able to reproduce this 409 issue when reindexing multiple datasets under the same donor e71689fb01e59f5f57cc3ec250ba9609. However, the error rate is noticeably lower now.

ERROR:hubmap_translator:OpenSearch message for 409 code: '
{
    "error": {
        "root_cause": [
            {
                "type": "version_conflict_engine_exception",
                "reason": "[e71689fb01e59f5f57cc3ec250ba9609]: version conflict, required seqNo [71538], primary term [1]. current document has seqNo [71539] and primary term [1]",
                "index_uuid": "qBHQcRrrR_-75suqjXixvQ",
                "shard": "3",
                "index": "hm_prod_consortium_entities"
            }
        ],
        "type": "version_conflict_engine_exception",
        "reason": "[e71689fb01e59f5f57cc3ec250ba9609]: version conflict, required seqNo [71538], primary term [1]. current document has seqNo [71539] and primary term [1]",
        "index_uuid": "qBHQcRrrR_-75suqjXixvQ",
        "shard": "3",
        "index": "hm_prod_consortium_entities"
    },
    "status": 409
}'
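
The "required seqNo [71538] ... current document has seqNo [71539]" message is optimistic concurrency control at work: the update read the donor document at one sequence number, and another writer got in before the write landed. The toy model below shows why `retry_on_conflict` lowers the error rate without removing it. It is an in-memory sketch only, with no OpenSearch involved. `ToyIndex`, `update_with_retry`, and the `competing_writes` knob are all invented for illustration.

```python
class VersionConflict(Exception):
    """Stands in for version_conflict_engine_exception (HTTP 409)."""


class ToyIndex:
    """In-memory stand-in for a single document guarded by a sequence number."""

    def __init__(self, source):
        self.source = dict(source)
        self.seq_no = 0

    def read(self):
        return dict(self.source), self.seq_no

    def write(self, new_source, if_seq_no):
        # Reject the write if the document changed since it was read.
        if if_seq_no != self.seq_no:
            raise VersionConflict(
                f"required seqNo [{if_seq_no}], current document has seqNo [{self.seq_no}]")
        self.source = dict(new_source)
        self.seq_no += 1


def update_with_retry(index, change, retry_on_conflict, competing_writes=0):
    """Read-modify-write with up to `retry_on_conflict` retries.

    `competing_writes` simulates other reindex threads: one of them writes to
    the same document between our read and our write, once per attempt, until
    the budget is used up. Returns the number of retries that were needed.
    """
    remaining_competitors = competing_writes
    for attempt in range(retry_on_conflict + 1):
        source, seq_no = index.read()
        if remaining_competitors > 0:
            other, other_seq = index.read()
            other["touched_by_other"] = other.get("touched_by_other", 0) + 1
            index.write(other, if_seq_no=other_seq)
            remaining_competitors -= 1
        source.update(change)
        try:
            index.write(source, if_seq_no=seq_no)
            return attempt
        except VersionConflict:
            continue
    raise VersionConflict("retries exhausted")


donor = ToyIndex({"uuid": "e71689fb01e59f5f57cc3ec250ba9609"})
print(update_with_retry(donor, {"status": "Submitted"},
                        retry_on_conflict=3, competing_writes=2))  # 2

busy_donor = ToyIndex({"uuid": "e71689fb01e59f5f57cc3ec250ba9609"})
try:
    update_with_retry(busy_donor, {"status": "Submitted"},
                      retry_on_conflict=3, competing_writes=5)
except VersionConflict as exc:
    print("still 409:", exc)  # still 409: retries exhausted
```

In this model a retry budget of 3 absorbs two competing writers but not five. That matches what was observed: fewer 409s, but still reproducible when many datasets under one donor are reindexed together.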
yuanzhou commented 1 month ago

7/20/2024: I did more testing with retry and refresh. Direct updates can still return 409 when many descendants that share samples and donors are updated against ES directly. As a result, I disabled the recent direct update work and went back to the original procedure via The PR also includes a fix in the search-adaptor for the logging "Message too long" error via
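
For context on the logging error: `OSError: [Errno 90] Message too long` was raised while the logger wrote a record containing the full update payload. One common way to avoid it is to cap the size of what gets logged. The sketch below shows that idea only and is not the actual search-adaptor fix. `truncate_for_log` and the 2000-character limit are assumptions chosen for the example.

```python
import logging

MAX_LOG_CHARS = 2000  # assumed limit; choose one the log transport accepts


def truncate_for_log(message, max_chars=MAX_LOG_CHARS):
    """Keep at most `max_chars` characters of `message`, plus a short suffix
    stating how many characters were dropped."""
    text = str(message)
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return f"{text[:max_chars]}... [{omitted} more characters truncated]"


logger = logging.getLogger("example")

# A payload far larger than the limit, similar in spirit to the update body.
payload = '{"script": "' + "x" * 5000 + '"}'
logger.error("Update failed with qdsl_update_payload_string=%s",
             truncate_for_log(payload))
```

Logging the related entity's uuid and the status code, and truncating the payload, keeps the record small while still identifying which update failed.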

FYI @kburke, unfortunately this is a trial-and-error process. On the flip side, we now have a much better understanding of the reindex procedure and the limitations of Elasticsearch's optimistic concurrency control.