opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[WIP] Improve serialization for TaskResourceInfo #16700

Open ansjcy opened 16 hours ago

ansjcy commented 16 hours ago

Description

Use binary serialization to avoid JSON parsing overhead when piggybacking task resource usage info from the data nodes to the coordinator node.
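To illustrate the idea only, here is a minimal Python sketch that compares a fixed-layout binary encoding of one task resource info record with its JSON form. The byte layout, the helper names (`encode_binary`, `decode_binary`), and the use of Python's `struct` module are assumptions made for illustration; this is not the wire format or the code used in this PR. The sample values are taken from the query insights output in this description.

```python
import json
import struct

# Hypothetical layout (not OpenSearch's wire format): four big-endian int64
# fields (taskId, parentTaskId, cpu nanos, memory bytes), followed by two
# length-prefixed UTF-8 strings (action, nodeId).
_HEADER = struct.Struct(">qqqq")


def _write_string(value):
    data = value.encode("utf-8")
    return struct.pack(">H", len(data)) + data


def _read_string(buf, offset):
    (length,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    end = start + length
    return buf[start:end].decode("utf-8"), end


def encode_binary(info):
    usage = info["taskResourceUsage"]
    header = _HEADER.pack(
        info["taskId"],
        info["parentTaskId"],
        usage["cpu_time_in_nanos"],
        usage["memory_in_bytes"],
    )
    return header + _write_string(info["action"]) + _write_string(info["nodeId"])


def decode_binary(buf):
    task_id, parent_id, cpu, memory = _HEADER.unpack_from(buf, 0)
    action, offset = _read_string(buf, _HEADER.size)
    node_id, _ = _read_string(buf, offset)
    return {
        "action": action,
        "taskId": task_id,
        "parentTaskId": parent_id,
        "nodeId": node_id,
        "taskResourceUsage": {
            "cpu_time_in_nanos": cpu,
            "memory_in_bytes": memory,
        },
    }


sample = {
    "action": "indices:data/read/search[phase/query]",
    "taskId": 241,
    "parentTaskId": 146,
    "nodeId": "dfYZt3ZRQXadtHY4XPRoMg",
    "taskResourceUsage": {
        "cpu_time_in_nanos": 7437000,
        "memory_in_bytes": 807984,
    },
}

binary = encode_binary(sample)
text = json.dumps(sample).encode("utf-8")

# The binary form round-trips without any text parsing on the read side.
assert decode_binary(binary) == sample
print("binary bytes:", len(binary), "json bytes:", len(text))
```

The reader unpacks fields at known offsets instead of tokenizing text, which is the kind of overhead the description refers to. The actual saving in OpenSearch depends on its own stream format and should be measured by profiling, as in the tests below.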

Related Issues

Resolves https://github.com/opensearch-project/OpenSearch/issues/16635

Tests

I ran the big5 benchmark tests on a cluster with 3 master nodes (c5.xlarge) and 2 data nodes (r5.4xlarge) and did CPU profiling for term queries, as described in https://github.com/opensearch-project/OpenSearch/issues/16635. The parsing overhead is less than 1% in my tests.

I also validated that the query insights functionality is not impacted:

curl -X GET "localhost:9200/_insights/top_queries?pretty"
{
  "top_queries" : [
    {
      "timestamp" : 1732222066991,
      "total_shards" : 2,
      "indices" : [
        "my-index-*"
      ],
      "node_id" : "MhvRcvgYSH2-AAThxmjosQ",
      "source" : {
        "size" : 1000
      },
      "task_resource_usages" : [
        {
          "action" : "indices:data/read/search[phase/query]",
          "taskId" : 241,
          "parentTaskId" : 146,
          "nodeId" : "dfYZt3ZRQXadtHY4XPRoMg",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 7437000,
            "memory_in_bytes" : 807984
          }
        },
        {
          "action" : "indices:data/read/search[phase/query]",
          "taskId" : 127,
          "parentTaskId" : 146,
          "nodeId" : "Hek0j1IZQ4qfNsw6ftlbTQ",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 8863000,
            "memory_in_bytes" : 934232
          }
        },
        {
          "action" : "indices:data/read/search[phase/fetch/id]",
          "taskId" : 242,
          "parentTaskId" : 146,
          "nodeId" : "dfYZt3ZRQXadtHY4XPRoMg",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 5919000,
            "memory_in_bytes" : 852568
          }
        },
        {
          "action" : "indices:data/read/search[phase/fetch/id]",
          "taskId" : 128,
          "parentTaskId" : 146,
          "nodeId" : "Hek0j1IZQ4qfNsw6ftlbTQ",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 5759000,
            "memory_in_bytes" : 867528
          }
        },
        {
          "action" : "indices:data/read/search",
          "taskId" : 146,
          "parentTaskId" : -1,
          "nodeId" : "MhvRcvgYSH2-AAThxmjosQ",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 2200000,
            "memory_in_bytes" : 270560
          }
        }
      ],
      "search_type" : "query_then_fetch",
      "phase_latency_map" : {
        "expand" : 0,
        "query" : 50,
        "fetch" : 19
      },
      "labels" : {
        "X-Opaque-Id" : "cyji-id"
      },
      "measurements" : {
        "latency" : {
          "number" : 84,
          "count" : 1,
          "aggregationType" : "NONE"
        },
        "cpu" : {
          "number" : 30178000,
          "count" : 1,
          "aggregationType" : "NONE"
        },
        "memory" : {
          "number" : 3732872,
          "count" : 1,
          "aggregationType" : "NONE"
        }
      }
    },
    {
      "timestamp" : 1732222067165,
      "total_shards" : 1,
      "indices" : [
        "my-index-0"
      ],
      "node_id" : "MhvRcvgYSH2-AAThxmjosQ",
      "source" : {
        "size" : 20,
        "query" : {
          "bool" : {
            "must" : [
              {
                "match_phrase" : {
                  "message" : {
                    "query" : "document",
                    "slop" : 0,
                    "zero_terms_query" : "NONE",
                    "boost" : 1.0
                  }
                }
              },
              {
                "match" : {
                  "user.id" : {
                    "query" : "cyji",
                    "operator" : "OR",
                    "prefix_length" : 0,
                    "max_expansions" : 50,
                    "fuzzy_transpositions" : true,
                    "lenient" : false,
                    "zero_terms_query" : "NONE",
                    "auto_generate_synonyms_phrase_query" : true,
                    "boost" : 1.0
                  }
                }
              }
            ],
            "adjust_pure_negative" : true,
            "boost" : 1.0
          }
        }
      },
      "task_resource_usages" : [
        {
          "action" : "indices:data/read/search[phase/query]",
          "taskId" : 252,
          "parentTaskId" : 149,
          "nodeId" : "dfYZt3ZRQXadtHY4XPRoMg",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 4085000,
            "memory_in_bytes" : 483992
          }
        },
        {
          "action" : "indices:data/read/search",
          "taskId" : 149,
          "parentTaskId" : -1,
          "nodeId" : "MhvRcvgYSH2-AAThxmjosQ",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 656000,
            "memory_in_bytes" : 70880
          }
        }
      ],
      "search_type" : "query_then_fetch",
      "phase_latency_map" : {
        "expand" : 0,
        "query" : 16,
        "fetch" : 0
      },
      "labels" : { },
      "measurements" : {
        "latency" : {
          "number" : 17,
          "count" : 1,
          "aggregationType" : "NONE"
        },
        "cpu" : {
          "number" : 4741000,
          "count" : 1,
          "aggregationType" : "NONE"
        },
        "memory" : {
          "number" : 554872,
          "count" : 1,
          "aggregationType" : "NONE"
        }
      }
    },
    {
      "timestamp" : 1732222067129,
      "total_shards" : 1,
      "indices" : [
        "my-index-0"
      ],
      "node_id" : "MhvRcvgYSH2-AAThxmjosQ",
      "source" : {
        "size" : 20,
        "query" : {
          "term" : {
            "user.id" : {
              "value" : "cyji",
              "boost" : 1.0
            }
          }
        }
      },
      "task_resource_usages" : [
        {
          "action" : "indices:data/read/search[phase/query]",
          "taskId" : 250,
          "parentTaskId" : 148,
          "nodeId" : "dfYZt3ZRQXadtHY4XPRoMg",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 3152000,
            "memory_in_bytes" : 278544
          }
        },
        {
          "action" : "indices:data/read/search",
          "taskId" : 148,
          "parentTaskId" : -1,
          "nodeId" : "MhvRcvgYSH2-AAThxmjosQ",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 599000,
            "memory_in_bytes" : 61736
          }
        }
      ],
      "search_type" : "query_then_fetch",
      "phase_latency_map" : {
        "expand" : 0,
        "query" : 15,
        "fetch" : 0
      },
      "labels" : { },
      "measurements" : {
        "latency" : {
          "number" : 17,
          "count" : 1,
          "aggregationType" : "NONE"
        },
        "cpu" : {
          "number" : 3751000,
          "count" : 1,
          "aggregationType" : "NONE"
        },
        "memory" : {
          "number" : 340280,
          "count" : 1,
          "aggregationType" : "NONE"
        }
      }
    },
    {
      "timestamp" : 1732222067088,
      "indices" : [
        "my-index-*"
      ],
      "total_shards" : 2,
      "node_id" : "Hek0j1IZQ4qfNsw6ftlbTQ",
      "source" : {
        "size" : 1000
      },
      "task_resource_usages" : [
        {
          "action" : "indices:data/read/search[phase/query]",
          "taskId" : 134,
          "parentTaskId" : 133,
          "nodeId" : "Hek0j1IZQ4qfNsw6ftlbTQ",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 304000,
            "memory_in_bytes" : 8400
          }
        },
        {
          "action" : "indices:data/read/search[phase/query]",
          "taskId" : 248,
          "parentTaskId" : 133,
          "nodeId" : "dfYZt3ZRQXadtHY4XPRoMg",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 309000,
            "memory_in_bytes" : 8400
          }
        },
        {
          "action" : "indices:data/read/search[phase/fetch/id]",
          "taskId" : 136,
          "parentTaskId" : 133,
          "nodeId" : "Hek0j1IZQ4qfNsw6ftlbTQ",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 354000,
            "memory_in_bytes" : 21088
          }
        },
        {
          "action" : "indices:data/read/search[phase/fetch/id]",
          "taskId" : 249,
          "parentTaskId" : 133,
          "nodeId" : "dfYZt3ZRQXadtHY4XPRoMg",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 359000,
            "memory_in_bytes" : 21088
          }
        },
        {
          "action" : "indices:data/read/search",
          "taskId" : 133,
          "parentTaskId" : -1,
          "nodeId" : "Hek0j1IZQ4qfNsw6ftlbTQ",
          "taskResourceUsage" : {
            "cpu_time_in_nanos" : 1484000,
            "memory_in_bytes" : 193568
          }
        }
      ],
      "search_type" : "query_then_fetch",
      "phase_latency_map" : {
        "expand" : 0,
        "query" : 3,
        "fetch" : 3
      },
      "labels" : { },
      "measurements" : {
        "latency" : {
          "number" : 15,
          "count" : 1,
          "aggregationType" : "NONE"
        },
        "cpu" : {
          "number" : 2810000,
          "count" : 1,
          "aggregationType" : "NONE"
        },
        "memory" : {
          "number" : 252544,
          "count" : 1,
          "aggregationType" : "NONE"
        }
      }
    },
...
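In the sample output above, each record's `measurements.cpu.number` and `measurements.memory.number` equal the sums of `cpu_time_in_nanos` and `memory_in_bytes` over its `task_resource_usages`. Assuming that relationship is intended (it is inferred from the sample, not stated in this PR), a small Python sketch like the following could be used to sanity-check a record after the serialization change. The helper names `summed_usage` and `is_consistent` are hypothetical, and the record is a trimmed copy of the second entry in the output.

```python
def summed_usage(record):
    """Sum CPU and memory over all task-level usages in one top_queries record."""
    tasks = record["task_resource_usages"]
    cpu = sum(t["taskResourceUsage"]["cpu_time_in_nanos"] for t in tasks)
    memory = sum(t["taskResourceUsage"]["memory_in_bytes"] for t in tasks)
    return cpu, memory


def is_consistent(record):
    """Check that the record-level measurements match the per-task sums."""
    cpu, memory = summed_usage(record)
    measurements = record["measurements"]
    return (
        cpu == measurements["cpu"]["number"]
        and memory == measurements["memory"]["number"]
    )


# Trimmed copy of the second record from the output (only the fields used here).
record = {
    "task_resource_usages": [
        {
            "taskId": 252,
            "taskResourceUsage": {
                "cpu_time_in_nanos": 4085000,
                "memory_in_bytes": 483992,
            },
        },
        {
            "taskId": 149,
            "taskResourceUsage": {
                "cpu_time_in_nanos": 656000,
                "memory_in_bytes": 70880,
            },
        },
    ],
    "measurements": {
        "cpu": {"number": 4741000},
        "memory": {"number": 554872},
    },
}

print(is_consistent(record))
```

A mismatch would suggest that a task's usage was dropped or corrupted while being passed from a data node to the coordinator node.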

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions[bot] commented 13 hours ago

:x: Gradle check result for 8fc3f92f745a6fd64596ada017a3ec236683d1c5: ABORTED

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?