Elasticsearch connector, on select from wildcard indices, assumes the first mapping for all matching indices

MusKaya commented 3 years ago

Background Elasticsearch allows us to do wildcard searches, such as: curl 'localhost:9200/stest*/_search?pretty' gets us results from all indices that match stest* pattern.

Problem Statement Trino does also allow such searches however, the ES connectors just assumes the same mapping for all matching indices. For instance, if I have the following indices:

curl 'localhost:9200/stest-1/_search?pretty'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "stest-1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "ali",
          "eye" : "green",
          "sports" : {
            "soccer" : "veryGood",
            "basketball" : "fair"
          }
        }
      }
    ]
  }
}

and

curl 'localhost:9200/stest-2/_search?pretty'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "stest-2",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "veli",
          "hair" : "brown",
          "sports" : {
            "soccer" : "veryGood",
            "tennis" : "fair"
          }
        }
      }
    ]
  }
}

then I get the combined result from this search:

curl 'localhost:9200/stest-*/_search?pretty'
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "stest-1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "ali",
          "eye" : "green",
          "sports" : {
            "soccer" : "veryGood",
            "basketball" : "fair"
          }
        }
      },
      {
        "_index" : "stest-2",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "veli",
          "hair" : "brown",
          "sports" : {
            "soccer" : "veryGood",
            "tennis" : "fair"
          }
        }
      }
    ]
  }
}

On the other hand, Trino gives me a partial result:

My tables/indices:

trino:indices> show tables LIKE 'stest%';
  Table
---------
 stest-1
 stest-2
(2 rows)

Query 20210322_192023_00037_9ybgs, FINISHED, 1 node
http://localhost:8080/ui/query.html?20210322_192023_00037_9ybgs
Splits: 19 total, 19 done (100.00%)
CPU Time: 0.0s total, 1.06K rows/s,   43KB/s, 34% active
Per Node: 0.0 parallelism,    35 rows/s, 1.44KB/s
Parallelism: 0.0
Peak Memory: 0B
0.48 [17 rows, 705B] [35 rows/s, 1.44KB/s]

stest-1 contents

trino:indices> select * from "stest-1";
  eye  | name |               sports
-------+------+------------------------------------
 green | ali  | {basketball=fair, soccer=veryGood}
(1 row)

Query 20210322_192044_00038_9ybgs, FINISHED, 1 node
http://localhost:8080/ui/query.html?20210322_192044_00038_9ybgs
Splits: 17 total, 17 done (100.00%)
CPU Time: 0.0s total,   200 rows/s, 15.4KB/s, 20% active
Per Node: 0.0 parallelism,     3 rows/s,   259B/s
Parallelism: 0.0
Peak Memory: 0B
0.31 [1 rows, 79B] [3 rows/s, 259B/s]

stest-2 contents

trino:indices> select * from "stest-2";
 hair  | name |             sports
-------+------+--------------------------------
 brown | veli | {soccer=veryGood, tennis=fair}
(1 row)

Query 20210322_192046_00039_9ybgs, FINISHED, 1 node
http://localhost:8080/ui/query.html?20210322_192046_00039_9ybgs
Splits: 17 total, 17 done (100.00%)
CPU Time: 0.0s total,   200 rows/s,   15KB/s, 20% active
Per Node: 0.0 parallelism,     3 rows/s,   250B/s
Parallelism: 0.0
Peak Memory: 0B
0.31 [1 rows, 77B] [3 rows/s, 250B/s]

combined result

trino:indices> select * from "stest-*";
  eye  | name |               sports
-------+------+------------------------------------
 NULL  | veli | {basketball=NULL, soccer=veryGood}
 green | ali  | {basketball=fair, soccer=veryGood}
(2 rows)

Query 20210322_192049_00040_9ybgs, FINISHED, 1 node
http://localhost:8080/ui/query.html?20210322_192049_00040_9ybgs
Splits: 18 total, 18 done (100.00%)
CPU Time: 0.0s total,   222 rows/s, 15.3KB/s, 21% active
Per Node: 0.0 parallelism,     6 rows/s,   446B/s
Parallelism: 0.0
Peak Memory: 0B
0.32 [2 rows, 141B] [6 rows/s, 446B/s]

Suggestion We need to create a superset of field mapping of all the matching indices and execute a select statement with that superset. In this example, the result should look as follows (also note how I'm manually building a union query that covers the entire superset of fields by assigning NULL to any field/subfield that is not present in the related index).

trino:indices> with Q as (
            -> select eye, NULL as hair, name,
            -> cast(row(sports.basketball, sports.soccer, NULL) as row (basketball varchar, soccer varchar, tennis varchar))
            -> as sports from "stest-1"
            -> UNION ALL
            -> select NULL as eye, hair, name,
            -> cast(row(NULL, sports.soccer, sports.tennis) as row (basketball varchar, soccer varchar, tennis varchar))
            -> as sports from "stest-2"
            -> ) select * from Q;
  eye  | hair  | name |                     sports
-------+-------+------+-------------------------------------------------
 NULL  | brown | veli | {basketball=NULL, soccer=veryGood, tennis=fair}
 green | NULL  | ali  | {basketball=fair, soccer=veryGood, tennis=NULL}
(2 rows)

Query 20210322_194554_00048_9ybgs, FINISHED, 1 node
http://localhost:8080/ui/query.html?20210322_194554_00048_9ybgs
Splits: 18 total, 18 done (100.00%)
CPU Time: 0.0s total,   250 rows/s,   19KB/s, 20% active
Per Node: 0.0 parallelism,     5 rows/s,   443B/s
Parallelism: 0.0
Peak Memory: 0B
0.35 [2 rows, 156B] [5 rows/s, 443B/s]

MusKaya commented 3 years ago

@bitsondatadev fyi

sumannewton commented 3 years ago

@bitsondatadev @findepi @martint Is this valid? Do we support regex based tables currently? Is this something we should start supporting because elasticsearch supports?

bitsondatadev commented 3 years ago

@bitsondatadev @findepi @martint Is this valid? Do we support regex based tables currently? Is this something we should start supporting because elasticsearch supports?

I remember at one point I had to define an alias to get the wildcard behavior but that may have been on Presto (before the Presto/Trino split) and there's been a lot of changes on the Trino side since. I wonder if @MusKaya's ability to do wildcard comes from passthrough queries or if we actually support wildcard semantics.

It might be fine to do it this way, provided the passthrough queries still use the caching and merging of indices you're adding in #8202. The only questions in my mind is how is that referenced in cache if we currently use the alias name. We would need to figure that part out. It could just be some sort of wildcard signature.

Could you also write a test at the bare minimum to see if the wildcard semantics work with your changes via passthrough queries?

trinodb / trino

Elasticsearch connector, on select from wildcard indices, assumes the first mapping for all matching indices #7381