stoneatom / stonedb

StoneDB is an Open-Source MySQL HTAP and MySQL-Native DataBase for OLTP, Real-Time Analytics, a counterpart of MySQLHeatWave. (https://stonedb.io)
https://stonedb.io/
GNU General Public License v2.0
861 stars 140 forks source link

bug: Error result set of the IN subquery with semi join #1764

Closed adofsauron closed 1 year ago

adofsauron commented 1 year ago

Have you read the Contributing Guidelines on issues?

Please confirm if bug report does NOT exists already ?

Describe the problem

No response

Expected behavior

No response

How To Reproduce

No response

Environment

No response

Are you interested in submitting a PR to solve the problem?

adofsauron commented 1 year ago

"query_block": { "select_id": 1, "cost_info": { "query_cost": "337.15" }, "nested_loop": [ { "table": { "table_name": "a", "access_type": "ALL", "rows_examined_per_scan": 1016, "rows_produced_per_join": 20, "filtered": "2.00", "pushed_condition": "((mbs.a.DELETED_FLAG = '0') and (mbs.a.ACCT_FLAG in ('2','4')))(t0) Pckrows: 1, susp. 1 (0 empty 0 full). Conditions: 2", "cost_info": { "read_cost": "259.94", "eval_cost": "4.06", "prefix_cost": "264.00", "data_read_per_join": "314K" }, "used_columns": [ "COMPANY_ID", "BANK_ID", "BANK_NAME", "HBANK_CODE", "HBANK_NAME", "ACCT_FLAG", "DELETED_FLAG" ] } }, { "table": { "table_name": "t1", "access_type": "ALL", "possible_keys": [ "PRIMARY", "IDX_MDCOMPANY1" ], "rows_examined_per_scan": 9, "rows_produced_per_join": 20, "filtered": "11.11", "using_join_buffer": "Block Nested Loop", "cost_info": { "read_cost": "20.32", "eval_cost": "4.06", "prefix_cost": "288.38", "data_read_per_join": "155K" }, "used_columns": [ "ROW_ID", "TENANT_ID" ], "attached_condition": "(mbs.t1.ROW_ID = mbs.a.COMPANY_ID)" } }, { "table": { "table_name": "t2", "access_type": "ALL", "possible_keys": [ "IDX_MDCOMPANY1" ], "rows_examined_per_scan": 9, "rows_produced_per_join": 20, "filtered": "22.22", "first_match": "t1", "using_join_buffer": "Block Nested Loop", "cost_info": { "read_cost": "40.64", "eval_cost": "4.06", "prefix_cost": "337.15", "data_read_per_join": "155K" }, "used_columns": [ "TENANT_ID" ], "attached_condition": "(mbs.t2.TENANT_ID = mbs.t1.TENANT_ID)" } } ] } } |

adofsauron commented 1 year ago

"nested_loop": [ { "table": { "table_name": "t1", "access_type": "index", "possible_keys": [ "PRIMARY", "IDX_MDCOMPANY1" ], "key": "IDX_MDCOMPANY1", "used_key_parts": [ "TENANT_ID", "COMPANY_NAME" ], "key_length": "1210", "rows_examined_per_scan": 9, "rows_produced_per_join": 9, "filtered": "100.00", "using_index": true, "cost_info": { "read_cost": "1.00", "eval_cost": "1.80", "prefix_cost": "2.80", "data_read_per_join": "68K" }, "used_columns": [ "ROW_ID", "TENANT_ID" ] } },

adofsauron commented 1 year ago

EXISTS does not care about the selected expressions, only about the existence of rows.

If UNION, we have to modify the SELECT list of each SELECT in the UNION, fortunately this function is indeed called for each SELECT_LEX.

If this is a prepared statement, we must allow the next execution to use materialization. So, we should back up the original SELECT list. If this is a UNION, this means backing up the N original SELECT lists. To avoid this constraint, we change the SELECT list only if this is not a prepared statement.

adofsauron commented 1 year ago

$6 = {

= {}, = { m_qs = 0x7fd25ca92178 }, members of QEP_TAB: table_ref = 0x7fd25c180b68, flush_weedout_table = 0x0, check_weed_out_table = 0x0, firstmatch_return = -2 '\376', loosescan_key_len = 0, loosescan_buf = 0x0, match_tab = -2 '\376', found_match = false, found = false, not_null_compl = false, first_unmatched = -2 '\376', materialized = false, materialize_table = 0x0, read_first_record = 0x23de24f , next_select = 0x23df384 , read_record = { table = 0x7fd25c184360, forms = 0x0, unlock_row = 0x23425ef , read_record = 0x0, thd = 0x0, quick = 0x0, cache_records = 0, ref_length = 0, struct_length = 0, reclength = 0, rec_cache_size = 0, error_offset = 0, unpack_counter = 0, ref_pos = 0x0, record = 0x0, rec_buf = 0x0, cache = 0x0, cache_pos = 0x0, cache_end = 0x0, read_positions = 0x0, io_cache = 0x0, print_error = false, --Type for more, q to quit, c to continue without paging-- ignore_not_found_rows = false }, save_read_first_record = 0x0, save_read_record = 0x0, used_null_fields = false, used_uneven_bit_fields = false, keep_current_rowid = false, copy_current_rowid = 0x0, distinct = false, not_used_in_distinct = false, cache_idx_cond = 0x0, having = 0x0, op = 0x0, tmp_table_param = 0x0, filesort = 0x0, fields = 0x7fd25c1763b0, all_fields = 0x7fd25c1763d8, ref_array = 0x0, send_records = 0, quick_traced_before = false, m_condition_optim = 0x7fd25ca92698, m_quick_optim = 0x0, m_keyread_optim = false }
adofsauron commented 1 year ago

semi join is wrong. Shut this stupid thing down

adofsauron commented 1 year ago

Although the semi join results are all wrong now, the bug list can not be closed, we must track when it is completely solved

adofsauron commented 1 year ago
Optimize each of the semi-join nests that can be run with
materialization. For each of the nests, we
 - Generate the best join order for this "sub-join" and remember it;
 - Remember the sub-join execution cost (it's part of materialization
   cost);
 - Calculate other costs that will be incurred if we decide
   to use materialization strategy for this semi-join nest.
adofsauron commented 1 year ago

/// Optimal join order calculated for inner tables of this semijoin op. POSITION *positions{nullptr}; /// True if data types allow the MaterializeLookup semijoin strategy bool lookup_allowed{false}; /// True if data types allow the MaterializeScan semijoin strategy bool scan_allowed{false}; /// Expected number of rows in the materialized table double expected_rowcount{0.0}; /// Materialization cost - execute sub-join and write rows to temp.table Cost_estimate materialization_cost; /// Cost to make one lookup in the temptable Cost_estimate lookup_cost; /// Cost of scanning the materialized table Cost_estimate scan_cost; /// Array of pointers to fields in the

adofsauron commented 1 year ago

| -> Hash semijoin (d.d1 = c.c1) (cost=1.27 rows=3) -> Table scan on c (cost=0.55 rows=3) -> Hash -> Table scan on d (cost=0.12 rows=3)

adofsauron commented 1 year ago
          "from": "IN (SELECT)",
          "to": "semijoin",
          "chosen": true,
          "transformation_to_semi_join": {
            "subquery_predicate": "`c`.`c1` in (/* select#2 */ select `d`.`d1` from `d` where (`c`.`c1` = `d`.`d1`))",
            "embedded in": "WHERE",
            "evaluating_constant_semijoin_conditions": [
            ],
            "semi-join condition": "((`c`.`c1` = `d`.`d1`) and (`c`.`c1` = `d`.`d1`))",
            "decorrelated_predicates": [
              {
                "outer": "`c`.`c1`",
                "inner": "`d`.`d1`"
              },
              {
                "outer": "`c`.`c1`",
                "inner": "`d`.`d1`"
              }
adofsauron commented 1 year ago

opt: strategy: "DuplicatesWeedout"

adofsauron commented 1 year ago
= { = { = { = { > = { _vptr.Parse_tree_node_tmpl = 0x850c7b8 , contextualized = true }, members of Item: next_free = 0x7fae54b35078, str_value = { m_ptr = 0x0, m_length = 0, m_charset = 0x8634840 , m_alloced_length = 0, m_is_alloced = false }, collation = { collation = 0x863d940 , derivation = DERIVATION_NUMERIC, repertoire = 1 }, item_name = { = { = { m_str = 0x0, m_length = 0 }, }, members of Item_name_string: m_is_autogenerated = true }, orig_name = { = { = { m_str = 0x0, m_length = 0 }, }, members of Item_name_string: m_is_autogenerated = true }, max_length = 1, marker = Item::MARKER_NONE, cmp_context = INVALID_RESULT, --Type for more, q to quit, c to continue without paging-- m_ref_count = 0, is_parser_item = true, is_expensive_cache = -1 '\377', m_data_type = 8 '\b', fixed = true, decimals = 0 '\000', m_nullable = true, null_value = false, unsigned_flag = false, m_is_window_function = false, hidden = false, m_in_check_constraint_exec_ctx = false, static PROP_SUBQUERY = 1 '\001', static PROP_STORED_PROGRAM = 2 '\002', static PROP_AGGREGATION = 4 '\004', static PROP_WINDOW_FUNCTION = 8 '\b', static PROP_ROLLUP_EXPR = 16 '\020', static PROP_GROUPING_FUNC = 32 ' ', m_accum_properties = 1 '\001' }, members of Item_result_field: result_field = 0x0 }, members of Item_subselect: value_assigned = false, traced_before = false, substitution = 0x0, unit = 0x7fae54b3fc68, in_cond_of_tab = -2, subquery = Python Exception Unsupported implementation for unique_ptr: std::__uniq_ptr_data, true, true>: { _M_t = { >> = { _M_t = std::tuple containing = { [1] = 0x7fae54b1fa30, [2] = {} } }, } }, indexsubquery_engine = 0x0, used_tables_cache = 2305843009213693952, max_columns = 1, parsing_place = CTX_WHERE, have_to_be_excluded = false, changed = true --Type for more, q to quit, c to continue without paging-- }, members of Item_exists_subselect: value = false, sj_convert_priority = 131072, strategy = Subquery_strategy::SEMIJOIN, outer_condition_context = enum_condition_context::ANDS, embedding_join_nest = 0x0, value_transform = Item::BOOL_IS_TRUE, implicit_is_op = true, can_do_aj = false }, members of Item_in_subselect: left_expr = 0x7fae54b3fb20, left_expr_cache = 0x0, left_expr_cache_filled = false, need_expr_cache = true, m_injected_left_expr = 0x0, optimizer = 0x0, was_null = false, abort_on_null = true, in2exists_info = 0x0, pushed_cond_guards = 0x0, upper_item = 0x0, pt_subselect = 0x7fae54b34ee0 }
adofsauron commented 1 year ago

emb_join_list->push_front(sj_nest);