count(distinct ...) against multiple tables with different field order returns wrong value

MRE:

drop table if exists a; drop table if exists b;

CREATE TABLE a(g int, g2 int);
CREATE TABLE b(g int, g2 int);

INSERT INTO a(g,g2) VALUES (1,1),(2,1);
INSERT INTO b(g,g2) VALUES (1,1),(2,1);

select count(distinct g) from a,b;

# Returns:
# +-------------------+
# | count(distinct g) |
# +-------------------+
# |                 2 |
# +-------------------+

# Let's now recreate the table b changing the fields order
drop table if exists b;
CREATE TABLE b(g2 int, g int);
INSERT INTO b(g,g2) VALUES (1,1),(2,1);

select count(distinct g) from a,b;

# Returns:
# +-------------------+
# | count(distinct g) |
# +-------------------+
# |                 3 |
# +-------------------+

Version:

6.2.13 7ecf541ab@24041615 dev (columnar 2.2.5 b4f7386@240405) (secondary 2.2.5 b4f7386@240405) (knn 2.2.5 b4f7386@240405)

on dev2

Checklist

^{To be completed by the assignee. Check off tasks that have been completed or are not applicable.}

- [x] Task estimated - [x] Bug reproduced - [x] Specification created, reviewed and approved - [ ] Implementation completed - [ ] Tests developed - [x] Documentation updated - [x] Documentation proofread - [ ] Changelog updated - [x] OpenAPI YAML updated and issue created to rebuild clients

seems it designed this way and mentioned at the documentation COUNT(DISTINCT-field)

COUNT(DISTINCT) against a distributed table or a real-time table consisting of multiple disk chunks may return inaccurate results, but the result should be accurate for a distributed table consisting of local plain or real-time tables with the same schema

COUNT(DISTINCT) groupers has two modes:

collect matches inside sorters and merge matches from all sorters into single one - this allow to keep distinct internal structure and count distinct values correct. However this works only if all matches in all sorters has the same schema.
collect matches then finalize them from the sorters into result sets. This operation calculates aggregates and finalize count distinct to single counter per group. After that all these result sets could be processed to fit into common schema and transforms into that final schema after these result sets could be merged together however information of distinct values in the each group is lost - result set has only counter.

I am sure there is no easy fix to perform merge of the multiple groupers/sorters with different schema.

I tried the new sorter implementation for the only select count(distinct g) from idx1, idx2 that might work if grouper/sorter has not keep any attributes from the index match however it has multiple constraints:

no filter
no sort
no group by
no json or string attribute used for count(distinct )

and this seems not practical.

Maybe the better option is to extract internal distinct values structure into result set that could be reused on result sets merge or transfered from agents into master and used at the master to merge result sets from agents and still get accurate COUNT(DISTINCT) of that merged result set. However that also seems a large change.

manticoresoftware / manticoresearch

count(distinct ...) against multiple tables with different field order returns wrong value #2076

Checklist