manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
9.05k stars 509 forks source link

count(distinct ...) against multiple tables with different field order returns wrong value #2076

Open sanikolaev opened 7 months ago

sanikolaev commented 7 months ago

MRE:

drop table if exists a; drop table if exists b;

CREATE TABLE a(g int, g2 int);
CREATE TABLE b(g int, g2 int);

INSERT INTO a(g,g2) VALUES (1,1),(2,1);
INSERT INTO b(g,g2) VALUES (1,1),(2,1);

select count(distinct g) from a,b;

# Returns:
# +-------------------+
# | count(distinct g) |
# +-------------------+
# |                 2 |
# +-------------------+

# Let's now recreate the table b changing the fields order
drop table if exists b;
CREATE TABLE b(g2 int, g int);
INSERT INTO b(g,g2) VALUES (1,1),(2,1);

select count(distinct g) from a,b;

# Returns:
# +-------------------+
# | count(distinct g) |
# +-------------------+
# |                 3 |
# +-------------------+

Version:

6.2.13 7ecf541ab@24041615 dev (columnar 2.2.5 b4f7386@240405) (secondary 2.2.5 b4f7386@240405) (knn 2.2.5 b4f7386@240405)

on dev2


Checklist

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

- [x] Task estimated - [x] Bug reproduced - [x] Specification created, reviewed and approved - [ ] Implementation completed - [ ] Tests developed - [x] Documentation updated - [x] Documentation proofread - [ ] Changelog updated - [x] OpenAPI YAML updated and issue created to rebuild clients
tomatolog commented 7 months ago

seems it designed this way and mentioned at the documentation COUNT(DISTINCT-field)

COUNT(DISTINCT) against a distributed table or a real-time table consisting of multiple disk chunks may return inaccurate results, but the result should be accurate for a distributed table consisting of local plain or real-time tables with the same schema

COUNT(DISTINCT) groupers has two modes:

I am sure there is no easy fix to perform merge of the multiple groupers/sorters with different schema.

I tried the new sorter implementation for the only select count(distinct g) from idx1, idx2 that might work if grouper/sorter has not keep any attributes from the index match however it has multiple constraints:

and this seems not practical.

Maybe the better option is to extract internal distinct values structure into result set that could be reused on result sets merge or transfered from agents into master and used at the master to merge result sets from agents and still get accurate COUNT(DISTINCT) of that merged result set. However that also seems a large change.