manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
9.05k stars 507 forks source link

count with groupby and without groupby #1082

Open Lot-Art opened 1 year ago

Lot-Art commented 1 year ago

Describe the bug I think the screenshot explains it better

Screenshot_20230320_004035

To Reproduce

  1. Have an index with ~1,000,000 documents, and a text attribute
  2. run a SELECT count(*) ... WHERE attrib='value' and a 'SELECT attrib, count(*) ... WHERE id>0 GROUP BY attrib`
  3. comapre the counts, they will be different

Expected behavior Counts should be the same same

Describe the environment:

Lot-Art commented 1 year ago

In case it has something to do with cutoff, here is a strange behaviour. See how increasing the cutoff from 1,000 to 10,000 lowers the number of results.

cutoff

sanikolaev commented 1 year ago

Does option accurate_aggregation=1 help?

Lot-Art commented 1 year ago

No it behaves the same. It feels like without MATCH() the result is nondeterministic?

I will do a workaround in the app: I will insert a gibberish word at the end of each document, and when the user searches without a keyword, the app will search for that gibberish word. That should return all documents (that would be the expected behavior in my opinion). I will post my finding here.

sanikolaev commented 1 year ago

Please remove the filter by id an try again

Lot-Art commented 1 year ago

Without id the result is the same as on the screenshot (incorrect count). Im running all these in the console. It's an RT index.

SIMPLE

SELECT count(*) FROMla8_items_upcaftWHERE struct_category = 'modernart' OPTION cutoff=0; 15933

SELECT count(*) FROMla8_items_upcaftWHERE struct_category = 'modernart' OPTION cutoff=1000; 60117

SELECT count(*) FROMla8_items_upcaftWHERE struct_category = 'modernart' OPTION cutoff=10000; 15933

SELECT count(*) FROMla8_items_upcaftWHERE struct_category = 'modernart' OPTION cutoff=0, accurate_aggregation=1; 15933

SELECT count(*) FROMla8_items_upcaftWHERE struct_category = 'modernart' OPTION cutoff=1000, accurate_aggregation=1; 60117

SELECT count(*) FROMla8_items_upcaftWHERE struct_category = 'modernart' OPTION cutoff=10000, accurate_aggregation=1; 15933

GROUP BY

SELECT struct_category, count(*) FROMla8_items_upcaftGROUP BY struct_category; modernart | 60117

SELECT struct_category, count(*) FROMla8_items_upcaftGROUP BY struct_category; modernart | 60117

SELECT struct_category, count(*) FROMla8_items_upcaftGROUP BY struct_category OPTION cutoff=0; modernart | 60117

SELECT struct_category, count(*) FROMla8_items_upcaftGROUP BY struct_category OPTION cutoff=1000; modernart | 60117

SELECT struct_category, count(*) FROMla8_items_upcaftGROUP BY struct_category OPTION cutoff=10000; modernart | 60117

SELECT struct_category, count(*) FROMla8_items_upcaftGROUP BY struct_category OPTION cutoff=0, accurate_aggregation=1; modernart | 60117

SELECT struct_category, count(*) FROMla8_items_upcaftGROUP BY struct_category OPTION cutoff=1000, accurate_aggregation=1; modernart | 60117

SELECT struct_category, count(*) FROMla8_items_upcaftGROUP BY struct_category OPTION cutoff=10000, accurate_aggregation=1; modernart | 60117

(cutoff=10)

SELECT struct_category, count(*) FROMla8_items_upcaftGROUP BY struct_category OPTION cutoff=10, accurate_aggregation=1; modernart | 10557 (and only 10 lines are there as expected)

sanikolaev commented 1 year ago

Please show the output of

SELECT struct_category, count(*) FROM la8_items_upcaft GROUP BY struct_category OPTION accurate_aggregation=1
Lot-Art commented 1 year ago

Here it is (some docs has been added since my opening post, but the different counts remain the problem)

MySQL [(none)]> SELECT struct_category, count(*) FROM la8_items_upcaft GROUP BY struct_category OPTION accurate_aggregation=1;
+--------------------+----------+
| struct_category    | count(*) |
+--------------------+----------+
| bookmanuscr        |    15288 |
| silverware         |    21259 |
| watchclock         |    24968 |
| modernart          |    60455 |
| antiqfurn          |    39999 |
| jewel              |    93624 |
| contempart         |    54501 |
| misc               |   385266 |
| ethnoart           |      933 |
| car                |     1419 |
| art_unknown_period |    30671 |
| designfurn         |    20063 |
| collectibles       |    41006 |
| asianart           |    13578 |
| tableware          |    17937 |
| carpetrug          |    17353 |
| glassware          |     9964 |
| arms               |     6676 |
| oldmasterart       |     9583 |
| ancientart         |      703 |
+--------------------+----------+
20 rows in set (0.032 sec)

That is (maybe) correct. And then this is wrong

MySQL [(none)]> SELECT count(*) FROM `la8_items_upcaft` WHERE struct_category = 'modernart' OPTION accurate_aggregation=1;
+----------+
| count(*) |
+----------+
|    17571 |
+----------+
1 row in set (0.007 sec)
sanikolaev commented 1 year ago

What if you add threads=1 to the both queries?

Lot-Art commented 1 year ago

Here it is

MySQL [(none)]> SELECT count(*) FROM `la8_items_upcaft` WHERE struct_category = 'modernart' OPTION accurate_aggregation=1, threads=1;
+----------+
| count(*) |
+----------+
|    17773 |
+----------+
1 row in set (0.012 sec)

MySQL [(none)]> SELECT struct_category, count(*) FROM la8_items_upcaft GROUP BY struct_category OPTION accurate_aggregation=1, threads=1;
+--------------------+----------+
| struct_category    | count(*) |
+--------------------+----------+
| bookmanuscr        |    15290 |
| silverware         |    21285 |
| art_unknown_period |    30767 |
| watchclock         |    25014 |
| modernart          |    60548 |
| antiqfurn          |    40132 |
| jewel              |    93737 |
| contempart         |    54578 |
| misc               |   386485 |
| ethnoart           |      945 |
| car                |     1419 |
| carpetrug          |    17367 |
| designfurn         |    20110 |
| collectibles       |    41043 |
| asianart           |    13607 |
| tableware          |    17968 |
| oldmasterart       |     9592 |
| glassware          |     9993 |
| arms               |     6690 |
| ancientart         |      703 |
+--------------------+----------+
20 rows in set (0.065 sec)

I also tried with only threads without accurate_aggregation, same result

sanikolaev commented 1 year ago

Looks really wrong. Can you share the table and your config with us by sending it to our write-only s3?

sanikolaev commented 1 year ago

@Lot-Art Hi. Any news on this? We'd really like to look deeper into the issue, but we need your help.

Lot-Art commented 1 year ago

Sorry for being absent, I was looking into giving it to you. This is in production, containing proprietary data from different sources. Sharing is a big legal hassle even with an NDA, so unfortunately I can’t send it to anyone, not even privately. But i will rebuild the index on Monday night and will tell you if that fixed it. Thank you for the help so far.

sanikolaev commented 1 year ago

I see. Can you please check the table with indextool --check before you rebuild it?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Feel free to re-open the issue in case it becomes actual.