man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io
Other
1.51k stars 93 forks source link

Using 'and' in QueryBuilder queries does not produce error but lead to wrong results returned #1970

Closed grusev closed 2 weeks ago

grusev commented 3 weeks ago

Describe the bug

Another usability note ... while playing with QueryBuilder I noticed that following query works correctly q2 = QueryBuilder() q2 = q2[q2["bool"] & (q2["int8"] > 5)] while following also works, but incorrectly: q2 = QueryBuilder() q2 = q2[q2["bool"] and (q2["int8"] > 5)] 'and' is not supported by QueryBuilder as per doc, thus I think it would be great to eventually raise an error with a message like "QueryBuilder supports 'and' through '&'. Please, correct query". Do not know if such things are planned but would be nice to have at some moment some checks there to prevent accidental wrong results produced. Not sure if this has to be logged as issue ...

Steps/Code to Reproduce

from arcticdb.version_store.library import ReadRequest from arcticdb.version_store.processing import QueryBuilder import pandas as pd from arcticdb.util.test import assert_frame_equal, create_df_index_rownum, create_df_index_datetime, random_string, dataframe_arctic_update, get_sample_dataframe, dataframe_dump_to_log, distinct_timestamps

def assert_frame_equal_row_range_fix(expected : pd.DataFrame, actual : pd.DataFrame): expected.reset_index(inplace = True, drop = True) actual.reset_index(inplace = True, drop = True) assert_frame_equal(expected, actual)

def test_read_batch_query_with_and(arctic_library): lib = arctic_library

symbol = "_s_"
df = get_sample_dataframe(100)

dfq = "bool == True and int8 > 5"
q = QueryBuilder()
q = q[q["bool"] and (q["int8"] > 5)]

lib.write(symbol, df)

batch = lib.read_batch(symbols=[ReadRequest(symbol, query_builder=q)])
df_result = df.query(dfq, inplace=False)

print("pandas")
print(df_result)
print("arctic")
print(batch[0])
print(batch[0].data)

assert_frame_equal_row_range_fix(df_result, batch[0].data)

Expected Results

There should have been some error explaining that "and" is not supported and must use '&' instead

OS, Python Version and ArcticDB Version

Python 3.11 Arctic top of master

Backend storage used

No response

Additional Context

Alex Owens Yesterday at 17:11 When you say it "works" with and , what does it do? & works by overloading the and operator, but it isn't possible to do anything similar with and , so we're slightly at the mercy of Python here 17:13 Actually, maybe we could make it work by overloading bool https://docs.python.org/3/reference/datamodel.html#object.__bool__ Python documentationPython documentation 3. Data model Objects, values and types: Objects are Python’s abstraction for data. All data in a Python program is represented by objects or by relations between objects. (In a sense, and in conformance to Von ...

Alex Owens Yesterday at 17:25 Looking at the docs, all objects evaluate to True in a boolean context unless you overload bool , so I'm guessing it just returns the whole dataframe without any filtering?

Georgi Rusev Yesterday at 18:04 In all cases will return wrong results. In my case it returned the data with only right hand of 'and' applied. That is the reason I think that if this can be prevented it should be. Would save time to people looking what is wrong or eventually prevent them from making wrond extract and thinking those extracts are correct. The last case is worse of course

Alex Owens Yesterday at 18:08 Suspect if we overload bool to throw an exception saying and/or/xor/not operands are not supported and &/|/^/~ should be used instead that would be sufficient