man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

Order of returned dataframe columns does not match the order of the 'columns' list passed to read() and ReadRequest() #2004

Open grusev opened 6 days ago

grusev commented 6 days ago

Describe the bug

The documentation does not clearly specify the 'columns' parameter. As a result, users may implicitly expect that passing it returns a dataframe whose columns are in the order given in the list.

For example, if you pass the column names ['col_234', 'col_13', 'col_567', 'col_182'], you would expect the returned dataframe to have its columns in that order, not in the order in which they were originally defined.

In a large dataframe you cannot remember the order in which the columns were defined, so this behavior is highly unexpected.

Currently 'columns' acts more like a filter: the listed columns are returned, but the order of the list is ignored and the result follows the order in which the symbol was defined.

That is also acceptable, but it must at least be documented, which it currently is not.
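The filter-like behavior described above can be modeled in plain Python on column names alone. This is only an illustrative sketch, not ArcticDB code: the helper `filter_in_stored_order` and the stored column order are hypothetical, chosen to mirror the example list above.

```python
def filter_in_stored_order(stored_columns, requested_columns):
    """Model 'columns' as a filter: keep the requested names, in stored order."""
    requested = set(requested_columns)
    return [name for name in stored_columns if name in requested]


# Hypothetical order in which the columns were defined when the symbol was written.
stored_columns = ["col_13", "col_182", "col_234", "col_567", "col_900"]
# Order the caller passes via 'columns'.
requested_columns = ["col_234", "col_13", "col_567", "col_182"]

returned_columns = filter_in_stored_order(stored_columns, requested_columns)
print(returned_columns)  # ['col_13', 'col_182', 'col_234', 'col_567']
```

The same four columns come back, but in stored order rather than in the order of the request, which is the mismatch this issue is about.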

I am opening this issue to track our decision. There is already a test case for this.

Steps/Code to Reproduce

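# Imports as used in ArcticDB's own test suite (module paths assumed).
from arcticdb import QueryBuilder
from arcticdb.version_store.library import ReadRequest
from arcticdb.util.test import assert_frame_equal_rebuild_index_first, get_sample_dataframe


def test_read_batch_query_and_columns_returned_order(arctic_library):
    '''Column order is expected to match the 'columns' attribute list'''

    def q(q):
        # Works for both a QueryBuilder and a pandas DataFrame: keep rows where 'bool' is True.
        return q[q["bool"]]

    lib = arctic_library

    symbol = "sym"
    df = get_sample_dataframe(size=100)
    df.reset_index(inplace=True, drop=True)
    # Requested order deliberately differs from the order in the written dataframe.
    columns = ['int32', 'float64', 'strings', 'bool']

    lib.write(symbol, df)

    batch = lib.read_batch(symbols=[ReadRequest(symbol, as_of=0, query_builder=q(QueryBuilder()), columns=columns)])

    # Expected result: same row filter, columns in the requested order.
    df_filtered = q(df)[columns]
    assert_frame_equal_rebuild_index_first(df_filtered, batch[0].data)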

Expected Results

The order of columns in the returned dataframe should match the order of the 'columns' list. Alternatively, the documentation should clearly state that the returned columns will always be in the order defined when the symbol was created.
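Until a decision is made, callers can restore the requested order themselves after the read. The following is a minimal sketch using plain pandas. The helper `reorder_to_requested` is hypothetical, and the small `returned` frame is only a stand-in for the dataframe that read() or read_batch() hands back.

```python
import pandas as pd


def reorder_to_requested(df, columns):
    """Return df with its columns in the order given by 'columns'."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"columns not present in result: {missing}")
    return df.loc[:, list(columns)]


# Stand-in for a result whose columns come back in stored order.
returned = pd.DataFrame(
    {
        "bool": [True, False],
        "float64": [1.5, 2.5],
        "int32": [1, 2],
        "strings": ["a", "b"],
    }
)
columns = ["int32", "float64", "strings", "bool"]

reordered = reorder_to_requested(returned, columns)
print(list(reordered.columns))  # ['int32', 'float64', 'strings', 'bool']
```

This only reorders existing columns on the client side. It does not change what is read from storage.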

OS, Python Version and ArcticDB Version

any

Backend storage used

No response

Additional Context

No response