raylutz / daffodil

Python Daffodil: 2-D data arrays with mixed types, lightweight package, can be faster than Pandas
MIT License
7 stars 2 forks

Provides means to assign columns and dtypes using indexing syntax #8

Closed raylutz closed 4 months ago

raylutz commented 6 months ago

It will be very convenient to provide access to the colnames and dtypes (and other similar metadata that is linked to the columns) by using the indexing syntax. This can be done by using special names to avoid row-key collision and treating these like special rows of data.

The special names can avoid collisions with the row keys.

Assigning column names (in addition to when the daf instance is created):

my_daf["$colnames"] = list_of_colnames

my_daf["$colnames"] = [f"col{idx}" for idx in range(my_daf.num_cols())]

my_daf["$colnames"] = Daf.gen_A1_colnames(my_daf.num_cols())      # create spreadsheet type column names

my_daf["$colnames"] = dict     # The keys are used to initialize the columns

Then reading colnames:

colnames = my_daf["$colnames"]        # instead of my_daf.columns()    
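The proposed dispatch can be sketched in plain Python. Everything here is illustrative only: `MiniDaf`, its `hd` header dict, and the `$`-key routing are stand-ins for the idea, not Daffodil's actual internals.

```python
# Hypothetical sketch: '$'-prefixed keys route to metadata, not row data.
class MiniDaf:
    def __init__(self, num_cols: int):
        self._num_cols = num_cols
        self.hd = {}   # header dict: colname -> column index

    def num_cols(self) -> int:
        return self._num_cols

    def __setitem__(self, key, value):
        if key == "$colnames":
            # accepts a list of names, or a dict (only its keys are used)
            names = list(value)
            self.hd = {name: idx for idx, name in enumerate(names)}
        else:
            raise KeyError(key)

    def __getitem__(self, key):
        if key == "$colnames":
            # always returns a plain list, not a daf with a row
            return list(self.hd)
        raise KeyError(key)

my_daf = MiniDaf(num_cols=3)
my_daf["$colnames"] = [f"col{idx}" for idx in range(my_daf.num_cols())]
print(my_daf["$colnames"])   # ['col0', 'col1', 'col2']
```

Because `list(value)` iterates keys when given a dict, the same assignment path also covers `my_daf["$colnames"] = dict`.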

Then for dtypes:

my_daf["$dtypes"] = type        # all columns are initialized with the same type

my_daf["$dtypes"] = list          # each column is assigned the corresponding type from the list.

my_daf["$dtypes"] = dict         # the row is initialized from the values according to the keys, like normal dict assignment to a row.

For example, say three columns hold non-numeric data and the rest are integers:

# first assign all columns the integer type; this should probably accept
# both pure types and quoted type names (e.g. int and 'int')
my_daf["$dtypes"] = 'int'

# then assign a few columns to specific type
my_daf["$dtypes", ['mainkey', 'metadata1', 'metadata2']] = 'str'
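The two-step pattern above can be sketched with a plain dict standing in for the daf's dtypes metadata; the column names and string-valued dtypes are made up for illustration.

```python
# Illustrative only: a plain dict stands in for the daf's dtypes row.
colnames = ['mainkey', 'metadata1', 'metadata2', 'count', 'total']

def set_dtypes(dtypes: dict, cols: list, value: str) -> None:
    # assign one dtype name to each listed column
    for col in cols:
        dtypes[col] = value

dtypes: dict = {}
set_dtypes(dtypes, colnames, 'int')                               # all columns start as int
set_dtypes(dtypes, ['mainkey', 'metadata1', 'metadata2'], 'str')  # then override three
print(dtypes)
# {'mainkey': 'str', 'metadata1': 'str', 'metadata2': 'str', 'count': 'int', 'total': 'int'}
```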

Given a dtypes_dict, initialize the columns and dtypes:

my_daf["$colnames"] = dtypes_dict
my_daf["$dtypes"]      = dtypes_dict

Get the current colnames as a list:

colnames = my_daf["$colnames"]    # this always returns a list, not a daf with a single row.

dtypes_dict = my_daf["$dtypes"]                  # normally returns a dict, which is what most other uses require.
dtypes_list = list(my_daf["$dtypes"].values())   # the list of types in column order.

Suppose we need to know which colnames have the int datatype:

def keys_with_value(dictionary, value):
    return [key for key, val in dictionary.items() if val == value]

int_cols = keys_with_value(my_daf["$dtypes"], 'int')
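Running the helper against an ordinary dict shows the behavior (the helper is repeated here so the snippet is self-contained; the column names and string-valued dtypes are assumed for illustration):

```python
def keys_with_value(dictionary, value):
    # return all keys whose value equals the target value
    return [key for key, val in dictionary.items() if val == value]

dtypes_dict = {'mainkey': 'str', 'metadata1': 'str', 'count': 'int', 'total': 'int'}
int_cols = keys_with_value(dtypes_dict, 'int')
print(int_cols)   # ['count', 'total']
```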

We can build a reverse-lookup structure:

dtypes_to_cols = utils.invert_da_to_dola(my_daf["$dtypes"])

""" given a dict of vals where vals may be repeated,
    create a dict, where the keys are the vals and the
    values are lists of the prior keys. 
    This allows reverse lookup of key based on values in the list.
"""
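A minimal sketch consistent with that docstring (the actual `utils.invert_da_to_dola` in the library may differ in name handling and edge cases):

```python
def invert_da_to_dola(da: dict) -> dict:
    # invert a dict-of-atoms (da) into a dict-of-lists-of-atoms (dola):
    # each distinct value becomes a key mapping to the list of original
    # keys that carried it, enabling reverse lookup of keys by value.
    dola: dict = {}
    for key, val in da.items():
        dola.setdefault(val, []).append(key)
    return dola

dtypes = {'mainkey': 'str', 'metadata1': 'str', 'count': 'int'}
print(invert_da_to_dola(dtypes))
# {'str': ['mainkey', 'metadata1'], 'int': ['count']}
```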

This function is similar to the value_counts() operation, but goes one step further and provides the keys where each value is found.

(I've needed this frequently in data analysis for metadata-type data, especially when that data needs to be correlated or when at least a few examples of where it occurs must be provided.)

For example, let's say we need the column names of a specific dtype:

int_cols = dtypes_to_cols['int']

For a given daf, it might be worth creating the reverse-lookup dola structure and saving it in a cache, or simply searching each time. For now, we can just do the lookup each time, as the dtypes data is not voluminous.

raylutz commented 6 months ago

Another option would be to provide cols and dtypes as property attributes, and use similar code to provide for indexing, including slicing, lists of colnames, etc. A near-term solution could be to provide methods with the same flexibility, without resolving the [ ] syntax.

raylutz commented 4 months ago

This issue has been set aside for now.