raylutz / daffodil

Python Daffodil: 2-D data arrays with mixed types, lightweight package, can be faster than Pandas
MIT License

Review indexing and selection syntax and compare with Pandas and Polars, etc. #1

Closed raylutz closed 4 months ago

raylutz commented 7 months ago

Need to do a full review of syntax and naming conventions.

In Pandas, a simple df['string'] selects the entire column by column name. Daffodil, in contrast, uses conventional Python indexing of the form [row, col], so a column is selected with [:, 'string']. In Daffodil, [irow] selects one row by index. Thus, ['string'] should select a row, not a column. Currently, the code does not support string row indexing because of this inconsistency. I would also like to select sets of rows and columns by lists of integers and strings:

my_pydf[['rowx', 'rowy', 'rowz'], ['cola', 'colb', 'colc']] should select three rows by name and three columns by name and return a 3x3 pydf. Selecting rows by name requires that a keyfield be defined; otherwise an error should occur.
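To make the proposed semantics concrete, here is a minimal pure-Python sketch. It is not Daffodil's implementation: the class name `TinyTable` and its attributes `rows`, `cols`, and `keyfield` are hypothetical, chosen only to illustrate [row, col] indexing over a list-of-lists, with string row selection allowed only when a keyfield is defined.

```python
class TinyTable:
    """Hypothetical illustration of [row, col] selection; not Daffodil's code."""

    def __init__(self, rows, cols, keyfield=None):
        self.rows = [list(r) for r in rows]  # row-based list of lists
        self.cols = list(cols)               # column names, in order
        self.keyfield = keyfield             # column whose values name the rows

    def _row_indexes(self, sel):
        if isinstance(sel, slice):
            return list(range(len(self.rows)))[sel]
        items = sel if isinstance(sel, list) else [sel]
        out = []
        for item in items:
            if isinstance(item, str):
                # String row selection is only meaningful with a keyfield.
                if self.keyfield is None:
                    raise KeyError("string row selection requires a keyfield")
                kcol = self.cols.index(self.keyfield)
                keys = [r[kcol] for r in self.rows]
                out.append(keys.index(item))
            else:
                out.append(item)
        return out

    def _col_indexes(self, sel):
        if isinstance(sel, slice):
            return list(range(len(self.cols)))[sel]
        items = sel if isinstance(sel, list) else [sel]
        return [self.cols.index(c) if isinstance(c, str) else c for c in items]

    def __getitem__(self, key):
        # A bare key selects rows; a (rows, cols) tuple selects both.
        row_sel, col_sel = key if isinstance(key, tuple) else (key, slice(None))
        ridx = self._row_indexes(row_sel)
        cidx = self._col_indexes(col_sel)
        new_cols = [self.cols[c] for c in cidx]
        new_rows = [[self.rows[r][c] for c in cidx] for r in ridx]
        keyfield = self.keyfield if self.keyfield in new_cols else None
        return TinyTable(new_rows, new_cols, keyfield)


data = [["rowx", 1, 2], ["rowy", 3, 4], ["rowz", 5, 6]]
t = TinyTable(data, cols=["id", "cola", "colb"], keyfield="id")

sub = t[["rowx", "rowz"], ["cola", "colb"]]  # rows by name, columns by name
one_row = t["rowy"]                          # a bare string selects a row
one_col = t[:, "cola"]                       # a column needs the [:, name] form
```

In this sketch a bare string always means a row, which is the opposite of Pandas, and a table built without a keyfield raises KeyError on any string row selection.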

raylutz commented 6 months ago

Created a full review of "all" Pandas functions vs Pydf.

https://docs.google.com/spreadsheets/d/15q9ExKvg83w6ti4-IFiu6xaWDStwUYkvFCd9so7SkLc/edit#gid=0

raylutz commented 6 months ago

R package data.table has a nice graphical format for their "cheat sheet" https://raw.githubusercontent.com/rstudio/cheatsheets/master/datatable.pdf

raylutz commented 6 months ago

Reviewed these other packages related to Daffodil:

Pandas: https://wesmckinney.com/blog/apache-arrow-pandas-internals/

We prepared the following summary of the current landscape.

| Package | Comments |
| --- | --- |
| Daffodil | Python-integrated datatable; list-of-lists underlying data structure, row based. Can do what Pandas can't: append rows, faster lookups, good with all data types. Can be smaller and faster than Pandas. |
| Pandas | Column oriented, Numpy under the hood. Very well established. |
| Polars | Lightning-fast DF library/in-memory query engine written in Rust; also column oriented. May be even slower than Pandas for mixed data. |
| Datatable | I was told this is poorly supported, but it does seem to be getting updates. |
| Xarray | ND labeled arrays and datasets. |
| Modin | A fast DataFrame for datasets from 1MB to 1TB+; drop-in replacement for Pandas. |
| D-Tale | Web client UI for visualizing Pandas objects; not really comparable as an alternative dataframe. |
| LanceDB | Full database, useful with other dataframe systems (pandas, polars, etc.); not a dataframe itself. |
| DuckDB | SQLite equivalent for analytical OLAP workloads. Large community and well funded. |
| PandaPy | Oriented to financial data and smallish datasets. |
| Pyjanitor | APIs for data cleaning; Directed Acyclic Graph (DAG) for pandas users. |
| RAPIDS cuDF | Executes end-to-end data pipelines entirely on GPUs. |
| Pandas-vet | Plugin for Flake8 that checks pandas code; opinionated linting. |
| Ray | A low-level framework for parallelizing Python code across processors or clusters. |
| Dask | Works with data that's too big for memory; uses Pandas syntax (may be usable with Daffodil). Ray says: I tried to use this and found it can't handle arrays with sparse data; I actually worked to fill in my data so it would work. I did not get back to it, as I found a better way to handle that particular issue. |
| Vaex | Pandas with lazy evaluation and memory mapping. RL: Tried to use this, but it is not easy to use with data from S3. |
| Apache Arrow | A dev platform for in-memory analytics; processes columnar data. Speeds up reading and writing column-based data instead of using CSV. PyArrow is the Python variant of this, and it is adopted for use by Pandas 2.0. The fundamental thing is column-based data, to avoid conversion and handle strings better. There is some performance improvement. |
| Zarr | For storing collections of annotated tensors. (Ray: I did not research this.) |
| PySpark | Not yet analyzed, but Daffodil may also fit well for prepping data before sending it to PySpark. |

raylutz commented 4 months ago

Done.