Closed raylutz closed 4 months ago
Created a full review of "all" Pandas functions vs Pydf.
https://docs.google.com/spreadsheets/d/15q9ExKvg83w6ti4-IFiu6xaWDStwUYkvFCd9so7SkLc/edit#gid=0
The R package data.table has a nice graphical format for its "cheat sheet": https://raw.githubusercontent.com/rstudio/cheatsheets/master/datatable.pdf
Reviewed these other packages related to Daffodil:
Pandas: https://wesmckinney.com/blog/apache-arrow-pandas-internals/
We prepared the following summary of the current landscape.
Package | Comments |
---|---|
Daffodil | Python-integrated datatable with a list-of-lists underlying data structure, row based. Does what Pandas does not do well: appending rows, fast lookups, and handling all data types. Can be smaller and faster than Pandas. |
Pandas | column oriented, Numpy under the hood. Very well established. |
Polars | Lightning-fast DF library/in-memory query engine written in Rust -- also column oriented. May be even slower than Pandas for mixed data. |
Datatable | I was told this is poorly supported, but it does seem to be getting updates. |
Xarray | ND labeled arrays and datasets |
Modin | a fast DataFrame for datasets from 1MB to 1TB+, drop in replacement for Pandas |
D-Tale | Web Client UI for Visualizing Pandas Objects -- not really comparable as an alternative dataframe |
LanceDB | Full database, useful alongside other dataframe systems (Pandas, Polars, etc.); not a dataframe itself. |
DuckDB | SQLite equivalent for analytical OLAP workloads. Large community and well funded. |
PandaPy | oriented to financial data and smallish datasets |
Pyjanitor | APIs for data cleaning - Directed Acyclic Graph DAG for pandas users |
RAPIDS cuDF | executes end-to-end data pipelines entirely on GPUs |
Pandas-vet | plugin for Flake8 that checks pandas code; opinionated linting |
Ray | a low-level framework for parallelizing Python code across processors or clusters |
Dask | Works with data that's too big for memory; uses Pandas syntax (may be usable with Daffodil). Ray says: I tried to use this and found it can't handle arrays with sparse data, so I filled in my data to make it work. I did not get back to it, as I found a better way to handle that particular issue. |
Vaex | Pandas with lazy evaluation and memory mapping. RL: Tried to use this, but it is not easy to use with data from S3. |
Apache Arrow | A development platform for in-memory analytics on columnar data. Speeds up reading and writing column-based data compared with CSV. PyArrow is the Python variant, and it is adopted for use by Pandas 2.0. The fundamental idea is to keep data column-based, which avoids conversions and handles strings better. There is some performance improvement. |
Zarr | for storing collections of annotated tensors (Ray: I did not research this) |
PySpark | Have not yet analyzed this but it may also fit well to help with prepping data prior to sending to this. |
Done.
Need to do a full review of syntax and naming conventions.
In Pandas, a simple df['string'] selects an entire column by column name. Daffodil instead uses conventional Python indexing of the form [row, col], so a column is selected with [:, 'string']. In Daffodil, [irow] selects one row by index, so ['string'] should select a row, not a column. Currently, the code does not support string row indexing because of this inconsistency. It would also be useful to select sets of rows and columns by lists of integers and strings:
my_pydf[['rowx', 'rowy', 'rowz'], ['cola','colb','colc']] should select three rows by name and three columns by name and return a 3x3 pydf. To select rows by name, a keyfield must be defined; otherwise an error should occur.
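The proposed semantics can be sketched with a small stand-in class. This is only an illustration under stated assumptions, not Daffodil's actual implementation: the class name `TinyTable`, its attributes (`cols`, `lol`, `keyfield`, `kd`), and the choice of `KeyError` for a missing keyfield are all hypothetical. It also assumes key values are strings, so that strings and integers can be told apart in a row selector.

```python
class TinyTable:
    """Hypothetical row-based table (list of lists); for illustration only."""

    def __init__(self, cols, rows, keyfield=''):
        self.cols = list(cols)
        self.lol = [list(r) for r in rows]      # row-based list of lists
        self.keyfield = keyfield
        self.kd = {}                            # key value -> row index
        if keyfield:
            kidx = self.cols.index(keyfield)
            self.kd = {row[kidx]: i for i, row in enumerate(self.lol)}

    def _row_indices(self, sel):
        if isinstance(sel, slice):
            return list(range(len(self.lol)))[sel]
        items = sel if isinstance(sel, list) else [sel]
        out = []
        for item in items:
            if isinstance(item, str):
                # String row selection only makes sense with a keyfield.
                if not self.keyfield:
                    raise KeyError("string row selection requires a keyfield")
                out.append(self.kd[item])
            else:
                # range(...)[i] normalizes negatives and checks bounds.
                out.append(range(len(self.lol))[item])
        return out

    def _col_indices(self, sel):
        if isinstance(sel, slice):
            return list(range(len(self.cols)))[sel]
        items = sel if isinstance(sel, list) else [sel]
        return [self.cols.index(c) if isinstance(c, str)
                else range(len(self.cols))[c] for c in items]

    def __getitem__(self, key):
        # t[row] selects rows; t[row, col] selects rows and columns.
        if isinstance(key, tuple) and len(key) == 2:
            rsel, csel = key
        else:
            rsel, csel = key, slice(None)
        ridx = self._row_indices(rsel)
        cidx = self._col_indices(csel)
        new_cols = [self.cols[c] for c in cidx]
        new_rows = [[self.lol[r][c] for c in cidx] for r in ridx]
        # Keep the keyfield only if its column survived the selection.
        kf = self.keyfield if self.keyfield in new_cols else ''
        return TinyTable(new_cols, new_rows, keyfield=kf)


t = TinyTable(
    cols=['id', 'cola', 'colb', 'colc', 'cold'],
    rows=[['rowx', 1, 2, 3, 4],
          ['rowy', 5, 6, 7, 8],
          ['rowz', 9, 10, 11, 12],
          ['roww', 13, 14, 15, 16]],
    keyfield='id',
)
sub = t[['rowx', 'rowy', 'rowz'], ['cola', 'colb', 'colc']]   # 3x3 result
one_row = t['rowy']        # a bare string selects a row, not a column
one_col = t[:, 'cola']     # a column needs the explicit [:, col] form
```

In this sketch a bare string always means a row key, which keeps `[row, col]` consistent with ordinary Python indexing at the cost of differing from the Pandas convention.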