Open bkmgit opened 1 year ago
I think what you have in mind is a premature question for RedAmber. However, it is important for users to know how much data it can handle and what features it has compared to other data frames. Thanks!
RedAmber is an in-memory, single-threaded, non-streaming, eager-execution data frame written in Ruby (a dynamic language), so it does not look like much fun compared to data frames that focus on scalability and execution speed.
Still, I am trying to find out how large a data set it can handle, using https://github.com/h2oai/db-benchmark . Please let me know if you have a better data set to check scalability. (db-benchmark is written in R and is not convenient to use.)
The references you gave me are helpful. I would like to make a comparison chart. At this point, I can easily come up with the following:
By the way, I think the data frame library that RedAmber should be compared to most closely is Polars. What do you think?
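For readers less familiar with the eager versus streaming distinction drawn earlier in this comment, here is a minimal sketch in plain Ruby. It does not use RedAmber at all; the range, the block bodies, and the call counters are assumptions made up for illustration. It only shows why an eager pipeline touches every element while a lazy one can stop early.

```ruby
# Plain Ruby only; this is NOT RedAmber code, just an illustration.
rows = (1..1000)

eager_calls = 0
lazy_calls = 0

# Eager: map runs over all 1000 elements and builds a full intermediate
# Array before select and first are applied.
eager = rows.map { |x| eager_calls += 1; x * 2 }
            .select { |x| (x % 3).zero? }
            .first(3)

# Lazy: elements flow through the pipeline one at a time, and evaluation
# stops as soon as three results have been produced.
lazy = rows.lazy
           .map { |x| lazy_calls += 1; x * 2 }
           .select { |x| (x % 3).zero? }
           .first(3)

p eager        # same three values from both pipelines
p lazy
p eager_calls  # number of elements the eager map visited
p lazy_calls   # number of elements the lazy map visited
```

Both pipelines return the same result, but the eager one pays for the whole input up front, which is the trade-off to keep in mind when judging how large a data set an eager, in-memory data frame can handle.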
Polars seems to use threads. A comparison chart would be helpful. Perhaps indicate the features you wish to add, and possibly compare with other data frame implementations. Arrow has Flight https://github.com/apache/arrow/tree/master/ruby/red-arrow-flight and UCX can run on distributed memory, so larger datasets might be possible.
We can add RedAmber to db-benchmark https://github.com/h2oai/db-benchmark/issues/250 and then look for larger datasets.
This is a comparison of basic features between RedAmber and other major DataFrame libraries. It compares only the method 'verbs', ignoring parameters and options.
Remarks:
1) dataframe represents 2D data containers such as DataFrame, tibble or Table.
2) vector represents 1D data containers such as Vector, Series or Column.
Comments or suggestions are welcome!
Features | RedAmber | tidyverse | pandas |
---|---|---|---|
Select columns as a dataframe | pick, drop, [] | dplyr::select, dplyr::select_if | [], loc[], iloc[], drop, select_dtypes |
Select a column as a vector | [], v | dplyr::pull | [], loc[], iloc[] |
Move columns to a new position | pick, [] | relocate | [], reindex, loc[], iloc[] |
Features | RedAmber | tidyverse | pandas |
---|---|---|---|
Select rows that meet logical criteria as a dataframe | slice, remove, [] | dplyr::filter | [], filter, query, loc[] |
Select rows by position as a dataframe | slice, remove, [] | dplyr::slice | iloc[], drop |
Move rows to a new position | slice, [] | dplyr::filter, dplyr::slice | reindex, loc[], iloc[] |
Features | RedAmber | tidyverse | pandas |
---|---|---|---|
Update existing columns | assign | dplyr::mutate | assign, []= |
Create new columns | assign, assign_left | dplyr::mutate | apply |
Compute new columns, drop others | new | transmute | (dfply:)transmute |
Rename columns | rename | dplyr::rename, dplyr::rename_with, purrr::set_names | rename, set_axis |
Sort dataframe | sort | dplyr::arrange | sort_values |
Features | RedAmber | tidyverse | pandas |
---|---|---|---|
Gather columns into rows (create a longer dataframe) | to_long | tidyr::pivot_longer | melt |
Spread rows into columns (create a wider dataframe) | to_wide | tidyr::pivot_wider | pivot |
Transpose a wide dataframe | transpose | transpose, t | transpose, T |
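To make the gather and spread rows above concrete, here is a rough sketch in plain Ruby of what reshaping to long and back to wide means. It is not RedAmber's implementation or API: the helper names `to_long_sketch` and `to_wide_sketch`, their keyword arguments, and the sample data are all hypothetical, and tables are modelled simply as Arrays of Hashes.

```ruby
# Plain Ruby only; hypothetical helpers, NOT RedAmber's to_long / to_wide.

# Hypothetical wide table: one row per name, one column per year.
wide = [
  { name: "a", y2020: 1, y2021: 2 },
  { name: "b", y2020: 3, y2021: 4 }
]

# Gather: every column not listed in `keep` becomes its own row,
# stored as a (name, value) pair next to the kept columns.
def to_long_sketch(rows, keep:, name:, value:)
  rows.flat_map do |row|
    row.reject { |k, _| keep.include?(k) }.map do |k, v|
      row.slice(*keep).merge(name => k, value => v)
    end
  end
end

# Spread: rows that share the same kept columns collapse into one row,
# with each (name, value) pair turned back into a column.
def to_wide_sketch(rows, keep:, name:, value:)
  rows.group_by { |row| row.slice(*keep) }.map do |key, group|
    group.reduce(key) { |acc, row| acc.merge(row[name] => row[value]) }
  end
end

long = to_long_sketch(wide, keep: [:name], name: :year, value: :count)
round_trip = to_wide_sketch(long, keep: [:name], name: :year, value: :count)

long.each { |row| p row }
p round_trip == wide
```

The round trip back to the original table is the property that pairs pivot_longer with pivot_wider and melt with pivot in the other columns of the table.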
Features | RedAmber | tidyverse | pandas |
---|---|---|---|
Grouping | group, group.summarize | dplyr::group_by %>% dplyr::summarise | groupby.agg |
Features | RedAmber | tidyverse | pandas |
---|---|---|---|
Combine additional columns | merge, bind_cols | dplyr::bind_cols | concat |
Combine additional rows | concatenate, concat, bind_rows | dplyr::bind_rows | concat |
Inner join | join, inner_join | dplyr::inner_join | merge |
Full join | join, full_join, outer_join | dplyr::full_join | merge |
Left join | join, left_join | dplyr::left_join | merge |
Right join | join, right_join | dplyr::right_join | merge |
Semi join | join, semi_join | dplyr::semi_join | [isin] |
Anti join | join, anti_join | dplyr::anti_join | [isin] |
Collect rows that appear in x or y | union | dplyr::union | merge |
Collect rows that appear in both x and y | intersect | dplyr::intersect | merge |
Collect rows that appear in x but not y | difference, setdiff | dplyr::setdiff | merge |
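The pandas column lists `[isin]` for semi and anti joins, which suggests a membership filter rather than a dedicated verb. As a minimal sketch of what those two joins mean, the plain Ruby below models tables as Arrays of Hashes. The sample rows and the `id` key are made up for illustration, and none of this is RedAmber, dplyr, or pandas code.

```ruby
require "set"

# Hypothetical sample tables: one Hash per row, joined on :id.
x = [
  { id: 1, name: "a" },
  { id: 2, name: "b" },
  { id: 3, name: "c" }
]
y = [
  { id: 2, score: 10 },
  { id: 3, score: 20 },
  { id: 3, score: 30 },
  { id: 4, score: 40 }
]

# Keys present in y, collected once so each membership test is cheap.
y_keys = y.map { |row| row[:id] }.to_set

# Semi join: rows of x that have at least one match in y.
# Only the columns of x are kept, and x rows are never duplicated.
semi = x.select { |row| y_keys.include?(row[:id]) }

# Anti join: rows of x that have no match in y.
anti = x.reject { |row| y_keys.include?(row[:id]) }

# Inner join, for contrast: columns of both sides, one row per matching pair.
inner = x.flat_map do |xr|
  y.select { |yr| yr[:id] == xr[:id] }.map { |yr| xr.merge(yr) }
end

p semi
p anti
p inner
```

Note that the row with `id: 3` appears once in the semi join but twice in the inner join, because semi and anti joins only filter x and never bring in columns or extra rows from y.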
This is helpful. Thanks. You may also want to compare with Julia, where such a comparison is part of the documentation.
I can create a pull request with this if it is of interest.
Yes. It would be nice if this were part of the documentation in the source tree. I can accept requests for modifications.
Discussed on Arrow mailing list https://github.com/ava6969/panda-arrow.git Possibly also interesting:
It may be helpful to indicate the size of datasets that can be used with RedAmber and which operations will be supported. For a comparison with other dataframes, see Table 3 in Towards Scalable Dataframe Systems and https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray