red-data-tools / red_amber

A dataframe library for Rubyists.
MIT License

Dataset size #145

Open bkmgit opened 1 year ago

bkmgit commented 1 year ago

It may be helpful to indicate the size of datasets that can be used with RedAmber and which operations are supported. For a comparison with other dataframes, see Table 3 in Towards Scalable Dataframe Systems and https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray

heronshoes commented 1 year ago

I think what you have in mind may be a premature question for RedAmber. However, it is important for users to know what scale of data it can handle and what features it has compared to other data frames. Thanks!

1. data size

RedAmber is an in-memory, single-threaded, non-streaming, eager-execution data frame in Ruby (a dynamic language), so it does not look like much fun compared to data frames that focus on scalability and execution speed.

Still, I am trying to find out how large a dataset it can handle using https://github.com/h2oai/db-benchmark (it is written in R and is not convenient to use). Please let me know if you have a better data set to check scalability.
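
To make the eager versus streaming distinction concrete, here is a minimal sketch in plain Ruby. It does not use RedAmber at all; it only contrasts Ruby's ordinary eager enumeration with `Enumerator::Lazy`, and the range size is an arbitrary assumption chosen for illustration.

```ruby
# Plain Ruby only; this is not RedAmber's API.
rows = (1..100_000)

# Eager: every step materializes a full intermediate array in memory
# before the next step starts.
eager = rows.map { |x| x * 2 }.select { |x| (x % 3).zero? }.first(3)

# Lazy (streaming-like): elements flow through the chain one at a time,
# and evaluation stops as soon as three results have been found.
lazy = rows.lazy.map { |x| x * 2 }.select { |x| (x % 3).zero? }.first(3)

p eager
p lazy
```

Both chains return the same values; the difference is how much intermediate data is held in memory, which is what limits an eager, in-memory data frame as datasets grow.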

2. possible operations

The references you gave me are helpful. I would like to make a comparison chart. At this point, I can easily come up with the following:

By the way, I think the data frame library that RedAmber should be most compared to is Polars. What do you think?

bkmgit commented 1 year ago

Polars seems to use threads. A comparison chart would be helpful. Perhaps indicate the features you wish to add, and possibly compare with other data frame implementations. Arrow has Flight https://github.com/apache/arrow/tree/master/ruby/red-arrow-flight and UCX can run on distributed memory, so larger datasets might be possible.

bkmgit commented 1 year ago

Can add RedAmber to the db-benchmark https://github.com/h2oai/db-benchmark/issues/250 then look for larger datasets.

heronshoes commented 1 year ago

Comparing features between RedAmber, dplyr/tidyr and pandas

This is a comparison of basic features between RedAmber and other major DataFrame libraries. It compares only the method 'verbs', ignoring parameters and options.

Remarks: 1) dataframe represents 2D data containers such as DataFrame, tibble or Table. 2) vector represents 1D data containers such as Vector, Series or Column.

Comments or suggestions are welcome!

Select columns (variables)

| Features | RedAmber | tidyverse | pandas |
| --- | --- | --- | --- |
| Select columns as a dataframe | pick, drop, [] | dplyr::select, dplyr::select_if | [], loc[], iloc[], drop, select_dtypes |
| Select a column as a vector | [], v | dplyr::pull | [], loc[], iloc[] |
| Move columns to a new position | pick, [] | relocate | [], reindex, loc[], iloc[] |

Select rows (records, observations)

| Features | RedAmber | tidyverse | pandas |
| --- | --- | --- | --- |
| Select rows that meet logical criteria as a dataframe | slice, remove, [] | dplyr::filter | [], filter, query, loc[] |
| Select rows by position as a dataframe | slice, remove, [] | dplyr::slice | iloc[], drop |
| Move rows to a new position | slice, [] | dplyr::filter, dplyr::slice | reindex, loc[], iloc[] |
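
As a rough illustration of the difference between selecting rows by logical criteria and by position, here is a plain-Ruby sketch on a column-oriented hash of arrays. The `table` layout and all variable names are assumptions made for this example; it shows the semantics only and is not RedAmber's implementation or API.

```ruby
# Plain Ruby only; a hypothetical column-oriented table (column name => values).
table = { name: %w[a b c d], score: [10, 25, 30, 5] }

# Logical criteria: build a boolean mask from one column,
# then apply the same mask to every column.
mask = table[:score].map { |s| s > 8 }
filtered = table.transform_values do |col|
  col.select.with_index { |_, i| mask[i] }
end

# By position: keep the rows at the given indices.
positions = [0, 2]
picked = table.transform_values { |col| col.values_at(*positions) }

# Removing rows is the complement of picking them.
removed = table.transform_values do |col|
  col.reject.with_index { |_, i| positions.include?(i) }
end

p filtered
p picked
p removed
```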

Update columns / create new columns

| Features | RedAmber | tidyverse | pandas |
| --- | --- | --- | --- |
| Update existing columns | assign | dplyr::mutate | assign, []= |
| Create new columns | assign, assign_left | dplyr::mutate | apply |
| Compute new columns, drop others | new | transmute | (dfply:)transmute |
| Rename columns | rename | dplyr::rename, dplyr::rename_with, purrr::set_names | rename, set_axis |
| Sort dataframe | sort | dplyr::arrange | sort_values |
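
What the assign, rename and sort verbs mean can be sketched in plain Ruby on a hash of arrays. This is only an illustration of the semantics under an assumed `table` layout; none of the calls below are RedAmber, dplyr or pandas methods.

```ruby
# Plain Ruby only; a hypothetical column-oriented table (column name => values).
table = { name: %w[b a c], score: [2, 3, 1] }

# "Assign": add (or overwrite) a column computed from existing ones.
assigned = table.merge(double: table[:score].map { |s| s * 2 })

# "Rename": change a column key and keep its values.
renamed = table.transform_keys { |k| k == :score ? :points : k }

# "Sort": compute the row order once from the sort column,
# then reorder every column with the same index list.
order = table[:score].each_index.sort_by { |i| table[:score][i] }
sorted = table.transform_values { |col| col.values_at(*order) }

p assigned
p renamed
p sorted
```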

Reshape dataframe

| Features | RedAmber | tidyverse | pandas |
| --- | --- | --- | --- |
| Gather columns into rows (create a longer dataframe) | to_long | tidyr::pivot_longer | melt |
| Spread rows into columns (create a wider dataframe) | to_wide | tidyr::pivot_wider | pivot |
| Transpose a wide dataframe | transpose | transpose, t | transpose, T |
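
The long and wide shapes can be illustrated with a small plain-Ruby sketch on an array of row hashes. The column names `name`, `key` and `value` are assumptions chosen for the example; this shows what the reshaping verbs do conceptually and is not how any of the listed libraries implement them.

```ruby
# Plain Ruby only; hypothetical rows represented as hashes.
wide = [
  { name: "A", x: 1, y: 2 },
  { name: "B", x: 3, y: 4 }
]

# Wide -> long: one output row per (name, column) pair.
long = wide.flat_map do |row|
  %i[x y].map { |key| { name: row[:name], key: key, value: row[key] } }
end

# Long -> wide: group by name and spread each key back into its own column.
back = long.group_by { |r| r[:name] }.map do |name, rs|
  rs.each_with_object({ name: name }) { |r, acc| acc[r[:key]] = r[:value] }
end

p long
p back
```

Going from wide to long and back again returns the original rows, which is a handy sanity check for any pair of reshaping functions.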

Grouping

| Features | RedAmber | tidyverse | pandas |
| --- | --- | --- | --- |
| Grouping | group, group.summarize | dplyr::group_by %>% dplyr::summarise | groupby.agg |
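
The group-then-summarize pattern in this row can be sketched in plain Ruby with `Enumerable#group_by`. The sample rows and the `count` and `mean_weight` output names are assumptions for illustration; this is not the API of RedAmber, dplyr or pandas.

```ruby
# Plain Ruby only; hypothetical rows represented as hashes.
rows = [
  { species: "cat", weight: 4.0 },
  { species: "dog", weight: 10.0 },
  { species: "cat", weight: 5.0 }
]

# Split rows by key, then reduce each group to a single summary row.
summary = rows.group_by { |r| r[:species] }.map do |species, members|
  weights = members.map { |r| r[:weight] }
  { species: species, count: members.size, mean_weight: weights.sum / weights.size }
end

p summary
```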

Combine dataframes or tables

| Features | RedAmber | tidyverse | pandas |
| --- | --- | --- | --- |
| Combine additional columns | merge, bind_cols | dplyr::bind_cols | concat |
| Combine additional rows | concatenate, concat, bind_rows | dplyr::bind_rows | concat |
| Inner join | join, inner_join | dplyr::inner_join | merge |
| Full join | join, full_join, outer_join | dplyr::full_join | merge |
| Left join | join, left_join | dplyr::left_join | merge |
| Right join | join, right_join | dplyr::right_join | merge |
| Semi join | join, semi_join | dplyr::semi_join | [isin] |
| Anti join | join, anti_join | dplyr::anti_join | [isin] |
| Collect rows that appear in x or y | union | dplyr::union | merge |
| Collect rows that appear in both x and y | intersect | dplyr::intersect | merge |
| Collect rows that appear in x but not y | difference, setdiff | dplyr::setdiff | merge |

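
Since semi joins, anti joins and the set operations are easy to confuse, here is a minimal plain-Ruby sketch of their semantics on arrays of row hashes. The sample data and the `id` key are assumptions for illustration, and none of this is the API of the libraries in the table.

```ruby
require "set"

# Plain Ruby only; hypothetical rows represented as hashes.
x = [{ id: 1, v: "a" }, { id: 2, v: "b" }, { id: 3, v: "c" }]
y = [{ id: 2, w: "B" }, { id: 3, w: "C" }, { id: 4, w: "D" }]

keys_in_y = y.map { |r| r[:id] }.to_set

# Semi join: rows of x that have a match in y; no columns of y are added.
semi = x.select { |r| keys_in_y.include?(r[:id]) }

# Anti join: rows of x that have no match in y.
anti = x.reject { |r| keys_in_y.include?(r[:id]) }

# Inner join: matching rows, with the columns of both sides combined.
inner = x.flat_map do |l|
  y.select { |r| r[:id] == l[:id] }.map { |r| l.merge(r) }
end

# Set operations compare whole rows, so both sides need the same columns.
a = [{ id: 1 }, { id: 2 }]
b = [{ id: 2 }, { id: 3 }]
union        = a | b
intersection = a & b
difference   = a - b

p semi, anti, inner, union, intersection, difference
```
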
bkmgit commented 1 year ago

This is helpful. Thanks. May also want to compare with Julia where the comparison is part of the documentation.

bkmgit commented 1 year ago

Can create a pull request with this if of interest.

heronshoes commented 1 year ago

Yes. It would be nice if this were part of the documentation in the source tree. I can accept requests for modifications.

bkmgit commented 1 year ago

Discussed on Arrow mailing list https://github.com/ava6969/panda-arrow.git Possibly also interesting: