bkmgit commented 1 year ago

It may be helpful to indicate the size of datasets that can be used with RedAmber and which operations are supported. For a comparison with other dataframes, see Table 3 in Towards Scalable Dataframe Systems and https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray

heronshoes commented 1 year ago

I think this question comes a little early for RedAmber. However, it is important for users to know what scale of data it can handle and what features it has compared to other data frames. Thanks!

1. data size

RedAmber is an in-memory, single-threaded, non-streaming, eagerly executed data frame written in Ruby (a dynamic language), so it does not look like much fun compared to data frames that focus on scalability and execution speed.

Still, I am trying to find out how large a dataset can be handled using https://github.com/h2oai/db-benchmark . (It is written in R and is not convenient to use.) Please let me know if you have a better dataset for checking scalability.
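
As a rough back-of-the-envelope sketch of what "in-memory" implies for dataset size, the snippet below estimates the raw size of a table made only of 64-bit fixed-width columns. The 8 bytes per value, the 9-column shape and the row counts are illustrative assumptions, not measurements of RedAmber or of any benchmark; string columns, validity bitmaps and intermediate copies made during operations are ignored, so real usage would be higher.

```ruby
# Assumption: every column holds 64-bit values (8 bytes each).
BYTES_PER_VALUE = 8

# Raw data size in bytes for a table of fixed-width columns.
def estimated_bytes(n_rows, n_columns, bytes_per_value: BYTES_PER_VALUE)
  n_rows * n_columns * bytes_per_value
end

# Convert a byte count to GiB (2**30 bytes).
def to_gib(bytes)
  bytes.to_f / (1024**3)
end

# Hypothetical table shapes, chosen only to show how the estimate scales.
[10_000_000, 100_000_000, 1_000_000_000].each do |n_rows|
  gib = to_gib(estimated_bytes(n_rows, 9))
  puts format('%<rows>d rows x 9 columns: about %<gib>.1f GiB', rows: n_rows, gib: gib)
end
```

Comparing such an estimate with the available RAM gives a first upper bound on what an eager, non-streaming data frame can load at once.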

2. possible operations

The references you gave me are helpful. I would like to make a comparison chart. At this point, I can easily come up with the following:

Lazy execution: possible in the future, since Arrow has a mechanism for it (Acero).
Parallel execution: the next step after establishing a basic API. Grouping is a good match for parallel execution and Ruby's iterators, so I would like to work on it first.
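
To illustrate why grouping fits parallel execution, here is a minimal plain-Ruby sketch. It does not use RedAmber's API; `parallel_group_sum` and the sample rows are hypothetical names made up for this example. Each group is summarized independently of the others, so the per-group work can be handed to separate workers. Note that under CRuby the global VM lock prevents CPU-bound threads from running truly in parallel, so this shows the structure only, not a speedup.

```ruby
# Sum one column per group, one thread per group.
# Hypothetical helper for illustration; not part of RedAmber.
def parallel_group_sum(rows, key, column)
  # Enumerable#group_by splits the rows into independent groups.
  groups = rows.group_by { |row| row[key] }

  # Each group is summarized in its own thread.
  threads = groups.map do |group_key, members|
    Thread.new { [group_key, members.sum { |row| row[column] }] }
  end

  # Thread#value waits for the thread and returns the block's result.
  threads.map(&:value).to_h
end

rows = [
  { species: 'A', weight: 1.0 },
  { species: 'B', weight: 2.0 },
  { species: 'A', weight: 3.0 },
  { species: 'B', weight: 4.0 },
  { species: 'C', weight: 5.0 }
]

p parallel_group_sum(rows, :species, :weight)
```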

By the way, I think the data frame library that RedAmber should be most compared to is Polars. What do you think?

bkmgit commented 1 year ago

Polars seems to use threads. A comparison chart would be helpful. Perhaps indicate the features you wish to add, and possibly compare with other data frame implementations. Arrow has Flight https://github.com/apache/arrow/tree/master/ruby/red-arrow-flight and UCX can run on distributed memory, so larger datasets might be possible.

bkmgit commented 1 year ago

RedAmber could be added to db-benchmark (https://github.com/h2oai/db-benchmark/issues/250), and then we can look for larger datasets.

heronshoes commented 1 year ago

Comparing features between RedAmber, dplyr/tidyr and pandas

This is a comparison of basic features between RedAmber and other major DataFrame libraries. It compares only the method 'verbs' and ignores parameters and options.

Remarks: 1) dataframe represents 2D data containers such as DataFrame, tibble or Table. 2) vector represents 1D data containers such as Vector, Series or Column.

Comments or suggestions are welcome!

Select columns (variables)

Features	RedAmber	tidyverse	pandas
Select columns as a `dataframe`	pick, drop, []	dplyr::select, dplyr::select_if	[], loc[], iloc[], drop, select_dtypes
Select a column as a `vector`	[], v	dplyr::pull	[], loc[], iloc[]
Move columns to a new position	pick, []	dplyr::relocate	[], reindex, loc[], iloc[]

Select rows (records, observations)

Features	RedAmber	tidyverse	pandas
Select rows that meet logical criteria as a `dataframe`	slice, remove, []	dplyr::filter	[], filter, query, loc[]
Select rows by position as a `dataframe`	slice, remove, []	dplyr::slice	iloc[], drop
Move rows to a new position	slice, []	dplyr::filter, dplyr::slice	reindex, loc[], iloc[]

Update columns / create new columns

Features	RedAmber	tidyverse	pandas
Update existing columns	assign	dplyr::mutate	assign, []=
Create new columns	assign, assign_left	dplyr::mutate	apply
Compute new columns, drop others	new	dplyr::transmute	(dfply:)transmute
Rename columns	rename	dplyr::rename, dplyr::rename_with, purrr::set_names	rename, set_axis
Sort dataframe	sort	dplyr::arrange	sort_values

Reshape dataframe

Features	RedAmber	tidyverse	pandas
Gather columns into rows (create a longer `dataframe`)	to_long	tidyr::pivot_longer	melt
Spread rows into columns (create a wider `dataframe`)	to_wide	tidyr::pivot_wider	pivot
Transpose a wide `dataframe`	transpose	transpose, t	transpose, T
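
For readers unfamiliar with the long/wide terminology, the following plain-Ruby sketch shows what the two reshaping operations do to the data. It works on arrays of hashes rather than on a real dataframe; the helper names only echo the verbs in the table, and their signatures are assumptions made for this example, not RedAmber's API.

```ruby
# Gather every column except `keep` into (name, value) rows.
# Hypothetical helper for illustration only.
def to_long(rows, keep, name: :name, value: :value)
  rows.flat_map do |row|
    (row.keys - [keep]).map do |column|
      { keep => row[keep], name => column, value => row[column] }
    end
  end
end

# Spread (name, value) rows back into one row per `keep` value.
# Hypothetical helper for illustration only.
def to_wide(rows, keep, name: :name, value: :value)
  rows.group_by { |row| row[keep] }.map do |key, members|
    members.each_with_object({ keep => key }) do |row, wide|
      wide[row[name]] = row[value]
    end
  end
end

wide = [{ id: 1, a: 10, b: 20 }, { id: 2, a: 30, b: 40 }]
long = to_long(wide, :id)
p long            # one row per (id, column) pair
p to_wide(long, :id)
```

Applying `to_wide` to the result of `to_long` returns the original rows, which is a handy sanity check for any pair of reshaping functions.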

Grouping

Features	RedAmber	tidyverse	pandas
Grouping	group, group.summarize	dplyr::group_by %>% dplyr::summarise	groupby.agg

Combine dataframes or tables

Features	RedAmber	tidyverse	pandas
Combine additional columns	merge, bind_cols	dplyr::bind_cols	concat
Combine additional rows	concatenate, concat, bind_rows	dplyr::bind_rows	concat
Inner join	join, inner_join	dplyr::inner_join	merge
Full join	join, full_join, outer_join	dplyr::full_join	merge
Left join	join, left_join	dplyr::left_join	merge
Right join	join, right_join	dplyr::right_join	merge
Semi join	join, semi_join	dplyr::semi_join	[isin]
Anti join	join, anti_join	dplyr::anti_join	[isin]
Collect rows that appear in x or y	union	dplyr::union	merge
Collect rows that appear in both x and y	intersect	dplyr::intersect	merge
Collect rows that appear in x but not y	difference, setdiff	dplyr::setdiff	merge
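
Semi joins, anti joins and the set operations are the least familiar rows in this table, so here is a minimal plain-Ruby sketch of their semantics on arrays of hashes. The helpers `semi_join` and `anti_join` below are hypothetical functions written for this example; they are not RedAmber's methods of the same name and take different arguments.

```ruby
require 'set'

# Semi join: rows of `left` whose key value also appears in `right`.
# No columns from `right` are added.
def semi_join(left, right, key)
  keys = right.map { |row| row[key] }.to_set
  left.select { |row| keys.include?(row[key]) }
end

# Anti join: rows of `left` whose key value does not appear in `right`.
def anti_join(left, right, key)
  keys = right.map { |row| row[key] }.to_set
  left.reject { |row| keys.include?(row[key]) }
end

x = [{ id: 1, v: 'a' }, { id: 2, v: 'b' }, { id: 3, v: 'c' }]
y = [{ id: 2, w: 'p' }, { id: 4, w: 'q' }]

p semi_join(x, y, :id)   # rows of x with a match in y
p anti_join(x, y, :id)   # rows of x without a match in y

# Set operations compare whole rows. Ruby arrays support them directly
# because hashes are compared by content.
a = [{ id: 1 }, { id: 2 }]
b = [{ id: 2 }, { id: 3 }]
p a | b   # union
p a & b   # intersection
p a - b   # difference
```

This also shows why the pandas column lists `[isin]` for these two joins: both reduce to a membership test on the key column.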

bkmgit commented 1 year ago

This is helpful. Thanks. You may also want to compare with Julia, where such a comparison is part of the documentation.

bkmgit commented 1 year ago

I can create a pull request with this if it is of interest.

heronshoes commented 1 year ago

Yes. It would be nice if this were part of the documentation in the source tree. I am happy to accept requests for modifications.

bkmgit commented 1 year ago

Discussed on the Arrow mailing list. Possibly also interesting: https://github.com/ava6969/panda-arrow.git

red-data-tools / red_amber

Dataset size #145
