new task: read

jangorecki commented 4 years ago

A benchmark for reading data is on the roadmap. It should cover:

reading csv: the most portable tabular data format, to cover transferring data between different solutions
reading binary formats: the most solution-specific formats, to cover transferring data within the same solution
data of numeric fields only (integers and floats)
data of 50% categorical fields (integers, floats and categoricals)
character fields
date, time and datetime fields
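
As a rough illustration of what timing a csv read could look like, here is a minimal sketch using only the Python standard library. The helper names (`time_csv_read`, `write_sample_csv`), the two-column sample file and the best-of-three timing are assumptions made for this example; they are not part of the benchmark, which would time each solution's own reader instead.

```python
import csv
import os
import tempfile
import time


def time_csv_read(path, repeats=3):
    """Return (row_count, best_seconds) over `repeats` full reads of the file."""
    best = float("inf")
    n_rows = 0
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            n_rows = sum(1 for _ in reader)
        best = min(best, time.perf_counter() - start)
    return n_rows, best


def write_sample_csv(path, n_rows=1000):
    """Write a tiny numeric-only csv (hypothetical layout, for illustration)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        for i in range(n_rows):
            writer.writerow([i, i * 0.5])


tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")
write_sample_csv(sample_path, n_rows=1000)
rows_read, seconds = time_csv_read(sample_path)
print(f"read {rows_read} rows, best of 3: {seconds:.6f}s")
```

Taking the best of several runs is only one possible choice; reporting the first (cold) and second (warm) run separately would be an equally reasonable design.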

ideas for testing particular features (maybe advanced questions?)

feedback welcome

jangorecki commented 3 years ago

I collected some feedback about this task from our internal discussion.

Initially I will focus only on reading csv, not binary formats.

For real-world data, NYT will be a good first case. We should probably find one more popular dataset, so that we have two real-world datasets.

For simulated data:

shape: long, wide, long and wide (fixed rows*cols?)
types separately (3 columns of each type): int, double, char, factor, date, datetime
types mixed (one column of each type)
cardinality (count of unique values)
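
The simulated-data dimensions above could be sketched as a small generator like the one below. This is only an illustration under stated assumptions: the function names (`make_value`, `write_simulated_csv`), the column naming scheme, the value ranges and the fixed seed are all hypothetical and not taken from the benchmark code. Note that in a csv file `char` and `factor` columns are both plain text; the distinction only matters for how a reader is asked to parse them.

```python
import csv
import io
import random
from datetime import date, datetime, timedelta

# Column types listed in the comment above.
TYPES = ["int", "double", "char", "factor", "date", "datetime"]


def make_value(kind, rng, cardinality):
    """Draw one value of `kind` from a pool of `cardinality` distinct values."""
    k = rng.randrange(cardinality)
    if kind == "int":
        return k
    if kind == "double":
        return k + 0.5
    if kind == "char":
        return f"id{k:06d}"
    if kind == "factor":
        return f"level{k}"
    if kind == "date":
        return (date(2020, 1, 1) + timedelta(days=k)).isoformat()
    if kind == "datetime":
        return (datetime(2020, 1, 1) + timedelta(seconds=k)).isoformat()
    raise ValueError(f"unknown column type: {kind}")


def write_simulated_csv(out, n_rows, kinds, cardinality, seed=0):
    """Write n_rows rows with one column per entry in `kinds` to file-like `out`."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    writer = csv.writer(out)
    writer.writerow([f"{kind}{i + 1}" for i, kind in enumerate(kinds)])
    for _ in range(n_rows):
        writer.writerow([make_value(kind, rng, cardinality) for kind in kinds])


# "types mixed": one column of each type.
mixed = io.StringIO()
write_simulated_csv(mixed, n_rows=100, kinds=TYPES, cardinality=10)

# "types separately": three columns of a single type.
ints_only = io.StringIO()
write_simulated_csv(ints_only, n_rows=100, kinds=["int"] * 3, cardinality=10)
```

Shape (long vs wide) is controlled here by `n_rows` and the length of `kinds`, and `cardinality` caps the number of unique values per column.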

MichaelChirico commented 3 years ago

h2oai / db-benchmark