red-data-tools / red-datasets

A RubyGem that provides common datasets
MIT License

Large data sets #126

Open bkmgit opened 2 years ago

bkmgit commented 2 years ago

It may be good to have a different way to work with large datasets. For example, the LDBC Graphalytics data sets (https://ldbcouncil.org/benchmarks/graphalytics/) are 1.1 TB in total.

kou commented 2 years ago

Do you have any idea?

What should we care about? Local storage size? Download time? ...?

bkmgit commented 2 years ago
  1. Have a threshold for streaming data from disk instead of reading all data into memory, with a default such as 500 MB that the user can adjust (a minimal sketch follows this list).
  2. Local storage may be an issue; perhaps ask the user whether they want to proceed and give an estimate of the required storage space (see the confirmation sketch below).
  3. For download time, not much can be done. On Linux, wget -c is helpful for continuing an incomplete download without starting again. If the data is stored in the cloud in a suitable form, one could stream only the portion of interest, but this requires infrastructure that supports it and is perhaps a step for the future (see the resumable-download sketch below). At present we want to consider datasets up to 100 GB, which can be analyzed on a workstation.
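
A minimal sketch of the streaming-threshold idea in point 1, assuming a plain CSV file on disk. The 500 MB default and the `each_record` helper are illustrative, not part of the current red-datasets API:

```ruby
require "csv"

# Illustrative threshold in bytes; not an existing red-datasets option.
STREAMING_THRESHOLD = 500 * 1024 * 1024 # 500 MB, user adjustable

# Yield records one by one for large files, or load everything
# into memory when the file is small enough.
def each_record(path, threshold: STREAMING_THRESHOLD, &block)
  if File.size(path) > threshold
    # Stream row by row; only one row is held in memory at a time.
    CSV.foreach(path, headers: true, &block)
  else
    # Small file: read it all at once.
    CSV.read(path, headers: true).each(&block)
  end
end

# Usage looks the same regardless of file size:
# each_record("edges.csv") { |row| p row }
```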
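
For point 2, a tiny sketch of prompting the user with an estimate before downloading; the estimated size would have to come from dataset metadata, so the value passed in is illustrative:

```ruby
# Ask the user to confirm once we know roughly how much space is needed.
def confirm_download?(estimated_bytes)
  gib = estimated_bytes / (1024.0 ** 3)
  print(format("This dataset needs about %.1f GiB of local storage. Continue? [y/N] ", gib))
  gets.to_s.strip.downcase.start_with?("y")
end
```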
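
And for point 3, a sketch of resuming an interrupted download over HTTP in the spirit of wget -c, using only Net::HTTP and a Range header; the URL and file name are placeholders, and this assumes the server supports range requests:

```ruby
require "net/http"
require "uri"

# Resume a partially downloaded file by requesting only the missing
# byte range, similar to what `wget -c` does.
def resume_download(url, destination)
  uri = URI(url)
  already = File.exist?(destination) ? File.size(destination) : 0
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    request = Net::HTTP::Get.new(uri)
    # Ask the server for the remaining bytes only.
    request["Range"] = "bytes=#{already}-" if already > 0
    http.request(request) do |response|
      # 206 Partial Content means the server honored the Range header;
      # anything else means we have to start from scratch.
      mode = response.code == "206" ? "ab" : "wb"
      File.open(destination, mode) do |file|
        # Write the body in chunks so the whole file never sits in memory.
        response.read_body { |chunk| file.write(chunk) }
      end
    end
  end
end

# resume_download("https://example.com/dataset.tar.gz", "dataset.tar.gz")
```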