jamesward / koober


Data Loader Improvements #9

Open · jamesward opened 7 years ago

jamesward commented 7 years ago

Improvements to the current data loader:

adamwhat commented 7 years ago

Some other considerations:

  1. Raw data is stored in S3. The dataset is 25GB per year across multiple years. In the sample data loader, how do we plan to initialize/download the data locally? I can also look into streaming solutions for sampling the data.

  2. For both the fake data and the sample data loader, should we save the result to disk/S3?

  3. For the fake data, we can specify the following parameters (see the sketch after this list):

    1. time range
    2. number of clusters
    3. demand distribution per cluster (or we can assume it's uniform)

  4. We also want to incorporate weather as a feature. I will look into data sources for weather.
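
Riffing on item 3 above, here is a minimal sketch of what such a fake data generator could look like. Everything here (`FakeDataParams`, `FakeRide`, `FakeData.generate`) is a hypothetical name for illustration, not existing koober code; the number of clusters is implied by the size of `clusterWeights`.

```scala
import java.time.{Duration, Instant}
import scala.util.Random

// Hypothetical sketch only, not existing koober code.
case class FakeDataParams(
  start: Instant,              // time range start
  end: Instant,                // time range end
  clusterWeights: Seq[Double]  // relative demand per cluster; all-equal weights = uniform
)

case class FakeRide(pickupTime: Instant, cluster: Int)

object FakeData {
  def generate(params: FakeDataParams, numRides: Int, rng: Random = new Random()): Seq[FakeRide] = {
    val spanMillis = Duration.between(params.start, params.end).toMillis
    val cumulative = params.clusterWeights.scanLeft(0.0)(_ + _).tail  // cumulative weights

    Seq.fill(numRides) {
      val pickup  = params.start.plusMillis((rng.nextDouble() * spanMillis).toLong)
      val draw    = rng.nextDouble() * cumulative.last
      val cluster = cumulative.indexWhere(draw <= _)                  // weighted cluster pick
      FakeRide(pickup, cluster)
    }
  }
}
```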

adamwhat commented 7 years ago

Another thing we want to explore is unit testing. We are running into some problems with testing in the PIO engine. I think this part is a good place to start incorporating a unit-testing framework, and it can act as an example for other components.

@jamesward do you have any recommendations for the unit testing framework?

jamesward commented 7 years ago

  1. The demo data loader currently caches to the local filesystem: https://github.com/jamesward/koober/blob/master/demo-data/src/main/scala/DemoData.scala#L26-L45. But we will probably want to make sure we have enough space, and if not, maybe resort to streaming. That should be pretty straightforward with Akka Streams (see the sketch after this list).
  2. It would be nice to not have to resample the data. Probably easiest right now to store the sampled data locally. Then down the road we can either automate the storage to S3 or just do it manually.
  3. Sounds good.
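
For (1), here is a rough sketch of what the streaming fallback could look like with Akka HTTP + Akka Streams, sampling lines straight from the source URL so we never need enough local disk for the whole file. `StreamingSample`, `sample`, and `keepEvery` are made-up names for illustration; the real loader would plug in its own parsing and sampling logic.

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpRequest
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Framing, Sink}
import akka.util.ByteString

import scala.concurrent.Future

// Hypothetical sketch: stream a raw CSV from its URL and keep every Nth line.
object StreamingSample {
  def sample(url: String, keepEvery: Int)(implicit system: ActorSystem): Future[Seq[String]] = {
    import system.dispatcher
    implicit val mat: ActorMaterializer = ActorMaterializer()

    Http().singleRequest(HttpRequest(uri = url)).flatMap { response =>
      response.entity.dataBytes
        .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 8192, allowTruncation = true))
        .map(_.utf8String)
        .zipWithIndex
        .collect { case (line, idx) if idx % keepEvery == 0 => line }  // simple systematic sample
        .runWith(Sink.seq)
    }
  }
}
```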

For unit testing I think we can use the demo data loader as a library in the pio project. That'll be one nice thing about having this in Scala.
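
As a concrete starting point, a test against the demo data loader could look something like this, assuming ScalaTest (just an assumption on my part; specs2 or anything else would work too). The `DemoData` object below is a local stand-in so the snippet compiles on its own; in the pio project it would be replaced by the real demo-data dependency.

```scala
import org.scalatest.{FlatSpec, Matchers}

// Stand-in for the real demo data loader; swap in the actual koober demo-data API.
object DemoData {
  def load(limit: Int): Seq[String] = Seq.fill(limit)("fake-ride")
}

// Sketch only, assuming ScalaTest as the test framework.
class DemoDataSpec extends FlatSpec with Matchers {

  "the demo data loader" should "return a bounded, non-empty sample" in {
    val records = DemoData.load(limit = 10)
    records should not be empty
    records.size should be <= 10
  }
}
```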

jamesward commented 7 years ago

It is unfortunate that these aren't zipped: https://github.com/toddwschneider/nyc-taxi-data/blob/master/raw_data_urls.txt

I wonder if we should zip them and put them back on S3. Or maybe work with the provider to do that. Seems like that'd save some serious money.
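
If we do end up zipping them ourselves, here is a rough sketch of doing it as a stream (download, gzip on the fly, write out a `.csv.gz`), after which the files could be pushed back to S3 manually or with the AWS CLI. `GzipRawData` and `downloadGzipped` are made-up names for illustration.

```scala
import java.nio.file.Paths

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpRequest
import akka.stream.{ActorMaterializer, IOResult}
import akka.stream.scaladsl.{Compression, FileIO}

import scala.concurrent.Future

// Hypothetical sketch: download one raw CSV and gzip it while streaming, no temp file.
object GzipRawData {
  def downloadGzipped(url: String, outFile: String)(implicit system: ActorSystem): Future[IOResult] = {
    import system.dispatcher
    implicit val mat: ActorMaterializer = ActorMaterializer()

    Http().singleRequest(HttpRequest(uri = url)).flatMap { response =>
      response.entity.dataBytes
        .via(Compression.gzip)                        // compress as the bytes flow through
        .runWith(FileIO.toPath(Paths.get(outFile)))   // local .csv.gz output
    }
  }
}
```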