Open jamesward opened 7 years ago
Some other considerations:
Raw data is stored in S3. The dataset is 25GB per year, across multiple years. For the sample data loader, how do we plan to initialize or download the data to local storage? I can also look into streaming solutions for sampling the data.
For both the fake data and the sample data loader, should we save the result to disk or S3?
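To make the streaming idea concrete, here is a minimal sketch of reservoir sampling, which keeps a fixed-size uniform sample while reading the data line by line, so the full file never has to sit on local disk. It is written in Python purely for illustration and is not the project's loader; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(lines: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Return up to k lines chosen uniformly at random from a stream of unknown length.

    Hypothetical helper: only k lines are held in memory at any time.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            sample.append(line)
        else:
            # Line i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample
```

The `lines` argument can be any iterator, for example a file object or a response body that is read line by line. The stream is still read end to end, so this saves local storage rather than transfer.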
For the fake data, we can specify the following parameters:
We also want to incorporate weather as a feature. I will look into data sources for weather.
Another thing we want to explore is unit testing. We are running into some problems with testing in the PIO engine. I think this part is a good place to start incorporating a unit test framework, and it can act as an example for other components.
@jamesward do you have any recommendations for the unit testing framework?
For unit testing I think we can use the demo data loader as a library in the pio project. That'll be one nice thing about having this in Scala.
It is unfortunate that these aren't zipped: https://github.com/toddwschneider/nyc-taxi-data/blob/master/raw_data_urls.txt
I wonder if we should zip them and put them back on S3. Or maybe work with the provider to do that. Seems like that'd save some serious money.
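As a rough way to check how much compression would actually save before re-uploading anything, a sketch like the following gzips one file and reports the sizes before and after. It uses only the Python standard library; the helper name `gzip_file` and the file paths are hypothetical, and the ratio achieved on the real files is something I have not measured.

```python
import gzip
import os
import shutil
from typing import Tuple


def gzip_file(src_path: str, dst_path: str) -> Tuple[int, int]:
    """Compress src_path into dst_path and return (original_bytes, compressed_bytes).

    Hypothetical helper: the file is copied in chunks, so it is never loaded
    into memory in full.
    """
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return os.path.getsize(src_path), os.path.getsize(dst_path)
```

Running this on one downloaded month of data would give a per-file ratio to extrapolate from.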
Improvements to the current dataloader: