Data for testing - Githubissues

JackKelly commented 2 years ago

Publish a couple of pre-prepared batches (per modality) in a new GCP bucket.

Write a simple function, power_perceiver.testing.get_test_data_filename, which checks whether the testing data already exists locally (in a temporary directory) and, if not, downloads it.

JackKelly commented 2 years ago

@jacobbieker @peterdudfield just FYI, my plan is to upload two pre-prepared batches for each modality to a publicly accessible Google Cloud Storage bucket: gs://ocf-public/data_for_unit_tests/prepared_ML_training_data. Then I'll write the little utility function described above.

JackKelly commented 2 years ago

Copied the data using a one-line bash loop:

# Copy batches 000000.nc and 000001.nc for each modality into the public bucket.
for dir in gsp nwp opticalflow pv satellite sun topographic; do
  gsutil -m cp -J \
    "gs://solar-pv-nowcasting-data/prepared_ML_training_data/v15/test/$dir/00000[01].nc" \
    "gs://ocf-public/data_for_unit_tests/prepared_ML_training_data/$dir";
done
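The utility function proposed earlier in the thread could look roughly like the sketch below. It is only an illustration under stated assumptions, not the function that ended up in power_perceiver: it assumes the copied objects are reachable over HTTPS at storage.googleapis.com/ocf-public/..., that the two batches per modality are named 000000.nc and 000001.nc (as the copy loop implies), and that a "power_perceiver_test_data" sub-directory of the system temp directory is an acceptable cache location. The `cache_dir` and `download` parameters are hypothetical additions so the function can be exercised without network access.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable, Optional

# Assumed public HTTPS form of
# gs://ocf-public/data_for_unit_tests/prepared_ML_training_data
BASE_URL = (
    "https://storage.googleapis.com/ocf-public/"
    "data_for_unit_tests/prepared_ML_training_data"
)
# Modalities listed in the copy loop above.
MODALITIES = ("gsp", "nwp", "opticalflow", "pv", "satellite", "sun", "topographic")


def _download(url: str, destination: Path) -> None:
    """Fetch `url` over HTTPS and write it to `destination`."""
    urllib.request.urlretrieve(url, str(destination))


def get_test_data_filename(
    modality: str,
    batch_idx: int = 0,
    cache_dir: Optional[Path] = None,  # hypothetical parameter
    download: Callable[[str, Path], None] = _download,  # hypothetical parameter
) -> Path:
    """Return a local path to a test batch, downloading it only if missing."""
    if modality not in MODALITIES:
        raise ValueError(f"Unknown modality {modality!r}; expected one of {MODALITIES}")
    if batch_idx not in (0, 1):
        raise ValueError("Only batches 0 and 1 were uploaded for each modality")

    if cache_dir is None:
        # Assumed cache location inside the system temporary directory.
        cache_dir = Path(tempfile.gettempdir()) / "power_perceiver_test_data"

    filename = f"{batch_idx:06d}.nc"
    local_path = cache_dir / modality / filename
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        # Download to a temporary name first so an interrupted download
        # never leaves a truncated file at the final path.
        partial_path = local_path.with_suffix(".part")
        download(f"{BASE_URL}/{modality}/{filename}", partial_path)
        partial_path.replace(local_path)
    return local_path
```

A test would then call something like get_test_data_filename("satellite") and open the returned path; the second and later calls reuse the cached file rather than downloading again.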

peterdudfield commented 2 years ago

Sounds good, give it a go. The way I've done it before was to use:

from nowcasting_dataset.config.model import Configuration
from nowcasting_dataset.dataset.batch import Batch

# Build a configuration with every input-data source set to its defaults.
configuration = Configuration()
configuration.input_data = configuration.input_data.set_all_to_defaults()

# Generate a batch of fake (random) data for that configuration.
batch = Batch.fake(configuration=configuration)

I think these are just two different ways of tackling the same problem. Both have advantages and disadvantages.

GCP files: Advantages:

It's real data.
Easy to access.
Helps benchmark load speed.

Cons:

Might need updating when there is a new version.
The data is the same each time.

Fake: Advantages:

Creates random data.
Can be adjusted depending on the configuration.

Cons:

Needs updating / aligning with the dataset.
Not real data.
Can be slow with large batches (batch size > 128).

Probably best to try it out, and maybe both approaches are good in their different ways.

JackKelly commented 2 years ago

Yeah, good point, I do like the fake batches... but, for this, I'm also keen to use the "real" data to help benchmark loading speed. So it feels like "real" data is probably a good option here.
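As a rough illustration of what benchmarking loading speed on the real batches might involve, here is a minimal timing sketch. The helper name `benchmark_loading` and its parameters are hypothetical; it simply times an arbitrary loader callable over a list of files with `time.perf_counter` and reports the mean duration per file.

```python
import time
from pathlib import Path
from statistics import mean
from typing import Callable, Dict, Iterable


def benchmark_loading(
    load: Callable[[Path], object],
    filenames: Iterable[Path],
    repeats: int = 3,
) -> Dict[Path, float]:
    """Return the mean wall-clock seconds `load` takes for each file."""
    results: Dict[Path, float] = {}
    for filename in filenames:
        path = Path(filename)
        durations = []
        for _ in range(repeats):
            start = time.perf_counter()
            load(path)  # the loader under test; its return value is discarded
            durations.append(time.perf_counter() - start)
        results[path] = mean(durations)
    return results
```

With xarray installed, the loader could be something like `lambda p: xr.open_dataset(p).load()`, so that the data is actually read and not just lazily opened. Repeated reads of the same file are likely to be served from the operating system's file cache, so the first repeat is usually the most representative of a cold load.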

To help connect the dots, I should link to a highly related discussion: https://github.com/openclimatefix/nowcasting_dataset/issues/512

peterdudfield commented 2 years ago

Nice one - good luck with it

openclimatefix / power_perceiver

Data for testing #2