@jacobbieker @peterdudfield just FYI, my plan is to upload two pre-prepared batches for each modality to a publicly accessible Google Cloud Storage bucket: gs://ocf-public/data_for_unit_tests/prepared_ML_training_data. Then I'll write the little utility function described above.
Copied the data using a one-line bash loop:

```bash
for dir in gsp nwp opticalflow pv satellite sun topographic; do
  gsutil -m cp -J \
    gs://solar-pv-nowcasting-data/prepared_ML_training_data/v15/test/$dir/00000[01].nc \
    gs://ocf-public/data_for_unit_tests/prepared_ML_training_data/$dir;
done
```
Sounds good, give it a go! The way I've done it before was to use:

```python
from nowcasting_dataset.dataset.batch import Batch
from nowcasting_dataset.config.model import Configuration

configuration = Configuration()
configuration.input_data = configuration.input_data.set_all_to_defaults()

batch = Batch.fake(configuration=configuration)
```
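A minimal sketch of how such a fake batch might be used in a test; the test name and the `satellite` attribute check are assumptions for illustration, not code from this thread:

```python
def test_fake_batch_has_expected_modalities():
    # The fake batch should mirror the structure of a real prepared batch,
    # so model code can be exercised without downloading anything.
    batch = Batch.fake(configuration=configuration)
    assert batch.satellite is not None
```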
I think these are just two different ways of tackling the same problem; both have advantages and disadvantages.

GCP files:
- Advantages:
- Cons:

Fake batches:
- Advantages:
- Cons:

Probably best to try it out, and make sure both approaches are good in their different ways.
Yeah, good point, I do like the fake batches... but, for this, I'm also keen to use the "real" data to help benchmark loading speed. So it feels like "real" data is probably a good option here.
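A rough sketch of the kind of loading-speed benchmark this suggests; the file path and the use of xarray are assumptions based on the `.nc` files copied above, not code from this thread:

```python
import time

import xarray as xr

start = time.perf_counter()
# Open one of the copied test batches (path is illustrative).
batch_ds = xr.open_dataset("satellite/000000.nc")
batch_ds.load()  # force the data to actually be read from disk
elapsed = time.perf_counter() - start
print(f"Loaded satellite batch in {elapsed:.3f}s")
```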
To help connect the dots, I should link to a highly related discussion: https://github.com/openclimatefix/nowcasting_dataset/issues/512
Nice one - good luck with it
- Publish a couple of pre-prepared batches (per modality) in a new GCP bucket.
- Write a simple function, `power_perceiver.testing.get_test_data_filename`, which checks if the testing data already exists locally (in a temporary directory) and, if not, downloads it.
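A minimal sketch of what `get_test_data_filename` could look like, assuming the public bucket layout above and anonymous access via gcsfs; the signature and the cache directory are assumptions, not a settled API:

```python
import tempfile
from pathlib import Path

import gcsfs  # assumption: gcsfs is used for anonymous access to the public bucket

_BUCKET_PATH = "ocf-public/data_for_unit_tests/prepared_ML_training_data"
_LOCAL_CACHE = Path(tempfile.gettempdir()) / "power_perceiver_test_data"


def get_test_data_filename(relative_filename: str) -> Path:
    """Return a local path to a test data file, downloading it from GCS if needed.

    `relative_filename` is relative to the bucket path, e.g. "satellite/000000.nc".
    """
    local_path = _LOCAL_CACHE / relative_filename
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        fs = gcsfs.GCSFileSystem(token="anon")  # public bucket: no credentials needed
        fs.get(f"{_BUCKET_PATH}/{relative_filename}", str(local_path))
    return local_path
```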