openclimatefix / nowcasting_dataset

Prepare batches of data for training machine learning solar electricity nowcasting models
https://nowcasting-dataset.readthedocs.io/en/stable/
MIT License

Discussion: For testing, should we use "fake" data or a small amount of real data? #512

Open JackKelly opened 2 years ago

JackKelly commented 2 years ago

(Let's not worry about this now... just making a note to discuss in early 2022!)

As we all know, in order for "fake" data to be useful for testing, the "fake" data needs to accurately capture almost all of the structure of "real" data. Otherwise the "fake" data could drive us to reach incorrect conclusions when debugging and testing our code (as happened when debugging the OpticalFlowDatasource tests).

Creating really "realistic" fake data is probably quite a lot of effort (for example, see issue #511).

I suppose I'm curious whether it might actually be less work to use a small amount of real data for testing, instead of maintaining code to create "fake" data on the fly? And include this sample of real data in the nowcasting_dataset/tests/data/ folder?

Strictly speaking, we're not allowed to share some of our data sources. Maybe it wouldn't be too much work to obfuscate a small amount of "real" data, though (e.g. the PV locations could be replaced with the LSOA locations that we're allowed to share publicly; and, for the other data sources, we could add a small amount of random noise to all the values?)
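
A rough sketch of what that obfuscation could look like (the function name, column names, and DataFrame layout here are all hypothetical, not part of the nowcasting_dataset API):

```python
import numpy as np
import pandas as pd

def obfuscate_pv_sample(pv_df: pd.DataFrame, rng: np.random.Generator,
                        noise_fraction: float = 0.05) -> pd.DataFrame:
    """Hypothetical helper: make a small PV sample shareable.

    - Swap exact system locations for publicly shareable LSOA centroids
      (assumed here to be pre-computed columns on the DataFrame).
    - Add multiplicative random noise to the power readings so the
      original values cannot be recovered.
    """
    obfuscated = pv_df.copy()
    noise = rng.normal(loc=1.0, scale=noise_fraction, size=len(obfuscated))
    obfuscated["power_w"] = obfuscated["power_w"] * noise
    # Replace precise locations with the shareable LSOA centroids.
    obfuscated["latitude"] = obfuscated["lsoa_latitude"]
    obfuscated["longitude"] = obfuscated["lsoa_longitude"]
    return obfuscated.drop(columns=["lsoa_latitude", "lsoa_longitude"])
```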

jacobbieker commented 2 years ago

Yeah, I think maintaining the fake data as we update the real data, etc. has been a bit of a drag and has caused problems quite a few times, so I would be all for using some real data with some noise added to the values for this!

JackKelly commented 2 years ago

I should add that I think adding fake() was a really good idea... it's just that, now that nowcasting_dataset is becoming more complex, it might be advantageous to shift (in the new year?) to using a tiny amount of "obfuscated" real data instead of "fake" data?

peterdudfield commented 2 years ago

This is an interesting one. I think there's probably a need for both, and there are definitely pros and cons to each.

Perhaps one way to look at it is: Unit tests: use fake data. This needs some maintenance to make sure the data is made correctly. That can be a drag, but it's kinda handy while the upstream data pipeline is changing. It may also be quicker? to make fake data than to load a file? I do like this further down the ML pipeline, where there's an easy way to make fake batches (see the sketch below).
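
For illustration, a minimal pytest fixture along these lines could produce fake batches cheaply; the field names and array shapes below are invented, and nowcasting_dataset's real fake() helpers may look quite different:

```python
import numpy as np
import pytest

@pytest.fixture
def fake_batch():
    """Return a tiny fake batch with plausible shapes but random values.

    Cheap to build and needs no files on disk, but it has to be kept in
    sync with the real data pipeline by hand (field names and shapes here
    are hypothetical).
    """
    rng = np.random.default_rng(seed=42)
    return {
        "satellite": rng.random((4, 12, 24, 24, 1)),  # batch, time, y, x, channel
        "pv_yield": rng.random((4, 12, 32)),          # batch, time, pv_system
        "gsp_yield": rng.random((4, 4, 8)),           # batch, time, gsp
    }

def test_model_accepts_fake_batch(fake_batch):
    # Downstream ML code mostly needs the shapes to line up, not the values.
    assert fake_batch["satellite"].shape[0] == fake_batch["pv_yield"].shape[0]
```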

Test data tests: make sure there are tests that run off the test data. I think at the moment test_manager.py might do this. This makes sure that functions work correctly with the real data. The downside is we have to git commit ~10 MB or more, which perhaps goes against traditional practice. This test data also has to be updated whenever the data pipeline changes.
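
For the real-data side, a sketch of the kind of test that could run off a committed sample; the file name, format, and assertions are illustrative, and the actual tests in test_manager.py may be structured differently:

```python
from pathlib import Path
import xarray as xr

# Assumed location of the committed sample, e.g. nowcasting_dataset/tests/data/
TEST_DATA = Path(__file__).parent / "data"

def test_loads_real_sample():
    """Run the loader against a small, obfuscated, committed real sample.

    Catches mismatches between the code and the real data format, at the
    cost of committing several MB to git and re-generating the sample
    whenever the pipeline changes (file name here is hypothetical).
    """
    dataset = xr.open_dataset(TEST_DATA / "sample_satellite.netcdf")
    assert "time" in dataset.dims
    assert dataset.sizes["time"] > 0
```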