openclimatefix / ocf_datapipes

OCF's DataPipe based dataloader for training and inference
MIT License

Should we remove `datapipes`? Yes #342

Open peterdudfield opened 2 months ago

peterdudfield commented 2 months ago

Detailed Description

The idea is to remove torch datapipes from this repo. We would essentially replace them with plain Python functions. For our ML models, we can then wrap these functions in torch datasets afterwards.
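
A minimal sketch of what this could look like, not the repo's actual code: two plain functions and a map-style dataset that calls them. The names `select_time_slice`, `normalise` and `WindowDataset` are hypothetical, a plain list stands in for an xarray object so the snippet has no dependencies, and the `try`/`except` fallback is only there so it runs without torch installed.

```python
# Fall back to a plain base class so the sketch also runs without torch.
try:
    from torch.utils.data import Dataset
except ImportError:  # assumption: torch may be absent where this is tried out
    Dataset = object


def select_time_slice(series, t0_idx, history, forecast):
    """Return the values from t0_idx - history to t0_idx + forecast inclusive."""
    return series[t0_idx - history : t0_idx + forecast + 1]


def normalise(values, capacity):
    """Scale raw values by a capacity so they lie between 0 and 1."""
    return [v / capacity for v in values]


class WindowDataset(Dataset):
    """Hypothetical map-style dataset built only from the plain functions above."""

    def __init__(self, series, capacity, history=2, forecast=1):
        self.series = series
        self.capacity = capacity
        self.history = history
        self.forecast = forecast
        # Only keep t0 indices that have a full window on both sides.
        self.t0_indices = list(range(history, len(series) - forecast))

    def __len__(self):
        return len(self.t0_indices)

    def __getitem__(self, i):
        t0_idx = self.t0_indices[i]
        window = select_time_slice(self.series, t0_idx, self.history, self.forecast)
        return normalise(window, self.capacity)


dataset = WindowDataset(series=[0, 10, 20, 30, 40, 50], capacity=50)
first_sample = dataset[0]
```

With torch installed, `dataset` can be handed to `torch.utils.data.DataLoader`, while the two functions stay testable without torch at all.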

Pros and cons

| Pros of keeping datapipes | Cons of keeping datapipes |
| --- | --- |
| Less work not to do it | Not sure what the benefits are for the extra code |
| Torch data is good for streaming data | We use xarray, which is good for streaming large data |
| | It's complex and hard to make changes |
| | Datapipes are not well supported by the community |
| | Forking is annoying |
| | We have used some infinite loops, which are bad |
| | Debugging and logging is hard |
| | We can use torch datasets, which are widely used |
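
One possible reading of the infinite-loop and debugging points, as a rough sketch rather than code from this repo: an iterator-style source that never ends has to be capped by every consumer, whereas a function called once per sample index has an explicit length and can be run on a single index when debugging. `pick_locations_forever` and `pick_location` are hypothetical names, and the location IDs are made up.

```python
import itertools
import random


def pick_locations_forever(locations, seed=0):
    """Iterator-style source: yields random locations and never terminates."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(locations)


def pick_location(locations, index):
    """Function-style alternative: one deterministic result per sample index."""
    return locations[index % len(locations)]


locations = ["gsp_1", "gsp_2", "gsp_3"]

# The generator has no natural end, so the consumer must cap it explicitly.
capped = list(itertools.islice(pick_locations_forever(locations), 5))

# The function is called once per index, so the number of samples is explicit
# and any single index can be reproduced in isolation.
epoch = [pick_location(locations, i) for i in range(5)]
```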

Possible Implementation

  1. Start a fresh repo and copy over the functions we need
  2. Refactor this repo:
     - pull out functions from all datapipes
     - rebuild the dataflow using the new functions, rebuilding these files

| 1. Pros (fresh repo) | 2. Pros (refactor this repo) |
| --- | --- |
| Nice to start with a fresh repo | No code duplication |
| Easier to refactor, don't need to worry about breaking tests | Can continue developing |
| Could get one entire pipeline working first, e.g. PVNet | Don't need to set up CI |
| | Might be able to iterate and solve each function |

What I would like to keep

  1. Configuration
  2. Batch structure
  3. Making building blocks, so it doesn't matter what order we do things in. Passing around mainly xarray objects seemed to work
  4. There are some good READMEs that help, and here
  5. The folder structure, which I think is nice
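
To illustrate the building-block point (item 3), here is a small sketch under stated assumptions: each step takes a data object and returns the same kind of object, so steps can be chained in any order. A list of dicts stands in for an xarray object so the snippet runs with no dependencies, and `drop_nan_values`, `filter_on_capacity` and `run_steps` are hypothetical names, not functions from this repo.

```python
import math
from functools import reduce


def drop_nan_values(records):
    """Keep only records whose power reading is not NaN."""
    return [r for r in records if not math.isnan(r["power"])]


def filter_on_capacity(records, min_capacity=1.0):
    """Keep only records for sites at or above a minimum capacity."""
    return [r for r in records if r["capacity"] >= min_capacity]


def run_steps(data, steps):
    """Apply each step in turn, feeding the output of one into the next."""
    return reduce(lambda acc, step: step(acc), steps, data)


records = [
    {"site": "a", "capacity": 4.0, "power": 1.2},
    {"site": "b", "capacity": 0.5, "power": 0.1},
    {"site": "c", "capacity": 3.0, "power": float("nan")},
]

# Same input type and output type for every step, so the order can be swapped.
one_order = run_steps(records, [drop_nan_values, filter_on_capacity])
other_order = run_steps(records, [filter_on_capacity, drop_nan_values])
```

Order independence only holds for steps like these two filters, which do not depend on each other's output. A step that needs, say, normalised values would still have to run after the normalisation step.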

Other things to do

peterdudfield commented 1 month ago

Here's a list of all the current datapipes we have

  1. MergeNumpyBatchIterDataPipe
  2. MergeNumpyExamplesToBatchIterDataPipe
  3. MergeNumpyModalitiesIterDataPipe
  4. MergeNumpyModalitiesIterDataPipe
  5. ConvertLonLatToOSGBIterDataPipe
  6. ConvertOSGBToLonLatIterDataPipe
  7. ConvertGeostationaryToLonLatIterDataPipe
  8. StackXarrayIterDataPipe
  9. ConvertGSPToNumpyIterDataPipe
  10. ConvertPVToNumpyIterDataPipe
  11. ConvertGSPToNumpyBatchIterDataPipe
  12. ConvertNWPToNumpyBatchIterDataPipe
  13. ConvertPVToNumpyBatchIterDataPipe
  14. ConvertSatelliteToNumpyBatchIterDataPipe
  15. ConvertSensorToNumpyBatchIterDataPipe
  16. ConvertWindToNumpyBatchIterDataPipe
  17. OpenConfigurationIterDataPipe
  18. OpenSatelliteIterDataPipe
  19. OpenTopographyIterDataPipe
  20. OpenGSPFromDatabaseIterDataPipe
  21. OpenGSPIterDataPipe
  22. OpenGSPNationalIterDataPipe
  23. OpenNWPIterDataPipe
  24. OpenPVFromPVSitesDBIterDataPipe
  25. OpenPVFromNetCDFIterDataPipe
  26. OpenAWOSFromNetCDFIterDataPipe
  27. OpenMeteomaticsFromZarrIterDataPipe
  28. OpenWindFromNetCDFIterDataPipe
  29. ApplyPVDropoutIterDataPipe
  30. DrawDropoutTimeIterDataPipe
  31. ApplyDropoutTimeIterDataPipe
  32. FilterChannelsIterDataPipe
  33. FilterGSPIDsIterDataPipe
  34. FilterPvSysGeneratingOvernightIterDataPipe
  35. FilterPVSystemsWithOnlyNanInADayIterDataPipe
  36. FilterPVSystemsOnCapacityIterDataPipe
  37. FilterTimePeriodsIterDataPipe
  38. FilterTimesIterDataPipe
  39. FilterToOverlappingTimePeriodsIterDataPipe
  40. FindContiguousT0TimePeriodsIterDataPipe
  41. PickLocationsIterDataPipe
  42. PickLocationsAndT0sIterDataPipe
  43. PickT0TimesIterDataPipe
  44. SelectIDIterDataPipe
  45. SelectNonNaNTimesIterDataPipe
  46. SelectSpatialSlicePixelsIterDataPipe
  47. SelectSpatialSliceMetersIterDataPipe
  48. SelectTimeSliceIterDataPipe
  49. SelectTimeSliceNWPIterDataPipe
  50. PVNetSelectPVbyMLIDIterDataPipe
  51. ListMap
  52. SelectAllGSPSpatialSlicePixelsIterDataPipe
  53. SelectAllGSPSpatialSliceMetersIterDataPipe
  54. ConvertToNumpyBatchIterDataPipe
  55. DictDatasetIterDataPipe
  56. LoadDictDatasetIterDataPipe
  57. ConvertToNumpyBatchIterDataPipe
  58. AddFourierSpaceTimeIterDataPipe
  59. AddTopographicDataIterDataPipe
  60. AddSunPositionIterDataPipe
  61. AddT0IdxAndSamplePeriodDurationIterDataPipe
  62. ConvertPressureLevelsToSeparateVariablesIterDataPipe
  63. CreateSunImageIterDataPipe
  64. CreateTimeImageIterDataPipe
  65. DownsampleIterDataPipe
  66. NormalizeIterDataPipe
  67. ReprojectTopographyIterDataPipe
  68. UpSampleIterDataPipe
  69. CreateGSPImageIterDataPipe
  70. EnsureNGSPSPerExampleIterDataPipe
  71. AssignDayNightStatusIterDataPipe
  72. CreatePVImageIterDataPipe
  73. CreatePVMetadataImageIterDataPipe
  74. EnsureNPVSystemsPerExampleIterDataPipe
  75. PVFillNightNansIterDataPipe
  76. PVInterpolateInfillIterDataPipe
  77. PVPowerRollingWindowIterDataPipe
  78. PVPowerRemoveZeroDataIterDataPipe
  79. ZipperIterDataPipe
  80. RepeaterIterDataPipe
  81. UnZipperIterDataPipe
  82. LengthSetterIterDataPipe
  83. HeaderIterDataPipe
  84. CheckValueEqualToFractionIterDataPipe
  85. CheckGreaterThanOrEqualToIterDataPipe
  86. CheckLessThanOrEqualToIterDataPipe
  87. CheckNotEqualToIterDataPipe
  88. CheckNaNsIterDataPipe
  89. CheckVarsAndDimsIterDataPipe

peterdudfield commented 1 month ago

More detailed analysis here