icenet-ai / icenet

The icenet library is a pip installable python package containing the commands and code you need to produce forecasts
MIT License

Producing "latest" training data potentially includes invalid ground truth dates #262

Open JimCircadian opened 5 months ago

JimCircadian commented 5 months ago

Description

I'm running a large icenet_dataset_create job to cache the tfrecords. The available data runs up to 25/12/2023, so the end date is configured to match. Running the process scripts with that end date produces an invalid SIC selection:

Traceback (most recent call last):
  File "/rds/user/USER/hpc-work/icenet/icenet/icenet/data/loaders/dask.py", line 408, in generate_sample
    sample_output = var_ds.siconca_abs.sel(time=forecast_dts)
  File "/home/USER/.conda/envs/icenet/lib/python3.9/site-packages/xarray/core/dataarray.py", line 1536, in sel
    ds = self._to_temp_dataset().sel(
  File "/home/USER/.conda/envs/icenet/lib/python3.9/site-packages/xarray/core/dataset.py", line 2573, in sel
    query_results = map_index_queries(
  File "/home/USER/.conda/envs/icenet/lib/python3.9/site-packages/xarray/core/indexing.py", line 188, in map_index_queries
    results.append(index.sel(labels, **options))
  File "/home/USER/.conda/envs/icenet/lib/python3.9/site-packages/xarray/core/indexes.py", line 489, in sel
    raise KeyError(f"not all values found in index {coord_name!r}")
KeyError: "not all values found in index 'time'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/USER/.conda/envs/icenet/bin/icenet_dataset_create", line 33, in <module>
    sys.exit(load_entry_point('icenet', 'console_scripts', 'icenet_dataset_create')())
  File "/rds/user/USER/hpc-work/icenet/icenet/icenet/data/loader.py", line 126, in create
    dl.generate()
  File "/rds/user/USER/hpc-work/icenet/icenet/icenet/data/loaders/dask.py", line 78, in generate
    self.client_generate(client,
  File "/rds/user/USER/hpc-work/icenet/icenet/icenet/data/loaders/dask.py", line 218, in client_generate
    in client.gather(futures):
  File "/home/USER/.conda/envs/icenet/lib/python3.9/site-packages/distributed/client.py", line 2372, in gather
    return self.sync(
  File "/rds/user/USER/hpc-work/icenet/icenet/icenet/data/loaders/dask.py", line 340, in generate_and_write
    x, y, sample_weights = generate_sample(date, var_ds, var_files,
  File "/rds/user/USER/hpc-work/icenet/icenet/icenet/data/loaders/dask.py", line 414, in generate_sample
    raise RuntimeError(sic_ex)
RuntimeError: "not all values found in index 'time'"

The failure appears to be in the ground truth selection, which suggests generate_sample is selecting dates past the end of the available training data. The icenet_process commands do not limit the training date range according to the number of forecast days, so we need to ensure the forecast window is correctly accounted for when creating samples.
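As a rough illustration of what "accounting for the forecast window" could look like, here is a minimal sketch that drops any sample date whose forecast days are not all present in the available data. The function names and the `first_lead` parameter are hypothetical, not part of icenet, and whether the first forecast day is the sample date itself (`first_lead=0`) or the following day is an assumption that would need checking against generate_sample.

```python
import datetime as dt


def has_full_forecast_window(init_date, available, n_forecast_days, first_lead=0):
    """True if every forecast day for init_date is in the available set.

    first_lead is an assumption: 0 means the window starts on init_date,
    1 means it starts the day after.
    """
    leads = range(first_lead, first_lead + n_forecast_days)
    return all(init_date + dt.timedelta(days=n) in available for n in leads)


def filter_sample_dates(sample_dates, available_dates, n_forecast_days, first_lead=0):
    """Keep only the sample dates that have complete ground truth."""
    available = set(available_dates)
    return [
        d for d in sample_dates
        if has_full_forecast_window(d, available, n_forecast_days, first_lead)
    ]


# Toy example: data available for 1-25 December, 5 forecast days.
available_dates = [dt.date(2023, 12, day) for day in range(1, 26)]
sample_dates = [dt.date(2023, 12, day) for day in range(19, 26)]
kept = filter_sample_dates(sample_dates, available_dates, n_forecast_days=5)
print(kept)
```

Checking membership for every forecast day, rather than only comparing against the end date, would also catch gaps in the middle of the ground truth record.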

$ cat loader.full_train_south.json | jq '.sources.osisaf.dates.train' | grep '2023_12_25'
  "2023_12_25"

This is likely only being observed because this training configuration introduces data at the END of the full data window: the test and validation sets are pre-2023.
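To see how many train dates sit too close to the end of the window, something like the following could be run against the loader config. It is only a sketch: it assumes the layout implied by the jq query above (`sources.<name>.dates.train` holding `YYYY_MM_DD` strings), the helper name is made up, and the config here is a small in-memory stand-in rather than the real loader.full_train_south.json.

```python
import datetime as dt


def train_dates_after(config, source, cutoff):
    """Return the train dates for `source` that fall after `cutoff`, sorted."""
    raw = config["sources"][source]["dates"]["train"]
    parsed = [dt.datetime.strptime(s, "%Y_%m_%d").date() for s in raw]
    return sorted(d for d in parsed if d > cutoff)


# Stand-in for the loaded JSON config (normally json.load on the loader file).
example_config = {
    "sources": {
        "osisaf": {
            "dates": {"train": ["2023_09_01", "2023_09_30", "2023_12_25"]}
        }
    }
}

late = train_dates_after(example_config, "osisaf", dt.date(2023, 9, 22))
print([d.isoformat() for d in late])
```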

JimCircadian commented 5 months ago

The earliest 93-day window we can train from with that end date starts on 23/09/2023, so I'm trimming 23/09/23 onwards from the loader dates for the moment.
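The 23/09/2023 figure can be reproduced with plain date arithmetic. A quick sketch, assuming 93 forecast days and an end date of 25/12/2023; whether the forecast window starts on the sample date or the day after is an assumption here rather than something confirmed from the loader code, so both cases are shown.

```python
import datetime as dt

last_available = dt.date(2023, 12, 25)
n_forecast_days = 93

# If the forecast window covers init+1 .. init+93, the last usable init date is:
latest_init_lead_from_1 = last_available - dt.timedelta(days=n_forecast_days)

# If the forecast window covers init+0 .. init+92, the last usable init date is:
latest_init_lead_from_0 = last_available - dt.timedelta(days=n_forecast_days - 1)

print(latest_init_lead_from_1)  # 2023-09-23
print(latest_init_lead_from_0)  # 2023-09-24
```

Either way, trimming from 23/09/2023 onwards is on the safe side: it removes at most one or two sample dates that could have been kept.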