openghg / openghg

A cloud platform for greenhouse gas (GHG) data analysis and collaboration.
https://www.openghg.org
Apache License 2.0
30 stars · 4 forks

Improve standardisation of multiple footprints - Zarr uniform chunk #1162

Open alexdanjou opened 1 week ago

alexdanjou commented 1 week ago

Is your feature request related to a problem?

When trying to standardise multiple files at once, a chunk size error is sometimes raised. The files then have to be standardised one by one, which is less efficient (and less user-friendly).

Standardisation attempt:

from glob import glob
from openghg.standardise._standardise import standardise_footprint

standardise_footprint(
    filepath=glob('/group/chemistry/acrg/LPDM/fp_NAME/EUROPE/HPB-130magl/rn/*'),
    site='HPB',
    inlet='130magl',
    model='NAME',
    met_model='UKV',
    domain='EUROPE',
    species='rn',
    source_format='paris',
    store='ad_test_zarr',
    period=None,
    continuous=True,
)

Error message:

Traceback (most recent call last):
  File "/user/home/bq24992/workingDir/PARIS/Footprints/test.py", line 25, in <module>
    standardise_footprint(**kwargs)
  File "/user/home/bq24992/openghg/openghg/standardise/_standardise.py", line 621, in standardise_footprint
    return standardise(
           ^^^^^^^^^^^^
  File "/user/home/bq24992/openghg/openghg/standardise/_standardise.py", line 51, in standardise
    result = dc.read_file(filepath=filepath, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/home/bq24992/openghg/openghg/store/_footprints.py", line 416, in read_file
    datasource_uuids = self.assign_data(
                       ^^^^^^^^^^^^^^^^^
  File "/user/home/bq24992/openghg/openghg/store/base/_base.py", line 367, in assign_data
    datasource.add_data(
  File "/user/home/bq24992/openghg/openghg/store/base/_datasource.py", line 155, in add_data
    return self.add_timed_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "/user/home/bq24992/openghg/openghg/store/base/_datasource.py", line 245, in add_timed_data
    self._store.add(version=version_str, dataset=data, compressor=compressor, filters=filters)
  File "/user/home/bq24992/openghg/openghg/store/storage/_localzarrstore.py", line 178, in add
    dataset.to_zarr(
  File "/user/home/bq24992/.local/lib/python3.12/site-packages/xarray/core/dataset.py", line 2520, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/home/bq24992/.local/lib/python3.12/site-packages/xarray/backends/api.py", line 1846, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/user/home/bq24992/.local/lib/python3.12/site-packages/xarray/backends/api.py", line 1386, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/user/home/bq24992/.local/lib/python3.12/site-packages/xarray/backends/zarr.py", line 667, in store
    self.set_variables(
  File "/user/home/bq24992/.local/lib/python3.12/site-packages/xarray/backends/zarr.py", line 714, in set_variables
    encoding = extract_zarr_variable_encoding(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/home/bq24992/.local/lib/python3.12/site-packages/xarray/backends/zarr.py", line 284, in extract_zarr_variable_encoding
    chunks = _determine_zarr_chunks(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/home/bq24992/.local/lib/python3.12/site-packages/xarray/backends/zarr.py", line 135, in _determine_zarr_chunks
    raise ValueError(
ValueError: Zarr requires uniform chunk sizes except for final chunk. Variable named 'particle_locations_n' has incompatible dask chunks: ((372, 372, 336, 336, 372, 372, 360, 360, 372, 372, 360, 360, 372, 372, 372, 372, 360, 360, 372, 372, 360, 360, 372, 372), (10, 10), (196, 195)). Consider rechunking using `chunk()`.
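For context, the failure can be reproduced outside OpenGHG: when files of unequal length are opened and concatenated, each file's length survives as a separate dask chunk, and `to_zarr` rejects the resulting non-uniform chunking. A minimal sketch (chunk sizes borrowed from the traceback above; the variable name `fp` is illustrative):

```python
import numpy as np
import xarray as xr

# Simulate opening monthly footprint files of unequal length:
# concatenation keeps each file's length as a separate dask chunk.
parts = [
    xr.Dataset({"fp": ("time", np.zeros(n))}).chunk({"time": n})
    for n in (372, 336, 372, 360)  # hours per file, as in the traceback
]
ds = xr.concat(parts, dim="time")
print(ds["fp"].chunks)  # ((372, 336, 372, 360),) -- non-uniform, to_zarr fails

# Rechunking to a single uniform size satisfies Zarr's requirement
ds_uniform = ds.chunk({"time": 480})
print(ds_uniform["fp"].chunks)  # ((480, 480, 480),) -- uniform
```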

Describe the solution you'd like

Would it be possible to simply rechunk the dataset?

In my case, the solution below works. It replaces https://github.com/openghg/openghg/blob/3eca7392b0f170508509be2fbad46a6eaec2dd86/openghg/store/storage/_localzarrstore.py#L162C1-L170C14 with:


try:
    dataset.to_zarr(
        store=self._stores[version],
        mode="w",
        encoding=encoding,
        consolidated=True,
        compute=True,
        synchronizer=zarr.ThreadSynchronizer(),
    )
except ValueError as v:
    if ("Zarr requires uniform chunk sizes except for final chunk." in str(v)
            and dataset.attrs["time_period"] == "1 hour"):
        # Rechunk along time and retry the write
        dataset.chunk({"time": "M"}).to_zarr(
            store=self._stores[version],
            mode="w",
            encoding=encoding,
            consolidated=True,
            compute=True,
            synchronizer=zarr.ThreadSynchronizer(),
        )
    else:
        raise

I'm not sure how sensible this fix and the condition I added (`dataset.attrs['time_period'] == '1 hour'`) are, as I don't really know the other circumstances in which this function is called.

Describe alternatives you've considered

No response

Additional context

No response

brendan-m-murphy commented 1 week ago

Can you run this through a debugger and check the chunk sizes? I'm away until Friday, but could look into it then.

The data should be rechunked after `open_mfdataset` is called. If this is a new datasource, maybe the chunks aren't being set correctly. (They should be...)
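To check the chunk sizes in a debugger, something along these lines would do it: a small helper (hypothetical, not part of OpenGHG) that flags any dask-backed variable whose chunks are not uniform except for the final chunk, i.e. exactly the condition Zarr enforces:

```python
import xarray as xr

def check_zarr_chunks(dataset: xr.Dataset) -> dict:
    """Return {variable_name: chunks} for variables whose dask chunks
    would be rejected by Zarr (non-uniform before the final chunk)."""
    bad = {}
    for name, var in dataset.variables.items():
        if var.chunks is None:  # not dask-backed, nothing to check
            continue
        for dim_chunks in var.chunks:
            # All chunks except the last must be the same size
            if len(set(dim_chunks[:-1])) > 1:
                bad[name] = var.chunks
                break
    return bad
```

Called on the dataset just before the `to_zarr` line in `LocalZarrStore.add` (or at a breakpoint), this would show whether `particle_locations_n` is the only offending variable or one of several.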

alexdanjou commented 1 week ago

Hmm, you're right, chunking should also have been done at the previous line. I'm checking that.