holoviz-topics / examples

Visualization-focused examples using HoloViz for specific topics
https://examples.holoviz.org
Creative Commons Attribution 4.0 International

Modernize carbon flux #411

Closed. Azaya89 closed this 1 week ago.

Azaya89 commented 3 months ago

Modernizing an example checklist

Preliminary checks

- Change `anaconda-project.yml` to use the latest workable versions of packages
- Plot API updates (discussed on a per-example basis)
- Interactivity API updates (discussed on a per-example basis)
- Panel App updates (discussed on a per-example basis)
- General code quality updates
- Text content
- Visual appearance - Example
- Visual appearance - Gallery

Workflow (after you have made the changes above)

Azaya89 commented 3 months ago

This is still a WIP. Not ready for review yet.

Azaya89 commented 3 months ago

Bug report on this example notebook: inconsistent usage of `intake`

These are the current issues preventing the complete modernization of this notebook:

  1. Version Compatibility: Although it is recommended to pin `intake` to `<2`, only version 0.6.2 runs without errors. For example, executing `metadata = cat.fluxnet_metadata().read()` results in the following traceback with other versions:
Traceback:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 metadata = cat.fluxnet_metadata().read()
      2 metadata.sample(5)

File ~/Documents/development/holoviz-topics-examples/carbon_flux/envs/default/lib/python3.11/site-packages/intake/source/csv.py:190, in CSVSource.read(self)
    186     return self._dask_df.compute()
    188 import pandas as pd
--> 190 self._get_schema()
    191 return pd.concat([self._get_partition(i) for i in range(len(self.files()))])

File ~/Documents/development/holoviz-topics-examples/carbon_flux/envs/default/lib/python3.11/site-packages/intake/source/csv.py:142, in CSVSource._get_schema(self)
    140 nrows = self._csv_kwargs.get("nrows")
    141 self._csv_kwargs["nrows"] = 10
--> 142 df = self._get_partition(0)
    143 if nrows is None:
    144     del self._csv_kwargs["nrows"]

File ~/Documents/development/holoviz-topics-examples/carbon_flux/envs/default/lib/python3.11/site-packages/intake/source/csv.py:160, in CSVSource._get_partition(self, i)
    157     return self._dask_df.get_partition(i).compute()
    159 url_part = self.files()[i]
--> 160 return self._read_pandas(url_part, i)

File ~/Documents/development/holoviz-topics-examples/carbon_flux/envs/default/lib/python3.11/site-packages/intake/source/csv.py:166, in CSVSource._read_pandas(self, url_part, i)
    163 import pandas as pd
    165 if self.pattern is None:
--> 166     return pd.read_csv(url_part, storage_options=self._storage_options, **self._csv_kwargs)
    168 drop_path_column = "include_path_column" not in self._csv_kwargs
    169 path_column = self._path_column()

[... intermediate pandas frames (read_csv -> _read -> TextFileReader -> get_handle) elided ...]

File ~/Documents/development/holoviz-topics-examples/carbon_flux/envs/default/lib/python3.11/site-packages/pandas/io/common.py:453, in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    452 elif storage_options:
--> 453     raise ValueError(
    454         "storage_options passed with file object or non-fsspec file path"
    455     )

ValueError: storage_options passed with file object or non-fsspec file path
```

Pinning `intake=0.6.2` resolves this issue without any traceback errors.
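For context, the final frame shows pandas rejecting `storage_options` for a path that fsspec does not handle, which newer `intake` versions apparently trigger. A minimal sketch of that underlying pandas behavior (the file name here is hypothetical, and the file does not even need to exist, since the check happens before it is opened):

```python
import pandas as pd

# Passing storage_options together with a plain local path (rather than an
# fsspec URL such as "s3://...") raises the same error seen above.
pd.read_csv("local.csv", storage_options={"anon": True})
# ValueError: storage_options passed with file object or non-fsspec file path
```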

  2. Inconsistency in File Downloads: The cell responsible for downloading the full fluxnet files shows inconsistent behavior:
```python
import sys

from s3fs import S3FileSystem

# `cat`, `data_columns`, and `clean_data` are defined earlier in the notebook.
s3 = S3FileSystem(anon=True)
s3_paths = s3.glob('earth-data/carbon_flux/nee_data_fusion/FLX*')

datasets = []
skipped = []
used = []

for i, s3_path in enumerate(s3_paths):
    sys.stdout.write(f'\r{i+1}/{len(s3_paths)}')

    try:
        dd = cat.fluxnet_daily(s3_path=s3_path).to_dask()
    except FileNotFoundError:
        # Fall back to the bare file name if the full S3 path is not found.
        try:
            dd = cat.fluxnet_daily(s3_path=s3_path.split('/')[-1]).to_dask()
        except FileNotFoundError:
            continue
    site = dd['site'].cat.categories.item()

    # Skip sites that are missing any of the required data columns.
    if not set(dd.columns) >= set(data_columns):
        skipped.append(site)
        continue

    datasets.append(clean_data(dd))
    used.append(site)

print()
print(f'Found {len(used)} fluxnet sites with enough data to use - skipped {len(skipped)}')
```

This cell sometimes emits the following warnings:

```
1/209
/Users/mac/Documents/development/examples/carbon_flux/envs/default/lib/python3.11/site-packages/dask_expr/_collection.py:4160: UserWarning: You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=(None, 'object'))

  warnings.warn(meta_warning(meta))
/Users/mac/Documents/development/examples/carbon_flux/envs/default/lib/python3.11/site-packages/dask_expr/_collection.py:4160: UserWarning: You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('TIMESTAMP', 'object'))

  warnings.warn(meta_warning(meta))
```

This pair of warnings is repeated for every file, all the way up to 209/209.

The circumstances under which this occurs are unclear. A temporary workaround, discovered with the help of @hoxbro, is to remove the local version of `intake` and re-download it with `anaconda-project run`; this typically resolves the issue. However, restarting the kernel and re-running the notebook from the top can bring the warnings back.
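For reference, the warning itself points at the fix: supplying `meta=` to `.apply()` so Dask knows the output type without sampling the data. A minimal sketch (the column name and dtype here are illustrative, not taken from the notebook):

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"TIMESTAMP": ["20200101", "20200102"]}), npartitions=1)

# Without `meta=`, Dask samples the data to guess the output dtype and emits
# the UserWarning above; declaring the dtype explicitly silences the warning.
parsed = ddf["TIMESTAMP"].apply(pd.to_datetime, meta=("TIMESTAMP", "datetime64[ns]"))
print(parsed.compute())
```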

  3. Cell [20] Error: The following code in Cell [20] generates a traceback when the full data is not downloaded properly (as in problem 2):
```python
partial_soil_data = df[df[soil_data_columns].notnull().any(1)]
partial_soil_data_sites = metadata[metadata.site.isin(partial_soil_data.site.unique())]
```

Traceback:

```
TypeError                                 Traceback (most recent call last)
Cell In[20], line 1
----> 1 partial_soil_data = df[df[soil_data_columns].notnull().any(1)]
      2 partial_soil_data_sites = metadata[metadata.site.isin(partial_soil_data.site.unique())]

TypeError: DataFrame.any() takes 1 positional argument but 2 were given
```

Using `any(axis=1)` resolves this error, as shown in the corrected cell below. However, when problem 2 does not occur, this cell runs without the `TypeError`.
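For clarity, the corrected version of the cell:

```python
# Same as Cell [20], but with the axis passed by keyword, which works on
# pandas >= 2 where DataFrame.any() no longer accepts a positional axis.
partial_soil_data = df[df[soil_data_columns].notnull().any(axis=1)]
partial_soil_data_sites = metadata[metadata.site.isin(partial_soil_data.site.unique())]
```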

@maximlt @droumis

Azaya89 commented 1 month ago
  1. I have completely rewritten the notebook to remove all usage of `intake`.

  2. The .csv files are now downloaded locally via `awscli` by running `anaconda-project run download_fluxnet_daily` (see the sketch after this list). This takes about a minute to download all the files and saves them in the same folder as the .txt file.

  3. Some of the cells are failing the test now and I don't know why. I will investigate that later.
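The actual download task is defined in `anaconda-project.yml` and uses `awscli`; as a rough Python equivalent of what it does (the local `data/` target directory is an assumption, and `s3fs` is swapped in here only because the notebook already uses it):

```python
from s3fs import S3FileSystem

# Anonymous access to the same public bucket used earlier in the notebook;
# downloads every FLX* file into a local data/ directory.
s3 = S3FileSystem(anon=True)
s3.get('earth-data/carbon_flux/nee_data_fusion/FLX*', 'data/')
```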

Otherwise, I think this is ready for review now.

@hoxbro

hoxbro commented 3 weeks ago

I have pushed a fix that will make the test pass. I'm unsure why it doesn't work when you scatter the index.

The doc build is failing; @Azaya89, can you try and see if you can fix this?

maximlt commented 2 weeks ago

Arf @Azaya89, I see we're still having some issues. The error we encounter looks very similar to the one reported in https://github.com/aws/aws-cli/issues/8988. Digging further in this direction should hopefully give us a solution. This, for instance, looks promising: https://github.com/aws/aws-cli/issues/5623#issuecomment-801240811, and this too: https://stackoverflow.com/questions/64992288/s3-sync-issue-running-in-azure-devops-pipeline-on-linux.

Azaya89 commented 2 weeks ago

> Arf @Azaya89 I see we're still having some issues. The error we encounter looks very similar to the one reported here aws/aws-cli#8988. Digging more into this direction should hopefully give us a solution. This for instance looks promising aws/aws-cli#5623 (comment), this too https://stackoverflow.com/questions/64992288/s3-sync-issue-running-in-azure-devops-pipeline-on-linux.

Thank you. Let me try this out...

github-actions[bot] commented 2 weeks ago

Your changes were successfully integrated in the dev site, make sure to review the pages of the projects you touched before merging this PR.

Azaya89 commented 2 weeks ago

> The doc build is failing; @Azaya89, can you try and see if you can fix this?

Fixed. I think it is ready for final review now, @hoxbro.

github-actions[bot] commented 2 weeks ago

Your changes were successfully integrated in the dev site, make sure to review the pages of the projects you touched before merging this PR.

github-actions[bot] commented 1 week ago

Your changes were successfully integrated in the dev site, make sure to review the pages of the projects you touched before merging this PR.

Azaya89 commented 1 week ago

> Another run has replaced the dev docs site. I want to make sure you checked if everything looked good before it was replaced.

LGTM!

github-actions[bot] commented 1 week ago

Your changes were successfully integrated in the dev site, make sure to review the pages of the projects you touched before merging this PR.