[Bug]: pydantic causing simple parse script to fail on build

riley-brady commented 1 year ago

What happened?

When attempting to run Build.build(custom_parser) I get a pydantic error that kills the catalog build, despite following the tutorial and successfully running the parser on test files.

What did you expect to happen?

A catalog to build successfully

Minimal Complete Verifiable Example

I am trying to build a simple custom parser. I am following [this guide](https://ecgtools.readthedocs.io/en/latest/how-to/use-a-custom-parser.html).

0. Make some fake data.

for var in ['pr', 'tas', 'tasmax']:
    ds = xr.DataArray([1,2,3], dims='time').rename(var).to_dataset()
    ds.to_netcdf(f"./data/{var}_daily_EC-Earth3_historical.nc")

1. Put together list of paths.
```python
root_path = pathlib.Path("./data/")
files = list(root_path.glob("*.nc"))
# Convert to string paths since the builder only takes string paths.
files = [str(f) for f in files]

I built a simple custom parser to test this out, following the guide.

def parse_dummy(file):
fpath = pathlib.Path(file)
info = {}

try:
    # Just extracting metadata from the filename in this case. Same errors
    # occur when including loading into a dataset.
    variable, temporal_resolution, model, scenario = fpath.stem.split('_')
    info = {
            'variable': variable,
            'temporal': temporal_resolution,
            'source': model,
            'path': str(file),
        }

    return info

except Exception:
    return {INVALID_ASSET: file, TRACEBACK: traceback.format_exc()}

I tested this on a simple file and it was successful

parse_dummy(files[0])
{'variable': 'pr',
'temporal': 'daily',
'source': 'EC-Earth3',
'path': './data/pr_daily_EC-Earth3_historical.nc'}

Now I make a builder object. The object successfully returns the expected list of files.

# Tried the pathlib object here following the demo but got an error that pydantic
# wanted strings only.
cat_builder = Builder(files)
>>> Builder(paths=['data/pr_daily_EC-Earth3_historical.nc', 'data/tasmax_daily_EC-Earth3_historical.nc', 'data/tas_daily_EC-Earth3_historical.nc'], storage_options={}, depth=0, exclude_patterns=[], include_patterns=[], joblib_parallel_kwargs={})

Build the catalog with the parse script.
```
cat_builder.build(parse_dummy)
```

Relevant log output

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Input In [84], in <cell line: 1>()
----> 1 cat_builder.build(parse_fake)

File ~/miniconda3/envs/analysis_py39/lib/python3.9/site-packages/pydantic/decorator.py:40, in pydantic.decorator.validate_arguments.validate.wrapper_function()

File ~/miniconda3/envs/analysis_py39/lib/python3.9/site-packages/pydantic/decorator.py:133, in pydantic.decorator.ValidatedFunction.call()

File ~/miniconda3/envs/analysis_py39/lib/python3.9/site-packages/pydantic/decorator.py:130, in pydantic.decorator.ValidatedFunction.init_model_instance()

File ~/miniconda3/envs/analysis_py39/lib/python3.9/site-packages/pydantic/main.py:342, in pydantic.main.BaseModel.__init__()

ValidationError: 2 validation errors for Build
parsing_func
  field required (type=value_error.missing)
args
  1 positional arguments expected but 2 given (type=type_error)

Anything else we need to know?

pydantic version: '1.10.2' ecgtools version: '2022.10.7'

riley-brady commented 1 year ago

FYI, it seems like this works just fine if I revert back to pip install ecgtools==2021.9.23, so it seems like this is a breaking issue with the new ecgtools release.

Although I would prefer to use the new release, since I'd rather pass in a list of files to the builder and do the directory crawling on my own. In most cases, I'll be traversing an s3 directory and want to write some custom code for deriving the filelist. It doesn't seem like the built-in crawler from 2021.9.23 can handle S3Path (See https://github.com/liormizr/s3path).

riley-brady commented 1 year ago

Sorry for these streams of consciousness.

I'm realizing that even the new version uses the base paths to crawl and find the files, which has been breaking for me when trying to traverse an S3 directory. I'd like an option to just pass in raw S3 files.

I hacked this together with a local install of the package via:

Deleting the * from https://github.com/ncar-xdev/ecgtools/blob/f9340de17f9bf3d3ae58d97d4538d7831e9d8763/ecgtools/builder.py#L176 which seems to force .build(...) to take two arguments rather than just the custom parser.
Deleting all cases of @pydantic.validate_arguments to shut down the pydantic checker.
Changing https://github.com/ncar-xdev/ecgtools/blob/f9340de17f9bf3d3ae58d97d4538d7831e9d8763/ecgtools/builder.py#L141-L144 to the following:
```
def get_assets(self):
    self.assets = self.paths
    return self
```
Deleting the * from https://github.com/ncar-xdev/ecgtools/blob/f9340de17f9bf3d3ae58d97d4538d7831e9d8763/ecgtools/builder.py#L214 to get my catalog to save (similarly threw an error about expecting 2 arguments... maybe this is a necessity for pydantic?)

Probably (1) and (4) are unsustainable, because I'm sure it does something but I'm in a rush. (2) Could be fixed of course with focus on argument validation (which might be related to issues with (1)).

(3) Could be some sort of switch in Builder, something like crawl=bool.

Happy to lead an effort on a PR here if you find this valuable. This would be huge for my work, since I can just do the crawling myself and pass a list of Zarr files from a private S3 store. Please let me know if there's some way else to make the crawler work with an S3 store, though. Based on the size of our datasets on S3, it's not feasible for me to build the catalog locally (we're deriving/downloading/publishing the datasets natively through AWS).

andersy005 commented 1 year ago

Thank you for your patience @riley-brady! i just realized i accidentally stopped watching this repository a while ago.

I am trying to build a simple custom parser. I am following this guide.

It turns out that guide is a bit outdated ;(

ValidationError: 2 validation errors for Build parsing_func field required (type=value_error.missing) args 1 positional arguments expected but 2 given (type=type_error)

w/ the keyword-only argument requirement via *, the following should fix the ValidationError

cat_builder.build(parsing_func=parse_dummy)

cat_builder.save(name=..., another_argument=..., another_argument=...)

(3) Could be some sort of switch in Builder, something like crawl=bool.

👍🏽 for providing an option to disable the crawling or allowing users to provide custom crawlers.

if you are looking into a quick workaround, the following might work

import joblib

def parsing_func(file):
   ....

cat_builder  = Builder(.....)
cat_builders.assets = assets # assets here is a list of files to parse
cat_builders.entries =   joblib.Parallel(**cat_builder.joblib_parallel_kwargs)(
            joblib.delayed(parsing_func)(asset, **parsing_func_kwargs) for asset in cat_builder.assets
        )

cat_builder.df = pd.DataFrame(cat_builder.entries)
cat_builder = cat_builder.clean_dataframe()

After this, you should be able to save the catalog

cat_builder.save(name=..., another_argument=..., another_argument=...)

👍🏽 👍🏽 for a PR when you have time...

ncar-xdev / ecgtools