riley-brady opened 1 year ago
FYI, it seems like this works just fine if I revert back to `pip install ecgtools==2021.9.23`, so it seems like this is a breaking issue with the new ecgtools release.
I would prefer to use the new release, though, since I'd rather pass a list of files to the builder and do the directory crawling on my own. In most cases I'll be traversing an S3 directory and want to write some custom code for deriving the file list. It doesn't seem like the built-in crawler from 2021.9.23 can handle `S3Path` (see https://github.com/liormizr/s3path).
Sorry for these streams of consciousness.
I'm realizing that even the new version uses the base paths to crawl and find the files, which has been breaking for me when trying to traverse an S3 directory. I'd like an option to just pass in raw S3 files.
I hacked this together with a local install of the package via:

1. Removing `@pydantic.validate_arguments` from https://github.com/ncar-xdev/ecgtools/blob/f9340de17f9bf3d3ae58d97d4538d7831e9d8763/ecgtools/builder.py#L176 to shut down the pydantic checker.
2. Working around the fact that `.build(...)` seems to be forced to take two arguments rather than just the custom parser.
3. Replacing `get_assets` so it uses the paths I pass in directly:

   ```python
   def get_assets(self):
       self.assets = self.paths
       return self
   ```

4. Removing `@pydantic.validate_arguments` from https://github.com/ncar-xdev/ecgtools/blob/f9340de17f9bf3d3ae58d97d4538d7831e9d8763/ecgtools/builder.py#L214 to get my catalog to save (it similarly threw an error about expecting 2 arguments... maybe this is a necessity for pydantic?).

Probably (1) and (4) are unsustainable, because I'm sure the validation does something, but I'm in a rush. (2) could of course be fixed with a focus on argument validation (which might be related to the issues with (1)). (3) could be some sort of switch in `Builder`, something like `crawl=bool`.
Happy to lead an effort on a PR here if you find this valuable. This would be huge for my work, since I can just do the crawling myself and pass a list of Zarr files from a private S3 store. Please let me know if there's some way else to make the crawler work with an S3 store, though. Based on the size of our datasets on S3, it's not feasible for me to build the catalog locally (we're deriving/downloading/publishing the datasets natively through AWS).
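As a plain-Python illustration of the "do the crawling myself" idea: once a bucket listing is in hand (however it was obtained), deriving the Zarr asset list doesn't need any crawler at all. The bucket layout and key names below are invented for illustration, not taken from the issue:

```python
# Hypothetical listing of object keys from an S3 bucket
keys = [
    "datasets/exp1/tas.zarr/.zattrs",
    "datasets/exp1/tas.zarr/0.0",
    "datasets/exp1/pr.zarr/.zattrs",
    "datasets/readme.txt",
]

# A Zarr store is a *prefix*, not a single object, so collapse the keys
# down to the unique ".zarr" store paths:
stores = sorted({k.split(".zarr")[0] + ".zarr" for k in keys if ".zarr" in k})
print(stores)  # ['datasets/exp1/pr.zarr', 'datasets/exp1/tas.zarr']
```

A list like `stores` is exactly what could be handed to a builder if crawling were optional.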
Thank you for your patience @riley-brady! I just realized I accidentally stopped watching this repository a while ago.
> I am trying to build a simple custom parser. I am following this guide.

It turns out that guide is a bit outdated ;(

```
ValidationError: 2 validation errors for Build
parsing_func
  field required (type=value_error.missing)
args
  1 positional arguments expected but 2 given (type=type_error)
```
With the keyword-only argument requirement via `*`, the following should fix the `ValidationError`:

```python
cat_builder.build(parsing_func=parse_dummy)
cat_builder.save(name=..., another_argument=..., another_argument=...)
```
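The failure mode here is ordinary Python keyword-only semantics, which pydantic's validator surfaces as a `ValidationError`. A minimal stand-in (not the real ecgtools `Builder`) reproduces it with a plain `TypeError`:

```python
class DummyBuilder:
    """Minimal stand-in showing the keyword-only signature that
    triggers the error. Not the real ecgtools Builder."""

    def build(self, *, parsing_func):
        # The bare `*` makes parsing_func keyword-only
        return parsing_func

def parse_dummy(path):
    return {"path": path}

b = DummyBuilder()

# Positional call is rejected, mirroring the
# "1 positional arguments expected but 2 given" error:
try:
    b.build(parse_dummy)
except TypeError:
    print("positional call rejected")

# Keyword call works:
b.build(parsing_func=parse_dummy)
```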
> (3) Could be some sort of switch in `Builder`, something like `crawl=bool`.
👍🏽 for providing an option to disable the crawling or allowing users to provide custom crawlers.
If you are looking for a quick workaround, the following might work:

```python
import joblib
import pandas as pd

def parsing_func(file):
    ...

cat_builder = Builder(...)
cat_builder.assets = assets  # assets here is a list of files to parse
cat_builder.entries = joblib.Parallel(**cat_builder.joblib_parallel_kwargs)(
    joblib.delayed(parsing_func)(asset, **parsing_func_kwargs)
    for asset in cat_builder.assets
)
cat_builder.df = pd.DataFrame(cat_builder.entries)
cat_builder = cat_builder.clean_dataframe()
```
After this, you should be able to save the catalog:

```python
cat_builder.save(name=..., another_argument=..., another_argument=...)
```
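For readers without joblib at hand, the same pattern — parse a hand-built asset list in parallel, then tabulate the resulting dicts — can be sketched with only the standard library, `concurrent.futures` standing in for `joblib.Parallel`. The file names and the `variable_frequency.zarr` naming scheme are hypothetical:

```python
import concurrent.futures

def parsing_func(path):
    # Toy parser: derive catalog columns from the file name.
    # The "variable_frequency.zarr" scheme is invented for illustration.
    name = path.rsplit("/", 1)[-1]
    variable, frequency = name[: -len(".zarr")].split("_")
    return {"path": path, "variable": variable, "frequency": frequency}

# Hand-built asset list (no crawling), e.g. from a private S3 store
assets = [
    "s3://my-bucket/tas_daily.zarr",
    "s3://my-bucket/pr_monthly.zarr",
]

# Parse every asset in parallel, mirroring the joblib.Parallel call above
with concurrent.futures.ThreadPoolExecutor() as pool:
    entries = list(pool.map(parsing_func, assets))

# entries is now a list of dicts, ready for pd.DataFrame(entries)
```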
👍🏽 👍🏽 for a PR when you have time...
What happened?
When attempting to run `Build.build(custom_parser)`, I get a pydantic error that kills the catalog build, despite following the tutorial and successfully running the parser on test files.

What did you expect to happen?

A catalog to build successfully.
Minimal Complete Verifiable Example
I built a simple custom parser to test this out, following the guide.
I tested this on a simple file and it was successful
Now I make a builder object. The object successfully returns the expected list of files.
Build the catalog with the parse script.
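The steps above can be sketched in miniature: a custom parser of the kind the guide describes is just a function that takes one file path and returns a dict of catalog-column values. The `case.component.variable.nc` naming scheme below is hypothetical, invented for illustration:

```python
import pathlib

def parse_simple(file):
    """Toy parser: derive catalog columns from a CESM-like file name.
    The case.component.variable.nc naming scheme is hypothetical."""
    path = pathlib.Path(file)
    case, component, variable = path.stem.split(".")
    return {
        "path": str(path),
        "case": case,
        "component": component,
        "variable": variable,
    }

# Test the parser on a single file before handing it to the builder
entry = parse_simple("/data/b1850.atm.TS.nc")
print(entry["variable"])  # TS
```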
Relevant log output
Anything else we need to know?
- `pydantic` version: '1.10.2'
- `ecgtools` version: '2022.10.7'