intake / intake-esm

An intake plugin for parsing an Earth System Model (ESM) catalog and loading assets into xarray datasets.
https://intake-esm.readthedocs.io
Apache License 2.0

ENH: Different backends, external catalog formats #622

Open aulemahal opened 11 months ago

aulemahal commented 11 months ago

Is your feature request related to a problem? Please describe. At Ouranos, we use intake-esm to catalog our on-premise data. There are a few types of datasets that produce enormous catalog files, which are then slow and heavy to manipulate in-memory with pandas and intake-esm. (The biggest culprit is the data from our RCM that has a single netCDF file for each variable and month, and there's a good supply of simulations...)

Adjacent problem: intake-esm supports having a list of variables in the variable column, but there is no clean way to represent that in a CSV, so hacky workarounds are needed.

Describe the solution you'd like It could be interesting to offer a choice of catalog backends instead of only pandas DataFrames read from CSVs.

For example, polars provides some performance improvements over pandas: it can "scan" a CSV lazily instead of reading it fully into memory, which would at least speed up catalog creation.

Alternatively, a real database could be more interesting than a CSV if it avoided loading all the rows into memory. Instead, each search call could issue a real SQL query.
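A rough sketch of that idea with the stdlib sqlite3 module (table and column names are invented; a real backend would translate intake-esm's `search()` keywords into the WHERE clause):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE catalog (variable TEXT, frequency TEXT, path TEXT)")
con.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?)",
    [
        ("tas", "mon", "/data/tas.nc"),
        ("pr", "mon", "/data/pr.nc"),
        ("tas", "day", "/data/tas_day.nc"),
    ],
)

# A search() call becomes a query; only matching rows are ever materialized.
rows = con.execute(
    "SELECT path FROM catalog WHERE variable = ? AND frequency = ?",
    ("tas", "mon"),
).fetchall()
```

The same pattern would apply to Postgres or any other engine; the catalog stays on disk and each query returns only the hits.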

Or dask's DataFrame ?

In any case, I think the first step would be to generalize ESMCatalogModel so that it can be subclassed for different kinds of backend. I'm not sure what the minimal API would be, though. I also don't know how this backend choice could be expressed in the ESM collection spec itself.
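One possible shape for that generalization (every name below is hypothetical, not existing intake-esm API): an abstract base class that each backend implements, with `search()` returning another backend object instead of a DataFrame, so materialization is deferred:

```python
from abc import ABC, abstractmethod


class CatalogBackend(ABC):
    """Hypothetical minimal interface a table backend would implement."""

    @abstractmethod
    def search(self, **query) -> "CatalogBackend":
        """Return a new backend holding only the matching rows."""

    @abstractmethod
    def unique(self, column: str) -> list:
        """Distinct values of a column."""

    @abstractmethod
    def to_pandas(self):
        """Materialize as a pandas DataFrame only when truly needed."""


class ListBackend(CatalogBackend):
    """Toy in-memory implementation, just to show the contract."""

    def __init__(self, records):
        self.records = records  # list of dicts, one per asset

    def search(self, **query):
        keep = [
            r for r in self.records
            if all(r.get(k) == v for k, v in query.items())
        ]
        return ListBackend(keep)

    def unique(self, column):
        return sorted({r[column] for r in self.records})

    def to_pandas(self):
        import pandas as pd
        return pd.DataFrame(self.records)


cat = ListBackend(
    [{"variable": "tas", "path": "a.nc"}, {"variable": "pr", "path": "b.nc"}]
)
hits = cat.search(variable="tas")
```

A polars, dask, or SQL backend would implement the same three methods, each deferring work in its own way.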

Describe alternatives you've considered Waiting longer for my current code to run is the most common alternative I've used ;).

To have lighter in-memory DataFrames, we pass a series of dtypes to read_csv_kwargs in our main code. See some code in xscen. But that's only doable there because the column names are more or less fixed within the context of the package. The "category" dtype drastically reduces the size of columns with a lot of repetition. pyarrow is useful for string columns as well.
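For reference, the memory effect of the "category" dtype is easy to see with plain pandas (the column name below is illustrative; in intake-esm the mapping would be passed via `read_csv_kwargs={"dtype": ...}`):

```python
import pandas as pd

# A column with heavy repetition, as in large model catalogs.
df = pd.DataFrame({"activity_id": ["CMIP"] * 50_000 + ["ScenarioMIP"] * 50_000})

as_object = df["activity_id"].memory_usage(deep=True)
as_category = df["activity_id"].astype("category").memory_usage(deep=True)
# The categorical column stores each distinct string once plus small codes.
```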

Notice also the hacky code (above) that parses the lists of variables.
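The kind of workaround meant here looks roughly like this (a sketch, not the actual xscen code): CSV cells holding stringified lists such as `"['tas', 'pr']"` are re-parsed with `ast.literal_eval` through a pandas converter:

```python
import ast
import io

import pandas as pd

# Hypothetical catalog where the variable column mixes lists and scalars.
csv = io.StringIO(
    'variable,path\n'
    '"[\'tas\', \'pr\']",/data/day.nc\n'
    '"[\'tas\']",/data/mon.nc\n'
)

def parse_maybe_list(cell):
    # Turn "['tas', 'pr']" back into a real list; leave plain strings alone.
    return ast.literal_eval(cell) if cell.startswith("[") else cell

df = pd.read_csv(csv, converters={"variable": parse_maybe_list})
```

It works, but every reader of the catalog must know to apply the converter, which is exactly why a format with native list support (Parquet, a database) would be cleaner.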

Additional context I guess there are two distinct things in my suggestions:

  1. Allowing more input formats than CSV for external catalogs (SQL, Parquet, etc.)
  2. Allowing a different table backend for potential performance improvement (pandas, dask, polars, sql, etc)

Sadly, I don't have time to work on this myself. However, if this issue gains momentum and is of interest for more than just my group, my organization might be willing to invest some resources, most likely through an internship.

mgrover1 commented 1 month ago

Revisiting this @aulemahal - I think coordinating with @nocollier 's work on intake-esgf would be helpful. Within that package, the core idea is implementing different indexes (catalogs, in intake-esm terms), each with a set of methods: search, get_file_info, and from_tracking_ids. The package currently works with Solr databases as well as the Globus-hosted Elasticsearch index.

At the ESGF conference last week, we brought up the desire to coordinate efforts between intake-esm and intake-esgf, mainly making the two catalogs cross-compatible. Perhaps it would be easier to set up a call to discuss the next steps here? Coordination on this effort would be great!

aulemahal commented 1 month ago

Hi @mgrover1! I would be available for a video call. I can't promise much development time from our side, but at least I can pitch in with ideas and discussion.

mgrover1 commented 1 month ago

Is there a day/time that works best for you next week?

aulemahal commented 1 month ago

Any time Tuesday 3-6 pm, Wednesday 3-6 pm, or Thursday 9 am-4 pm (EDT, UTC-04).