holoviz-topics / examples

Visualization-focused examples of using HoloViz for specific topics
https://examples.holoviz.org
Creative Commons Attribution 4.0 International

Is Intake catalog check in `validate_data_sources` too strict? #277

Closed sandhujasmine closed 1 year ago

sandhujasmine commented 1 year ago

I checked in an updated catalog.yml locally to get the landuse_classification example updated and working. This raised the following CI error:

% doit validate_data_sources:landuse_classification
.  validate_data_sources:landuse_classification
WARNING: The project has an Intake catalog, it must declare the variable INTAKE_CACHE_DIR in its anaconda-project.yml file and set it to "data".

If the catalog.yml is hosted on S3, this error is not raised. Looking at dodo.py, the has_intake_catalog check only looks for a local catalog.yml: https://github.com/holoviz-topics/examples/blob/main/dodo.py#L1210-L1217
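A rough sketch of what that check amounts to (the actual code is at the linked dodo.py lines and may differ in detail): the project only counts as "having an Intake catalog" when catalog.yml exists on disk, which is why a catalog hosted on S3 slips past the validation entirely.

```python
from pathlib import Path
import tempfile

def has_intake_catalog(project_dir: str) -> bool:
    # Sketch of the dodo.py check: a project "has an Intake catalog"
    # only when catalog.yml exists locally, so remotely hosted catalogs
    # are never seen by validate_data_sources.
    return (Path(project_dir) / "catalog.yml").exists()

# Quick demonstration in a throwaway directory:
project = Path(tempfile.mkdtemp()) / "landuse_classification"
project.mkdir()
before = has_intake_catalog(str(project))            # no catalog.yml yet
(project / "catalog.yml").write_text("sources: {}\n")
after = has_intake_catalog(str(project))             # local catalog detected
```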

The data itself is on S3, and by default intake caches it in ~/.intake/cache: https://intake.readthedocs.io/en/latest/catalog.html#caching-source-files-locally
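For reference, intake honours the INTAKE_CACHE_DIR environment variable (which is what the anaconda-project.yml variable ends up setting) and otherwise falls back to the default under the home directory. A minimal sketch of that lookup:

```python
import os

def resolve_cache_dir() -> str:
    # Sketch of intake's cache-location resolution: the INTAKE_CACHE_DIR
    # environment variable wins; otherwise the default ~/.intake/cache is used.
    return os.environ.get(
        "INTAKE_CACHE_DIR",
        os.path.join(os.path.expanduser("~"), ".intake", "cache"),
    )

# Setting the variable, as anaconda-project does for its declared variables,
# redirects the cache into the project's own data folder:
os.environ["INTAKE_CACHE_DIR"] = "data"
project_cache_dir = resolve_cache_dir()
```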

Two questions:

  1. IIUC, INTAKE_CACHE_DIR will cache the data in ./data, but why do we require this?
  2. If this is required for a local catalog.yml, shouldn't it also apply to a remotely hosted catalog file?

I'm primarily asking because I'm trying to understand why we require a custom cache location.

Adding the following passes the test, so I'll do that for now in any case.

variables:
  INTAKE_CACHE_DIR: data
jbednar commented 1 year ago

I'd assume that keeping the data in ./data is needed to make the project self-contained; projects are meant to have everything in one directory so that they can be deleted cleanly rather than leaving files strewn across the file system, particularly for potentially large datasets and environments. Having it in ./data may also be important for certain workflows we've implemented; not sure.

sandhujasmine commented 1 year ago

Here's another error that applies to local catalog.yml but not one that is remotely hosted.

Neither of these is checked if the catalog.yml is hosted on S3. I verified this so I could do the rest of the testing by hosting the catalog on a temporary S3 bucket. It will move to its permanent S3 location once I know where to put it, but I still question whether it makes sense to reduce or remove some of these constraints.

If it's a requirement, then I'd expect it to be applied consistently, to local catalog.yml files and to ones hosted on S3 alike. I understand this is most likely because the check is easy to do for a local catalog.yml. I also don't see an issue with letting intake use its default cache, or letting the end user customize it if desired.
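One way the check could be made location-agnostic is to key it on whether the project uses intake at all, rather than on whether a local catalog.yml exists. A sketch (not the actual dodo.py code; the function name and dict shape are assumptions):

```python
def validate_intake_cache_dir(project_config: dict, uses_intake: bool) -> list:
    # Hypothetical validator: given the parsed anaconda-project.yml as a dict,
    # require INTAKE_CACHE_DIR == "data" for *any* project that uses intake,
    # whether its catalog.yml is local or hosted on S3.
    problems = []
    if uses_intake:
        cache_dir = project_config.get("variables", {}).get("INTAKE_CACHE_DIR")
        if cache_dir != "data":
            problems.append(
                'Projects using Intake must set INTAKE_CACHE_DIR to "data" '
                "in anaconda-project.yml"
            )
    return problems

ok = validate_intake_cache_dir({"variables": {"INTAKE_CACHE_DIR": "data"}}, True)
bad = validate_intake_cache_dir({"variables": {}}, True)
```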

maximlt commented 1 year ago

The reason for constraining INTAKE_CACHE_DIR to the local data folder is that this was the recommended, but not enforced, rule before I overhauled the infrastructure. See https://examples.pyviz.org/make_project.html#if-using-intake-optional

(screenshot of the referenced documentation)

I made it a hard constraint on that basis. Constraints can and should be relaxed if they get in our way, but I don't feel this one falls into that category?

> Both of these are not checked if the catalog.yml is hosted on s3.

As far as I can see (quick check), the landuse classification example you're updating is the only one pointing to a remote catalog; all the other examples that leverage intake seem to point to a local catalog. I believe good practice for us is to have:

  1. local catalogs as we'd be in a bad position if the catalog goes away (the data can go away too but at least the catalog would serve as a reference)
  2. local catalogs that users can easily inspect
  3. test catalogs that replace the original one during the test runs

I believe 1. and 2. can be achieved; the project you're updating is the only one without a local catalog, and its catalog looks like it could just be saved locally:

sources:
  UCMerced_LandUse_all:
    description: Labeled images from UCMerced_LandUse/Images
    origin: http://weegee.vision.ucmerced.edu/datasets/landuse.html
    driver: xarray_image
    cache:
      - argkey: urlpath
        regex: 'earth-data/UCMerced_LandUse'
        type: file
    args:
      urlpath: 's3://earth-data/UCMerced_LandUse/Images/{landuse}/{}{id:2d}.tif'
      storage_options: {'anon': True}
      concat_dim: [id, landuse]
      coerce_shape: [256, 256]

  UCMerced_LandUse_by_landuse:
    description: All images matching given landuse from UCMerced_LandUse/Image.
    origin: http://weegee.vision.ucmerced.edu/datasets/landuse.html
    driver: xarray_image
    cache:
      - argkey: urlpath
        regex: 'earth-data/UCMerced_LandUse'
        type: file
    parameters:
      landuse:
        description: which landuse to collect
        type: str
        default: airplane
    args:
      urlpath: 's3://earth-data/UCMerced_LandUse/Images/{{ landuse }}/{{ landuse }}{id:2d}.tif'
      storage_options: {'anon': True}
      concat_dim: id
      coerce_shape: [256, 256]

  UCMerced_LandUse_by_image:
    description: Image matching given landuse and id from UCMerced_LandUse/Image.
    origin: http://weegee.vision.ucmerced.edu/datasets/landuse.html
    driver: xarray_image
    cache:
      - argkey: urlpath
        regex: 'earth-data/UCMerced_LandUse'
        type: file
    parameters:
      landuse:
        description: which landuse to collect
        type: str
        default: airplane
      id:
        description: which id to collect
        type: int
        default: 0
    args:
      urlpath: "s3://earth-data/UCMerced_LandUse/Images/{{ landuse }}/{{ landuse }}{{ '%02d' % id }}.tif"
      storage_options: {'anon': True}

However, 3. is a bit of a different story. You may have seen that the test_data folder is not an example but contains datasets used for testing the projects. One very practical reason for relying on test data is that some projects depend on datasets that are difficult to get (e.g. very large), and it would be too impractical to test them in a CI job; you still want to test those projects, so instead you rely on smaller datasets with which the notebooks/apps still run. Another reason I like having test data, which I think is even stronger than the first, is that it more or less forces us to collect samples of the original dataset. That means that if the dataset is no longer available (which might happen at some point, who knows!), we still have a way to run the project and are in a much better position to find a replacement dataset.
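For landuse_classification, a test catalog could, hypothetically, reuse the same source names while pointing at a few locally stored sample images. The layout and paths below are assumptions for illustration, not the actual test_data contents:

```yaml
# Hypothetical test_data/landuse_classification/catalog.yml:
# same source name as the real catalog, but pointing at a handful
# of sample images checked in next to it.
sources:
  UCMerced_LandUse_all:
    description: Small local sample of UCMerced_LandUse/Images
    driver: xarray_image
    args:
      urlpath: 'test_data/landuse_classification/{landuse}/{}{id:2d}.tif'
      concat_dim: [id, landuse]
      coerce_shape: [256, 256]
```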

Three examples that leverage intake have defined test data.

I quite like how heat_and_trees approaches this, by saving local samples of the original data (I think). walker_lake approaches it differently, by modifying the catalog to download smaller chunks, which is practical for testing purposes but doesn't cover us if the dataset goes away.
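Mechanically, swapping the real catalog for a test one before a CI run only takes a few lines. A sketch, where the helper name and the test_data/<project>/catalog.yml layout are assumptions:

```python
import shutil
from pathlib import Path

def swap_in_test_catalog(project: str, test_data_root: str = "test_data") -> bool:
    # Hypothetical helper: before a CI run, replace the project's real
    # catalog.yml with the smaller one kept under test_data/<project>/,
    # so the notebooks run against sample data instead of the full dataset.
    test_catalog = Path(test_data_root) / project / "catalog.yml"
    if not test_catalog.exists():
        return False
    shutil.copy(test_catalog, Path(project) / "catalog.yml")
    return True
```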

In addition to landuse_classification, six projects rely on intake:

./datashader_dashboard/catalog.yml
./seattle_lidar/catalog.yml
./carbon_flux/catalog.yml
./walker_lake/catalog.yml
./landsat_clustering/catalog.yml
./heat_and_trees/catalog.yml

seattle_lidar, carbon_flux, and walker_lake are the only ones on that list without test data.

Given that not many projects would need their catalog test data updated, I'm in favor of keeping the constraint. We can see how it goes as we update these projects. Git isn't ideal for storing data, and for some of these projects it may just not be possible to come up with test data that is small enough; in that case we can, or even should, host the test data on HoloViz's S3.

sandhujasmine commented 1 year ago

Thank you @maximlt, all of these are valid arguments for the hard testing constraints. As you point out, the landuse example is the only one skipping these tests by hosting its catalog remotely; that should be corrected by checking the catalog in with the notebooks. Keeping the catalog with the notebooks is a good general rule.

For any new examples, the best practice should be the same, so that the tests and constraints are applied consistently everywhere.

Closing this issue, but I'll review our docs and see whether they need updating with the contents of your thoughtful response above; it would be good to capture this somewhere in our documentation if it isn't already there.