Closed: sandhujasmine closed this issue 1 year ago
I'd assume that hosting the data in `./data` is needed to make the project self-contained; projects are meant to have everything in one directory so that they can be deleted cleanly rather than leaving files strewn across the file system, particularly for potentially large datasets and environments. Having it in `./data` may also be important for certain workflows we've implemented; not sure.
Here's another error that applies to a local `catalog.yml` but not to one that is remotely hosted.

Both of these are not checked if the `catalog.yml` is hosted on s3. I verified this so I could do the rest of the testing by hosting it on a temporary s3 location. This will move to s3 once I know where to put it, but I still question whether it makes sense to reduce/remove some of these constraints. If it's a requirement, then I would expect it to be consistently applied to local `catalog.yml` files and to the ones hosted on s3. I understand this is most likely because it's easy to check for a local `catalog.yml`. I also don't see an issue with letting intake use the default cache, or letting the end user customize it if desired.
The reason for constraining `INTAKE_CACHE_DIR` to be set to the local `data` folder is that this was the recommended, but not enforced, rule before I overhauled the infrastructure. See https://examples.pyviz.org/make_project.html#if-using-intake-optional
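In practice that recommendation boils down to something like the following sketch (illustrative only; it assumes the project root is the current working directory and that the variable is set before `intake` is imported):

```python
import os

# Point intake's cache at the project's local data folder instead of the
# default ~/.intake/cache, keeping all downloaded files inside the project.
os.environ["INTAKE_CACHE_DIR"] = os.path.join(os.getcwd(), "data")

print(os.environ["INTAKE_CACHE_DIR"])
```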
I made it a hard constraint as part of that overhaul. Constraints can and should be relaxed if they get in our way, but I don't feel this one falls into that category?
> Both of these are not checked if the catalog.yml is hosted on s3.
As far as I can see (quick check), the landuse classification example you're updating is the only one that is pointing to a remote catalog. All the other examples that leverage intake seem to be pointing to a local catalog. I believe good practice for us is to have:
I believe 1. and 2. can be achieved; the project you're updating is the only one not having a test catalog, and its catalog looks like it could just be saved locally:
```yaml
sources:
  UCMerced_LandUse_all:
    description: Labeled images from UCMerced_LandUse/Images
    origin: http://weegee.vision.ucmerced.edu/datasets/landuse.html
    driver: xarray_image
    cache:
      - argkey: urlpath
        regex: 'earth-data/UCMerced_LandUse'
        type: file
    args:
      urlpath: 's3://earth-data/UCMerced_LandUse/Images/{landuse}/{}{id:2d}.tif'
      storage_options: {'anon': True}
      concat_dim: [id, landuse]
      coerce_shape: [256, 256]

  UCMerced_LandUse_by_landuse:
    description: All images matching given landuse from UCMerced_LandUse/Image.
    origin: http://weegee.vision.ucmerced.edu/datasets/landuse.html
    driver: xarray_image
    cache:
      - argkey: urlpath
        regex: 'earth-data/UCMerced_LandUse'
        type: file
    parameters:
      landuse:
        description: which landuse to collect
        type: str
        default: airplane
    args:
      urlpath: 's3://earth-data/UCMerced_LandUse/Images/{{ landuse }}/{{ landuse }}{id:2d}.tif'
      storage_options: {'anon': True}
      concat_dim: id
      coerce_shape: [256, 256]

  UCMerced_LandUse_by_image:
    description: Image matching given landuse and id from UCMerced_LandUse/Image.
    origin: http://weegee.vision.ucmerced.edu/datasets/landuse.html
    driver: xarray_image
    cache:
      - argkey: urlpath
        regex: 'earth-data/UCMerced_LandUse'
        type: file
    parameters:
      landuse:
        description: which landuse to collect
        type: str
        default: airplane
      id:
        description: which id to collect
        type: int
        default: 0
    args:
      urlpath: "s3://earth-data/UCMerced_LandUse/Images/{{ landuse }}/{{ landuse }}{{ '%02d' % id }}.tif"
      storage_options: {'anon': True}
```
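As a sanity check on the templated `urlpath` in the last source, here's how the `{{ '%02d' % id }}` expression renders for the default parameters. This is plain Python mimicking the Jinja2 expression, not intake itself:

```python
# Mimic the Jinja2 template in UCMerced_LandUse_by_image for the default
# parameters (landuse='airplane', id=0); intake renders it the same way.
landuse = "airplane"
img_id = 0
urlpath = (
    "s3://earth-data/UCMerced_LandUse/Images/"
    f"{landuse}/{landuse}{'%02d' % img_id}.tif"
)
print(urlpath)
# s3://earth-data/UCMerced_LandUse/Images/airplane/airplane00.tif
```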
However, 3. is a bit of a different story. You may have seen that the `test_data` folder is not an example but contains datasets that are used for testing the projects. There's a very practical reason for relying on test data: some projects rely on datasets that are difficult to get (e.g. very large), and it would be impractical to test them in a CI job, yet you still want to test them, so instead you rely on smaller datasets with which the notebooks/apps still run. Another reason I like having test data, and one I think is even stronger than the first, is that it effectively forces us to collect samples of the original dataset. That means that if the dataset is no longer available (which might happen at some point, who knows!), we still have a way to run the project, and we'd be in a much better position to find a replacement dataset.
Three examples have defined test data when they leverage `intake`:
I quite like how `heat_and_trees` approaches that, by saving samples of the original data locally (I think). `walker_lake` approaches it differently, by modifying the catalog to download smaller chunks, which is practical for testing purposes but doesn't cover us if the dataset goes away.
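The first approach can be sketched roughly as follows. This is a hypothetical helper, not code from any of these projects; the paths, file extension, and the `per_class` count are illustrative assumptions:

```python
import itertools
import pathlib
import shutil

def make_test_sample(src: str, dst: str, per_class: int = 2) -> None:
    """Copy the first few images of each class from a full local dataset
    into a small test_data folder that can be committed with the project."""
    for class_dir in sorted(pathlib.Path(src).iterdir()):
        if not class_dir.is_dir():
            continue
        out = pathlib.Path(dst) / class_dir.name
        out.mkdir(parents=True, exist_ok=True)
        # Take only the first `per_class` images, keeping the sample tiny.
        for img in itertools.islice(sorted(class_dir.glob("*.tif")), per_class):
            shutil.copy(img, out / img.name)
```

A test catalog would then point at `dst` instead of the s3 urlpath, so CI never touches the full dataset.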
In addition to landuse_classification, 6 projects rely on `intake`:
```
./datashader_dashboard/catalog.yml
./seattle_lidar/catalog.yml
./carbon_flux/catalog.yml
./walker_lake/catalog.yml
./landsat_clustering/catalog.yml
./heat_and_trees/catalog.yml
```
`seattle_lidar`, `carbon_flux` and `walker_lake` are the only ones from that list that do not have test data.
Given that not many projects would need to be updated with respect to their catalog test data, I'm in favor of keeping the constraint. We can see how it goes when we update these projects. Git isn't ideal for storing data, and maybe for some of these projects it's just not possible to come up with test data that is small enough; in that case we can, or even should, host it on HoloViz's S3.
Thank you @maximlt - all of these are valid arguments for the hard testing constraints. As you point out, the landuse example is the only one skipping these tests by hosting the catalog remotely; that should be corrected by checking in the catalog alongside the notebooks. Keeping the catalog with the notebooks is a good general rule.
For any new examples, the best practice should be to do the same so the tests and constraints are consistently applied to everything.
Closing this issue, but I will review our docs and see whether an update is needed based on your thoughtful response above. It would be good to add this to our documentation somewhere if it isn't already there.
I checked in an updated `catalog.yml` locally to get the landuse_classification example updated and working. This raised the following CI error:

If the `catalog.yml` is hosted on s3, this error is not raised. Looking at `dodo.py`, `has_intake_catalog` checks whether there is a local `catalog.yml`: https://github.com/holoviz-topics/examples/blob/main/dodo.py#L1210-L1217

The data itself is on s3, and by default intake caches the data in `~/.intake/cache`: https://intake.readthedocs.io/en/latest/catalog.html#caching-source-files-locally

Two questions:
1. The tests require the intake cache to be in `./data`, but why would we require this?
2. These checks only run against a local `catalog.yml` - shouldn't they also apply to a catalog file hosted remotely as well?

I'm primarily asking / trying to understand why we require a custom cache location.
Adding the following passes the test, so I will do that for now in any case.