radiantearth / stac-spec

SpatioTemporal Asset Catalog specification - making geospatial assets openly searchable and crawlable
https://stacspec.org
Apache License 2.0

Best practices for static catalog filesystem layout #408

Closed mojodna closed 5 years ago

mojodna commented 5 years ago

This is a bit more general than #403, but helps that effort substantially.

  1. Root documents (catalogs / collections) should be at the root of a directory tree containing a static catalog (e.g. /home/seth/static-catalogs/sample/catalog.json)
  2. Catalogs should be named catalog.json (cf. index.html)
  3. Collections that are distinct from catalogs should be named collection.json
  4. Items should be named <id>.json
  5. Sub-catalogs and items should be stored in subdirectories of their parent (and only 1 subdirectory deeper than a document's parent) (e.g. .../sample/sub1/catalog.json) -- this means that each item and its assets are contained in a unique subdirectory
  6. ~Items should be stored in the same directory as their parent catalog (e.g. .../sample/sub1/item1.json when a part of the sub1 catalog)~
  7. Reverse relative links (name, implementation, necessity TBD) should not start with ../ ((5) should satisfy this)
  8. Limit the number of items in a catalog or sub-catalog, grouping / partitioning as relevant to the dataset
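
A minimal sketch of how recommendations (1) to (5) could map onto file paths. The helper names (`catalog_path`, `collection_path`, `item_path`) are hypothetical and not part of the spec or any STAC library, and naming each item's subdirectory after its id is an assumption made only for illustration.

```python
from pathlib import PurePosixPath


def catalog_path(root, *subcatalogs):
    """Catalog document: one directory level per sub-catalog, named catalog.json."""
    return PurePosixPath(root, *subcatalogs) / "catalog.json"


def collection_path(root, *subcatalogs):
    """Collection that is distinct from a catalog: named collection.json."""
    return PurePosixPath(root, *subcatalogs) / "collection.json"


def item_path(root, item_id, *subcatalogs):
    """Item: its own subdirectory one level below its parent, named <id>.json.

    Assumption: the subdirectory is named after the item id.
    """
    return PurePosixPath(root, *subcatalogs, item_id) / f"{item_id}.json"


root = "/home/seth/static-catalogs/sample"
print(catalog_path(root))                # /home/seth/static-catalogs/sample/catalog.json
print(catalog_path(root, "sub1"))        # /home/seth/static-catalogs/sample/sub1/catalog.json
print(item_path(root, "item1", "sub1"))  # /home/seth/static-catalogs/sample/sub1/item1/item1.json
```
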

m-mohr commented 5 years ago

I like that, but not sure whether I like the reverse links.

A note regarding (4): This could sometimes lead to name conflicts, I guess. If the asset is a GeoJSON file called abc.json and its id is abc, then the item file would also be named abc.json, and that wouldn't work because you can't have two files with the same name in a folder. We'd need a prefix or suffix for our items, maybe abc.item.json or so. @cholmes
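
The conflict described above can be illustrated with a small check. This is only a sketch; `item_filename_collides` is a hypothetical helper, not something defined by the spec.

```python
def item_filename_collides(item_id, asset_filenames):
    """Return True if the item's <id>.json name matches an asset filename."""
    return f"{item_id}.json" in set(asset_filenames)


print(item_filename_collides("abc", ["abc.json", "abc.tif"]))  # True
print(item_filename_collides("abc", ["abc_rgb.tif"]))          # False
```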

cholmes commented 5 years ago

Sounds great, I like the idea of recommendations. @mojodna - if these are followed will that enable stac browser to make nice URLs with the performance of the encoded stuff now?

I'm comfortable with us 'taking' .json. Especially since this seems like a 'best practice' / recommendation thing. Spec does not say you have to name it that, and so if you have a good reason to not name it that it's ok.

I also don't love the notion of using STAC to just refer to a JSON document. Imho you should be using a WFS, and refer to that service. Like a set of JSON features should be in a collection, not in an item. There are cases where I think it's ok, like the 'training data extension', where the 'asset' is the geotiff plus the geojson. But in that case I think the training geojson doesn't need to 'take' the .json - it can easily be renamed.

matthewhanson commented 5 years ago

Great ideas @mojodna

1-2: :+1: I named catalogs catalog.json for Landsat and Sentinel but collections are also catalog.json. I like the idea of naming them collection.json

(4) I'm not too worried about @m-mohr's issue, as I think in most cases assets should have a suffix indicating the asset key, e.g., _rgb.tif for the asset key rgb. It does affect those cases where someone already has existing data and wants to put STAC metadata alongside those items, but in those cases they can just choose to disregard this recommendation. Perhaps a secondary recommendation of calling it item.json if you don't want to use <id>.json would satisfy @m-mohr's concern.

(5) Yes, sub-catalogs as sub-directories.

(6) This is the one that I disagree on. What I have done is that any STAC entity (catalog, collection, or item) is always one directory below its parent. So a catalog containing items will point to one directory lower so that each STAC Item has its own directory. This is because of the case of STAC metadata being alongside the data, which I think will be more and more the case. In those cases each STAC item will have multiple assets and it's better to have those in their own directory rather than alongside all the assets from all the other items in the catalog. Also, this means that a parent of any STAC entity is always one directory up, no exception for Items. In the case of Landsat there are path and row subcatalogs and the row catalog contains links to the items which are <date>/<id>.json

(7) Looking at my catalogs I'm certainly no fan of ../../../../../catalog.json so this seems reasonable but I've still got to read through that issue again and ponder.
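
A rough sketch of why the "always one directory below its parent" rule keeps parent links short. `relative_href` is a hypothetical helper and the sample paths are assumptions; only the standard-library `posixpath` module is used.

```python
import posixpath


def relative_href(from_document, to_document):
    """Relative link from one STAC document to another, based on their paths."""
    return posixpath.relpath(to_document, start=posixpath.dirname(from_document))


item = "sample/sub1/item1/item1.json"
print(relative_href(item, "sample/sub1/catalog.json"))  # ../catalog.json
print(relative_href(item, "sample/catalog.json"))       # ../../catalog.json
print(relative_href("sample/sub1/catalog.json", item))  # item1/item1.json
```

With this layout the link to the parent is always `../catalog.json`, while a link to the root grows by one `../` per level of nesting.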

mojodna commented 5 years ago

> if these are followed will that enable stac browser to make nice URLs with the performance of the encoded stuff now?

It should, yes. @matthewhanson's alternate take on (6) complicates things somewhat; we can't do both and have a single rule for resolving slugs to URLs.

(I agree with his take on (6). I updated the ISERV catalog using my version and it made sense because the actual assets are in a different bucket. @matthewhanson can you take a stab at re-phrasing that item?)

I don't know of any consumers that assume that IDs match filenames, it's more a convenience for catalog management. If they don't, no big deal.

fredliporace commented 5 years ago

@mojodna The CBERS implementation follows all proposed items except (3) and (7).

(3), I have a 'collections' subdir from root with all collections, and call each collection using COLLECTION_ID_collection.json, for instance CBERS_4_AWFI_collection.json. I did it that way because it was the easier path to extend from 0.5 to 0.6, I simply kept the 0.5 directory structure I was using before and included the link to the appropriate collection from the item.

I agree with @matthewhanson regarding (6) for the case where metadata and data are within the same storage, but when that is not the case the requirement of having one subdir for each item would end in a lot of directories with a single file. Maybe this could be a recommendation that changes depending on whether the data and metadata are stored alongside each other.

mojodna commented 5 years ago

> having one subdir for each item would end in a lot of directories with a single file

Probably better than a single directory with too many files. I'll update the issue body.

CBERS' implementation of (3) would require custom resolution rules for collections (for the goal of generating more legible URLs), so maybe that's something to consider when porting the catalog to 0.7.0. In the meantime, I'm fine writing those rules (since they're just for collections and the implementation is consistent).

(7) remains very much up for debate, so that's definitely the loosest recommendation of them all.

mojodna commented 5 years ago

Actually, if items are to be stored in subdirectories, perhaps they should be named item.json a) for consistency, b) to avoid more common collisions, and c) to reduce redundancy (.../<id>/<id>.json)

cholmes commented 5 years ago

Yeah, I think there will be some implementations where it'll feel like overkill to have folders for just an item and a single asset, but I think it'll be worse to have a single folder where each item has a lot of assets. So if we just pick one I'd lean towards one item per folder. And then, yes, agree the recommendation should be that the folder name is <id> and the item is item.json, which also then addresses Matthias's concern.

Happy to update my little planet stac catalog to be in line with the recommendations.

fredliporace commented 5 years ago

> (4) I'm not too worried about @m-mohr's issue, as I think in most cases assets should have a suffix indicating the asset key, e.g., _rgb.tif for the asset key rgb. It does affect those cases where someone already has existing data and wants to put STAC metadata alongside those items, but in those cases they can just choose to disregard this recommendation. Perhaps a secondary recommendation of calling it item.json if you don't want to use <id>.json would satisfy @m-mohr's concern.

I prefer using <id>.json. Use cases such as sending a couple of files as attachments are much simpler this way - imagine receiving 10 files named item.json and needing to find a particular item. If there is a potential clash between assets and ids, it is always possible to define the STAC id in a way that avoids the clash, such as prefixing the id with "STAC_".

fredliporace commented 5 years ago

> Yeah, I think there will be some implementations where it'll feel like overkill to have folders for just an item and a single asset, but I think it'll be worse to have a single folder where each item has a lot of assets. So if we just pick one I'd lean towards one item per folder. And then, yes, agree the recommendation should be that the folder name is <id> and the item is item.json, which also then addresses Matthias's concern.

Some filesystems have limits on the number of files and directories that they may hold. I'm almost sure that, for ext3 for instance, having one directory per item, each holding a single file, doubles the required number of inodes compared to keeping the items in the same directory. This may be a problem when we are in the ballpark of millions of items.
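
As a back-of-the-envelope sketch of that inode arithmetic: it assumes one inode per file and one per directory, and it ignores assets and any parent directories. The function names are hypothetical.

```python
def inodes_one_dir_per_item(n_items):
    """Each item costs one directory inode plus one file inode."""
    return 2 * n_items


def inodes_shared_dir(n_items):
    """All item files share a single directory."""
    return n_items + 1


n = 1_000_000
print(inodes_one_dir_per_item(n))  # 2000000
print(inodes_shared_dir(n))        # 1000001
```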

Why not define as best practice to limit the number of files within each subdir? This is more flexible and accommodates both situations - if we have the metadata and assets together we create subdirs to limit the number of files, but if we don't have this situation we may place the items within the same subdir.

This also guides the developer to partition the subcatalogs so that the tree is balanced and not too 'wide'. Imagine, for instance, a single catalog with each item as a child in its own subdir. This complies with the recommendation as it is defined right now, but I don't think it is a best practice; the subcatalog structure should partition the data in a more sensible way, evenly distributing the files.

mojodna commented 5 years ago

I added clarifying language above: "this means that each item and its assets are contained in a unique subdirectory" (yes, inodes. I've been there. sigh)

> if we have the metadata and assets together we create subdirs to limit the number of files, but if we don't have this situation we may place the items within the same subdir.

I appreciate this intent, but am having trouble imagining resolution rules (for STAC Browser) that would be able to infer the difference in structure...

cholmes commented 5 years ago

> Why not define as best practice to limit the number of files within each subdir?

+1 - I think this is a good idea from a 'human browser usability' perspective. Going to a sub-catalog that has 100,000 items in it doesn't make for good browsing / discoverability. Definitely keep it best practice, but I think it's a good one to have on the list.

matthewhanson commented 5 years ago

I agree with balancing the number of files in a directory.

So if item metadata is alongside the data, this means each Item is likely in its own subdirectory.

If the data is located somewhere else, then the items should appear at the same level as the catalog.

Note that providers need to take care to create sub-catalogs though so that any single sub-catalog doesn't contain thousands of items.

It would also be useful to include the most common scenario we see for global multi-temporal data. Sub-divide the data by a series of regions first ("column", then "row", or some), then by date.
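
A minimal sketch of that region-then-date partitioning, assuming each item record carries `path`, `row`, `date`, and `id` fields. The field names, helper names, and sample records are all made up for illustration.

```python
from collections import defaultdict


def item_href(item):
    """Path of an item under region (path, then row) and then date directories."""
    return f"{item['path']}/{item['row']}/{item['date']}/{item['id']}.json"


def group_by_subcatalog(items):
    """Map each path/row sub-catalog to the relative item links it would hold."""
    groups = defaultdict(list)
    for item in items:
        subcatalog = f"{item['path']}/{item['row']}"
        groups[subcatalog].append(f"{item['date']}/{item['id']}.json")
    return dict(groups)


# Made-up records, for illustration only.
items = [
    {"id": "scene-a", "path": "010", "row": "020", "date": "2019-01-01"},
    {"id": "scene-b", "path": "010", "row": "020", "date": "2019-01-17"},
    {"id": "scene-c", "path": "011", "row": "020", "date": "2019-01-02"},
]
print(item_href(items[0]))  # 010/020/2019-01-01/scene-a.json
print(group_by_subcatalog(items))
```

Counting the entries per sub-catalog in the result is a quick way to spot partitions that hold too many items.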

cholmes commented 5 years ago

closed with #428