okfn / data-catalog-spec

Data Catalog Specification (Schema and Protocol)
http://spec.datacatalogs.org/

Bulk catalog access option: access to all datasets in a single file #7

Open rufuspollock opened 12 years ago

rufuspollock commented 12 years ago

This proposes a substantive change to the DCIP spec. Key features

This option could be provided either in addition to or as a substitute for the full API option.

Benefits:

Possible problems:

willpugh commented 12 years ago

I think it would actually be nice if there could be two levels of compliance, where the single file download looked like a valid REST endpoint, but with reduced functionality. One level of compliance would be built with Catalogs in mind, and one built mainly with Data Sources in mind. The bulk catalog operations you mention would mainly be targeted at data sources or very small catalogs.

So, the dataset endpoint could be structured to return the JSON of the catalog up to the first 1000 records, with full results for each record, so that a sync would not require getting the list and then doing a full round trip for each dataset.

Then any catalog with fewer than 1000 datasets would be able to provide the simpler access. Any catalog providing many datasets would be required to meet a higher compliance level that offers paging, query by change date, etc.
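
A minimal sketch of the two-level idea, assuming an in-memory list of dataset records. The 1000-record cap comes from the comment above; the function name `list_datasets`, the `compliance` key and the level names `basic` and `full` are hypothetical and not part of the DCIP spec.

```python
BULK_LIMIT = 1000  # cap suggested in the comment above


def list_datasets(records, offset=0, limit=BULK_LIMIT):
    """Return one page of full dataset records plus a compliance hint.

    Hypothetical handler: `records` is a list of dicts, one per dataset.
    """
    limit = min(limit, BULK_LIMIT)  # never return more than the cap
    page = records[offset:offset + limit]
    # A catalog that fits in one response only needs the simple level;
    # a larger one must also support paging (and change-date queries).
    level = "basic" if len(records) <= BULK_LIMIT else "full"
    return {
        "compliance": level,  # assumed field name
        "total": len(records),
        "datasets": page,
    }
```

With this shape a small catalog can serve a single static file, while a consumer of a larger catalog would keep calling with an increasing `offset` until it has seen `total` records.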

tgherzog commented 12 years ago

My initial reaction is that from a consumer standpoint the important thing is to have a consistent protocol across all catalogs that implement DCIP, regardless of the size of the catalog.

One approach might be to standardize the variable names used in the responses from the List Dataset API and the Dataset API.

If you standardized fields in this manner, then the catalog could include as many "full dataset" fields as it wanted (beyond the required ones) in the bulk catalog (aka List Data API) listing. For example:

[
  {
    id: "123",
    revision: "1",
    url: "http://data.worldbank.org/catalog/123.json",
    modified: "2012-06-01",
    change_type: "update",
    title: "123 Data",
    publisher: "http://www.worldbank.org", // um, literal or resource here?
    // etc
  },
]

In other words, the bulk catalog and the List Data API are the same, from the consumer's standpoint.

In practice, different catalogs would have the discretion of publishing either complete, nearly complete, or sparse datasets in the bulk catalog, depending on their respective implementations.

Consumers would start by accessing the bulk catalog listing, and request any missing fields via the "url" field.
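
The consumer side of this could be sketched roughly as follows. The field names come from the example listing above; the `REQUIRED_FIELDS` tuple, the function name `complete_record` and the injected `fetch_json` callable are assumptions made for illustration, not anything defined by DCIP.

```python
REQUIRED_FIELDS = ("id", "title", "publisher")  # assumed; not from the spec


def complete_record(entry, fetch_json, required=REQUIRED_FIELDS):
    """Fill in fields missing from a bulk-listing entry.

    `fetch_json` is any callable that takes a URL and returns a dict,
    so the sketch makes no assumption about the HTTP client used.
    """
    missing = [name for name in required if name not in entry]
    if not missing:
        return dict(entry)  # listing was already complete: no round trip
    full = fetch_json(entry["url"])  # follow the entry's "url" field
    merged = dict(entry)
    for name in missing:
        if name in full:
            merged[name] = full[name]
    return merged
```

A catalog that publishes complete records in the listing costs the consumer one request in total; a sparse listing costs one extra request per dataset, but the consumer code is the same in both cases.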

rufuspollock commented 12 years ago

@willpugh nice suggestion. I guess we still need a way to signal your level of compliance?

willpugh commented 11 years ago

I like tgherzog's suggestions. I think consistency between the listing APIs and the Dataset API is a good thing in general, and makes this case easier.

There are 3 reasonable options here for signalling the compliance level:
1) The caller just "figures it out" by following tgherzog's approach; if they cannot reference an ID directly, they index only what was in the list page.
2) The Catalog entity gets more fleshed out, and the compliance level is declared there. This entity could exist as an endpoint that could be referenced as a file as well.
3) The compliance level could be declared in the <head> of the homepage as well, e.g.

<meta content="dcip-basic-rest-compliance" value="minimum" />

or

I think #3 seems more elegant.
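
For option 3, a consumer could detect the declared level with something like the sketch below. It uses Python's standard `html.parser` module and looks for the exact attribute names shown in the meta tag above (`content` and `value`); the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ComplianceFinder(HTMLParser):
    """Collect the `value` of the dcip-basic-rest-compliance meta tag."""

    def __init__(self):
        super().__init__()
        self.level = None

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("content") == "dcip-basic-rest-compliance":
            self.level = attrs.get("value")


def compliance_level(html):
    """Return the declared compliance level, or None if none is declared."""
    finder = ComplianceFinder()
    finder.feed(html)
    return finder.level
```

A consumer would fetch the catalog homepage once, call `compliance_level` on the markup, and treat a result of `None` as "no level declared".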