Bulk catalog access option: access to all datasets in a single file #7

rufuspollock commented 12 years ago

This proposes a substantive change to the DCIP spec. Key features:

- Provision of all datasets in a single file
- Format would be a simple list of datasets, with each dataset serialized as in DCIP
  - with a default of JSON but options for n3 etc.
- Location likely specified by a meta field in head, as per API location

This option could be provided either in addition to or as a substitute for the full API option.
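
A minimal sketch of how an operator might produce such a file and how a consumer might read it, assuming JSON serialization. The function names, the file path, and the dataset fields used in any call are hypothetical placeholders, not part of the DCIP spec:

```python
import json
from pathlib import Path


def write_bulk_catalog(datasets, path):
    """Operator side: write every dataset record to one JSON file as a plain list."""
    Path(path).write_text(json.dumps(list(datasets), indent=2), encoding="utf-8")


def read_bulk_catalog(path):
    """Consumer side: load the whole catalog in one go and index it by dataset id."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return {record["id"]: record for record in records}
```

The consumer never walks an API here: one read yields every dataset, at the cost of re-reading the whole file on every sync.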

Benefits:

Catalog operators:
- simpler and easier to do than a full API. Very easy to get started.
Consumers:
- All datasets in one go - no need to walk through the API

Possible problems:

Catalog operators:
- For larger catalogs the file is very large. Inefficient for creation, storage, and transmission.
Consumers:
- File could be large if catalog is large.
- Must fetch the whole file even if only one dataset has changed

To discuss:

- Sign-posting to this file
- Relationship to REST option (is this entirely orthogonal?)

willpugh commented 12 years ago

I think it would actually be nice if there could be two levels of compliance, where the single file download looked like a valid REST endpoint, but with reduced functionality. One level of compliance would be built with Catalogs in mind, and one built mainly with Data Sources in mind. The bulk catalog operations you mention would mainly be targeted at data sources or very small catalogs.

So, suppose it were structured so that the dataset endpoint returned the JSON of the catalog up to the first 1000 records, with full results for each record (so a sync would not require getting the list and then doing a full round trip for each dataset).

Then any catalog with fewer than 1000 datasets would be able to provide the simpler access. Any catalog providing many datasets would be required to meet a higher compliance level that provides paging, query by change date, etc.
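
A rough sketch of how the two levels might be told apart, assuming the 1000-record threshold mentioned above. The level names "basic" and "full" and both function names are hypothetical; nothing here is defined by the spec:

```python
BASIC_LIMIT = 1000  # threshold taken from the comment above


def required_compliance(dataset_count, limit=BASIC_LIMIT):
    """Return the (hypothetical) compliance level a catalog of this size would need."""
    return "basic" if dataset_count < limit else "full"


def basic_dataset_listing(datasets, limit=BASIC_LIMIT):
    """Basic level: return full records for up to `limit` datasets, with no paging."""
    return list(datasets)[:limit]
```

A catalog at the basic level could serve the result of `basic_dataset_listing` as a static file, since it takes no query parameters.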

tgherzog commented 12 years ago

My initial reaction is that from a consumer standpoint the important thing is to have a consistent protocol across all catalogs that implement DCIP, regardless of the size of the catalog.

One approach might be to standardize the variable names used in the responses from the List Dataset API and the Dataset API.

- Both APIs currently implement the "id" field consistently
- The "revision" field in the List Data API is named "version" in the Dataset API
- The "modified" field in the List Data API looks like it's named "metadata_modified" in the Dataset API (are the meanings consistent?)
- The "change_type" and "url" fields from the List Data API are missing from the Dataset API, but could be included in the latter. Perhaps "url" should then be renamed to be less ambiguous.
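
Until the names are aligned, a consumer could bridge the two responses with a small rename step. This is only a sketch: it assumes the List Data API names win, and it assumes "metadata_modified" and "modified" mean the same thing, which is exactly the open question above. The function name is hypothetical:

```python
# Dataset API name -> List Data API name, as listed above.
DATASET_TO_LIST_NAMES = {
    "version": "revision",
    "metadata_modified": "modified",
}


def normalize_dataset(record):
    """Return a copy of a Dataset API record using List Data API field names."""
    return {DATASET_TO_LIST_NAMES.get(key, key): value for key, value in record.items()}
```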

If you standardized fields in this manner, then the catalog could include as many "full dataset" fields as it wanted (beyond the required ones) in the bulk catalog (aka List Data API) listing. For example:

[
  {
    id: "123",
    revision: "1",
    url: "http://data.worldbank.org/catalog/123.json",
    modified: "2012-06-01",
    change_type: "update",
    title: "123 Data",
    publisher: "http://www.worldbank.org", // um, literal or resource here?
    // etc
  },
]

In other words, the bulk catalog and the List Data API are the same, from the consumer's standpoint.

In practice, different catalogs would have the discretion of publishing either complete, nearly complete, or sparse datasets in the bulk catalog, depending on their respective implementations.

Consumers would start by accessing the bulk catalog listing, and request any missing fields via the "url" field.
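
As a sketch of that consumer flow: the listing entry is used as-is when it already carries the wanted fields, and the "url" is followed only otherwise. `complete_record` and `fetch_dataset` are hypothetical names; `fetch_dataset` stands in for whatever HTTP client the consumer uses and is assumed to return the full dataset as a dict:

```python
def complete_record(entry, wanted_fields, fetch_dataset):
    """Fill in fields missing from a bulk listing entry.

    `fetch_dataset` is a caller-supplied function that takes the entry's
    "url" and returns the full dataset record as a dict.
    """
    missing = [field for field in wanted_fields if field not in entry]
    if not missing:
        return dict(entry)  # listing was already complete; no round trip
    full = fetch_dataset(entry["url"])
    merged = dict(entry)
    for field in missing:
        if field in full:
            merged[field] = full[field]
    return merged
```

With a complete bulk catalog this never calls `fetch_dataset`; with a sparse one it costs one request per incomplete entry.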

rufuspollock commented 12 years ago

@willpugh nice suggestion. I guess we still need a way to signal your level of compliance?

willpugh commented 11 years ago

I like tgherzog's suggestions. I think consistency between the listing APIs and the Dataset API is a good thing in general, and makes this case easier.

There are 3 reasonable suggestions here:

1) The caller just "figures it out" by following tgherzog's approach; in the case that they cannot reference an ID directly, they only index what was in the list page.
2) The Catalog Entity gets more fleshed out, and the compliance level is declared there. This entity could exist as an endpoint that could be referenced as a file as well.
3) This could be in the <head> of the homepage as well, e.g.

<meta content="dcip-basic-rest-compliance" value="minimum" />

or

I think #3 seems more elegant.
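
For illustration, a consumer could read such a tag with Python's standard `html.parser`. This sketch matches the tag exactly as written in the example above (the marker in `content`, the level in `value`); that attribute layout is only what this thread proposes, not a settled convention, and the class and function names are hypothetical:

```python
from html.parser import HTMLParser


class ComplianceMetaParser(HTMLParser):
    """Collect the compliance level from a <meta> tag shaped like the example above."""

    def __init__(self):
        super().__init__()
        self.level = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("content") == "dcip-basic-rest-compliance":
            self.level = attributes.get("value")


def compliance_level(homepage_html):
    """Return the declared compliance level, or None if the page declares none."""
    parser = ComplianceMetaParser()
    parser.feed(homepage_html)
    return parser.level
```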

okfn / data-catalog-spec
