Guidance on defining collections to group datasets released as a series or in fragments

philipashlock commented 10 years ago

Data.gov has had the notion of a "collection" that can be used to group multiple datasets that would logically be considered a single dataset, but have been released in separate parts. The most common scenario for this is a series of release over time. In some cases a dataset may by published in monthly or yearly releases, but if the only thing that distinguishes these is date, then they should really be packaged as a single dataset. This also makes browsing simpler - it prevents many similar datasets from crowding out more unique ones. Some datasets might also be published by location, such as data relating to each state being released as a separate file. These should also be grouped together to appear as a single dataset.

Ideally agencies should package these all together as a single file/release before publishing, eg one file that is continuously updated is preferable to separate releases over time, but at the very least there should be a way to define this kind of packaged grouping at the metadata level as is currently the case on data.gov.

The way data.gov handles this is that the collection is essentially treated just as a normal dataset entry but it refers to many child entries. Something similar could be done with the data.json schema, but we would need to establish a convention for defining that parent/child relationship between entries.

Here's a current example of a collection on data.gov

View of the collection "parent" metadata: http://catalog.data.gov/dataset/tiger-line-shapefile-2010-series-information-file-for-the-2010-census-block-state-ba

View of all its "child" datasets: http://catalog.data.gov/dataset?collection_package_id=2a8b7f0b-1ae5-453c-ba56-996547266a63

cew821 commented 10 years ago

:+1: Yes this is totally needed. We have a number of datasets at Energy that we would like to consolidate into a single listing rather than have 100s of entries for each year x state.

It sounds like the parent/child "collection" concept used by data.gov is somewhat different than the entry/distribution of datasets currently used by the schema. Should the guidance direct people to put "children" datasets in the distribution array, or is something else needed?

dsmorgan77 commented 10 years ago

Collections are absolutely needed. I could argue either way on the ideal publication path (many of DOT's (current) data customers are states or cities who just aren't interested in downloading the entire Nation's data file and filtering out their information ... we should serve both).

The problems with this are that certain properties will need to filter down to the data file itself. A collection may have a temporal coverage of 1975-present, but an individual file may cover only a single year. A collection may have a geographic coverage of "United States" but a single file may have a geographic coverage of "Alabama." Download URLs will be on a file-by-file basis. Formats might change over time. Data dictionaries may change over time as data elements might be changed.

Clear examples where the collections concept is needed include:

I would highly recommend coordinating with the Federal statistical community on how collections might support them. Groups such as the Statistical Community of Practice and Engagement (SCOPE) will have helpful suggestions on how best to implement.

philipashlock commented 10 years ago

@dsmorgan77 the problems you described seemed to be well served by the model of having a child parent relationship where both the child and the parent would essentially be a fully qualified entry in a data.json. Does that answer your question @cew821?

So no, the children wouldn't be listed under the distribution. Instead the parent wouldn't list any distributions, but it might have a flag indicating that it's a collection, perhaps "collection":true or something and then each child would point to the parent with something like "collectionID":"uniqueid-12345"

Dan raises a good point about being able to be more precise about the temporal bounds for each individual file if something is released in fragments by region, but I think it would still be acceptable to package all those together as distribution under one dataset. This would still allow people to download individual files. You could argue that any dataset could be sliced into smaller pieces and create metadata for it, but some things are already logically packaged as a dataset, so I don't know that we really always need to create extra metadata for subsets like that.

cew821 commented 10 years ago

Yes that makes sense. I think your proposal for additional optional data field indicating the parent makes sense (at least on the child record). I would think you would want to use identifier as the foreign key for the collectionID field (or maybe parentID would be a better name?). I'm not sure you would need to add a field to the "parent" record, but I suppose having some indicator that there are children records to go look for could be useful.

Adopting this approach would also open the possibility for nested parent/child relationships, which I think should be fine (i.e. a record could both be a child of a parent record, and itself be the parent of children records).

lilybradley commented 10 years ago

This is great. I like how the SKOS core guide deals with the "collection" issue: http://www.w3.org/TR/2005/WD-swbp-skos-core-guide-20051102/ - using broader/narrower/associated relationships and meaningful collections of concepts.

Works well with the standards that POD schema already employ.

In case you haven't seen it, HHS is using these standards for its catalog data.jsonld implementation: "@context": { "rdfs": "http://www.w3.org/2000/01/rdf-schema#", "dcterms": "http://purl.org/dc/terms/", "dcat": "http://www.w3.org/ns/dcat#", "foaf": "http://xmlns.com/foaf/0.1/", "pod": "http://project-open-data.github.io/schema/2013-09-20_1.0#" },

raking08 commented 10 years ago

The parent child issue also presents a different problem for my agency. In our case the parent dataset is often not releasable (PII or other reasons) so only the "derivative" or child data set is released. We are still obligated to have the parent/internal only dataset in the directory. The child is often not even stored on the same system (we have hundreds of custom systems for collecting and storing datasets). We are planning on using the unique identifier of the parent as an attribute of the derivative (open) dataset, but are only now considering whether that's looks like a concatenation of the parent U.I + a serialization number or if its two UIs(either way displaying as a collection on data,gov seems straightforward) As our internal catalogue will be in the form of a database that will transmit a file in the correct format, I am wondering how best to construct it in order to minimize the record maintenance. If its a full record, then there will be much redundancy of input which is a source of errors and loss of linkage to say nothing of data steward resistance to double inputs. As we consider the parent child issues is this a concern of anyone? This is not a big deal if 100 datasets, but we have estimated 8-10,000. Has anyone sorted this out?

lilybradley commented 10 years ago

Ranking08, could we solve that kind of problem with entity relationship diagrams (that we could somehow translate into a flat file)? ERDs are a lot like linked data. Similar solution for different problems.

Also NIST is building the NIEM. Your agency might be a good candidate for them as they move into beta.

JoshData commented 10 years ago

+1 for collections.

In HHS's DMS, each dataset is optionally given a Group Name, which is a string. Any datasets with the same group name are organized together in search results. We can map group names to collections in our data.json file, but we don't have any additional metadata for the group itself so we'd have to make up metadata for the parent dataset.

I think it might be simpler to put collection: ['childID', 'childID'] on the parent, and then there's no need for is_collection: true on the parent.

raking08 commented 10 years ago

Thanks Josh.. one point in our case is the parent (for instance say it has PII) in most cases, will not be exposed on data.gov , or any other outward facing site, only the child... so the parent record will be in the internal "all dataset" catalogue and not in the public JSON file. Still, the children should all sort together .. so in this case the relationship should be maintained at the child record and not on the parent record... thoughts?

JoshData commented 10 years ago

@raking08: The parent could be a placeholder, I guess? It would be odd to have an identifier to something but the something isn't listed in the file. (Even if it has PII, it could still be listed but with accessLevel=restricted or nonpublic I think?)

raking08 commented 10 years ago

interesting thought.. but there is significant resistance to having any exposure of the parent ( consider the rest of the metadata that would be exposed) so I do not think listing the parent would be allowed, but I will pose that to security. But as long as the children would be exposed with their different tags for spatial and temporal and any other specifics then it should be OK.. the placeholder could be interesting if there were say 100 children, but you would still need to be able to drill into the placeholder to get to the specific dataset you wanted. How do you see that working?

cew821 commented 10 years ago

@JoshData I think the children need to have their "parentID" (i.e. the foreign key) in their record, since it's a one -> many relationship between parent and children, no? I suppose you could also include a list of all the children in a collection: [child1_id, child2_id] array in the parent record, as a convenience, but I would definitely expect the children records had a parent: parent_id field.

cew821 commented 10 years ago

@raking08 I agree with the placeholder idea - even if all the record contained was a title, unique_id that matched with the children's parent_id and a accessLevel: non-public field, that would probably be enough for the public to understand that the children datasets were a public subset of this larger, non-public dataset.

raking08 commented 10 years ago

Hi CEW, Those are my thoughts too, in fact I am modeling this in separate tables so that the child can inherit characteristics of the parent without needing double data maintenance, but still output the flat file to the public exposure site. Due to additional business requirements for data lifecycle management, we may have an additional layer below the child as well in the internal repository, but it wont be public site relevant. I would be very curious to know if anyone has an internal system that combines managing the core metadata for all data sets ( internal and external) together with data lifecycle management functions.

dsmorgan77 commented 10 years ago

@raking08 I don't see the problem with your particular instance of mentioning a dataset that has PII in your public data listing. I know it's optional, but the public already has notice that there is a system that collects data containing PII. How? Because your agency is publishing Privacy Impact Assessments and System of Records Notices telling them just that. In short, the public already knows.

And, the notion that the minimum required metadata is somehow "too much" metadata is kind of ludicrous. The title, description, and contact point (for a privacy-sensitive dataset, that'd just be the redress point of contact), and agency identifiers are all innocuous and, again, already made public when you release a PIA and a SORN.

raking08 commented 10 years ago

Dear dsmorgan77, Thank you for your input; however, I would caution you to not judge others so quickly when you do not have the specifics. While I don't disagree with your initial assertion that the public knows, it is not my decision (nor yours) if this metadata is to be exposed. It has been made clear to me that certain internal parent datasets will not be exposed even for the minimal metadata, but their children will be. In fact, they do have a very reasonable rational for this position but that is not for discussion in this forum. Others are pointing to a workable solution which I appreciate.

lilybradley commented 10 years ago

@raking08 and other interested parties, it could be helpful to have an offline discussion about the enterprise inventory and metadata for agencies that are facing higher barriers to exposing their metadata for restricted datasets. It would be helpful to discuss good/best practices, lessons learned, etc. Ping me (first.last@hhs.gov) via your .gov email address, if you would like to join an informal conf call discussion.

ghost commented 10 years ago

We've been working on this for a little while and have determined that parent-child relationships force the user to navigate in a hierarchical fashion when often data is related in a multidimensional 'hyper-cube' of attributes / parameters. In other words, I might be interested in other data based on any one of a number of dimensions present in that dataset (e.g., who: subject, where: location, when: years covered, etc.).

My personal take away from all the metadata modeling work that I've been doing for the past year is that it's in order to develop / discover a common metadata standard (potentially versioned to accommodate change) that can adequately describe all relevant datasets so that a single nuanced difference (e.g., as described by Philip at the beginning of this thread) can serve to both filter out all non-relevant datasets while grouping the most closely related datasets in a manner that enables the relationship to be implied by the proximity of two or more datasets to each other in the results of a given query.

philipashlock commented 10 years ago

It sounds like there might be broader interpretations of a collection than what I originally had in mind.

What prompted this (and what might be a more solvable initial scope to approach here) are collections that are made up of nothing more than subsets of a master dataset. This means there are no transformations or other alterations going from the parent to the child other than excluding the rows that don't fit into the subset. In other words, the parent and children should have the same schema and column headings and you should be able to aggregate or concatenate all children to represent the parent.

I don't think this has to be a strict requirement, but it would be helpful to be clear and consistent about what's implied by the relationship of collections.

Some examples of this are the portions of TIGER Line data that are not available as a national file and require you to download all subsets and merge them together if you want national coverage.

You could also argue that TIGER data is released as a time series with each date of collection and publication so it would also make sense to put each year as a child of a parent for all historical TIGER data. While this would extend the family tree another level, I think you could still indicate that relationship without it becoming a strict hierarchy that users are forced to navigate.

If you were to aggregate all releases of TIGER data as a time series it's pretty clear that you couldn't aggregate those all into one master file because of changes in the schema and format of the data from year to year, but I think it would still be valuable to indicate they're all part of the same overall collection.

One way to distinguish between these different ways of defining collections would be that if the collection is comprised of nothing more than subsets of a master dataset, than the parent of the collection should also provide a merged file of everything in the collection as a listing in the distribution

Although it would probably be better to be explicit about the relationship if we're allowing a variety of different kinds.

ghost commented 10 years ago

Ah, I see. So, in this case, would we be talking about a series rather than a parent:child relationship?

philipashlock commented 10 years ago

To date I think the only way we've discussed identifying something as being part of a series is here in this issue as being part of a collection, so yes, as a parent:child relationship.

That said, there is some nuance about what makes a series a series. If the nature of the way the data is collected is slightly different or the structure of the data is slightly different then maybe it's not part of the same series in the most strict technical sense, but from a more intuitive perspective it might still fit that description.

More often what I see is something that is collected on an ongoing basis and released as monthly files. In that case the structure of the data is exactly the same and the files are only being released by the month to make them smaller or so that people don't have to download a larger annual file or whole master file just to get the most recent update. In that case, the "series" would easily fit the definition of a collection just being comprised of strict subsets of a master file.

ghost commented 10 years ago

Got it. As @lilybradley mentioned, Dublin Core might have some useful features in this case:

http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms

lilybradley commented 10 years ago

More specifically, I find the diagrams below helpful. While I understand it is not a currently practical priority, these diagrams also convey how data.gov will/could interact with the semantic web via DBpedia.

ex-lab-col-rel ex-ord-col

gbinal commented 10 years ago

I'm in favor of this as an optional field. It seems like next we need someone to articulate an exact vision for what this would actually look like in practice.

katucker commented 10 years ago

A generalized link array property could handle the simple parent-child relationship but retain the flexibility to support other relationships between datasets. RFC 5988 describes the link property from an HTML, HTTP and Atom perspective. A similar approach could be used in the Common Core Metadata Schema.

A registry of Link Relationship Types is maintained by IANA. That registry includes the item relation type for a parent dataset to reference all the children, and the up relation type to reference the parent from the child. It also contains next, prev or previous, start, first, and last relation types to navigate datasets at the child level.

smrgeoinfo commented 10 years ago

+1 on a 'link' or 'relatedResources' array property

philipashlock commented 10 years ago

I'm going to suggest the Dublin Core isPartOf property on each child dataset referencing the identifier of the parent. This property is also used by schema.org. Dublin Core defines it as:

A related resource in which the described resource is physically or logically included.

isPartOf would only be used for datasets that are subsets of the larger collection. For anything that is derived or transformed from another source dataset, I would recommend the Dublin Core source property referencing the identifier of the source. Dublin Core defines it as:

The described resource may be derived from the related resource in whole or in part. Recommended best practice is to identify the related resource by means of a string conforming to a formal identification system.

dsmorgan77 commented 10 years ago

:+1: for isPartOf ... do we need to have the hasPart property on the parent dataset?

project-open-data / project-open-data.github.io

Guidance on defining collections to group datasets released as a series or in fragments #258