acouch commented 8 years ago

Are there any guidelines on what to do for catalogs that are trying to be POD compliant when publishing federated data from non-compliant sources?

For example the City of Chicago doesn't provide a compliant contactPoint: https://data.cityofchicago.org/data.json

In case they are able to update that here is a screenshot:

If a catalog is grabbing a dataset like the one above with a non-compliant contactPoint, what should they do to stay in compliance?

Should they add their own contactPoint even if they are not the actual contactPoint for that dataset? Or should they provide a dummy address that indicates that a valid contactPoint wasn't provided?

rebeccawilliams commented 8 years ago

Thanks @acouch. FWIW, non-federal datasets do not require contactPoint information to be successfully harvested by Data.gov. This should be updated in the schema documentation and the usage notes.

I have also opened an Issue on the Project Open Data Validator to stop showing errors on this point.

acouch commented 8 years ago

@rebeccawilliams I'm not sure this answers the question: what do federal catalogs do if they are harvesting from non-federal sources?

Say "Example Federal Agency" harvests the above dataset from Chicago and provides that dataset in "Example Federal Agency's" data.json file.

Since no contactPoint is provided, should "Example Federal Agency" provide their own contactPoint even though "Example Federal Agency" does not own that dataset? Alternatively should "Example Federal Agency" provide a dummy contactPoint or something to indicate that the contactPoint is not known?

JJediny commented 8 years ago

Seems like the specific scenario you provided...

Say "Example Federal Agency" harvests the above dataset from Chicago and provides that dataset in "Example Federal Agency's" data.json file.

... shouldn't be the case, as I would think an Agency should only be re-registering/hosting a dataset if they have substantially altered, edited, or added to it, to the point that it would be its own derivative work.

But I agree that how to properly re-host data, and how to give a clear means of tracing a derived work back to the original, is not clear. A potential enhancement would be to allow contactPoint to be nested within data.json v1.1, much like how Publisher is currently used: as a nested, repeatable listing of entries that inherit hierarchy from their order. That would avoid disrupting the schema while still allowing it to address inheritance, so that someone could derive the canonical/authoritative source.

With this, though, another field with hardcoded/validated terms should be discussed in order to denote the contactPoint's role in the management of the data AND/OR "in which way or why" an agency has determined its delta from the original work (e.g. simple rehosting for better accessibility, altered structure of the data to conform to their need, added/appended their data to it while maintaining the data's structure and republishing, etc.).
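
To make the proposal above concrete, here is a minimal Python sketch of what a nested contactPoint with a role term might look like. Everything in it is an assumption for illustration only: POD v1.1 does not define nesting for contactPoint, and the `role` and `derivedFrom` keys, the role vocabulary, and the helper names are all hypothetical. The contact names and addresses are placeholders.

```python
# Hypothetical structure: POD v1.1 defines neither nesting nor a "role"
# field for contactPoint. Key names and role terms are illustrative only.
contact_point = {
    "fn": "Example Federal Agency (placeholder)",
    "hasEmail": "mailto:data@agency.example",
    "role": "rehost",
    "derivedFrom": {
        "fn": "Original Publisher (placeholder)",
        "role": "authoritative",
    },
}

# Hypothetical controlled vocabulary, loosely based on the cases listed above.
ALLOWED_ROLES = {"authoritative", "rehost", "restructured", "appended"}


def contact_chain(contact, parent_key="derivedFrom"):
    """Walk the nested contacts from the re-publisher to the original source."""
    chain = []
    while isinstance(contact, dict):
        role = contact.get("role")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unknown contactPoint role: {role!r}")
        chain.append(contact)
        contact = contact.get(parent_key)
    return chain


def authoritative_contact(contact):
    """Return the last entry in the chain, i.e. the original source's contact."""
    return contact_chain(contact)[-1]
```

With this shape, a consumer could call `authoritative_contact(contact_point)` to find who to ask about the original data, while the outermost entry still identifies who is re-hosting it.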

JJediny commented 8 years ago

On second glance, it looks like the discussion in the Issues below reached tentative agreement on using publisher rather than contactPoint for this. Regardless of approach, there would still be a need to establish a new class of terms to classify the relationship/role to the original/authoritative data source, instead of only using subOrganization?

Related Issues:

#296

#393

philipashlock commented 8 years ago

Just to reiterate John's points: as discussed in #296 and #390, agencies should not be including datasets from other entities in their data.json, but this is something we could revisit in a future version of the schema that allows more nuanced provenance (#393).

@rebeccawilliams currently we don't specify any different requirements for non-federal sources in the schema other than distinguishing which fields are essentially exclusive to the US Federal Government and I'd be pretty hesitant to change that approach.

I'm not aware of any decision to not require contactPoint for non-federal sources. The current non-federal validator does require contactPoint.fn, but it doesn't require contactPoint.hasEmail; however, if a value is provided for contactPoint.hasEmail, it must be valid.
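
The rule described above can be sketched in a few lines of Python. This is not the validator's actual code: the function name, the error messages, and the simplified email pattern are assumptions, and the `mailto:` prefix follows the convention I understand the v1.1 schema to use for hasEmail.

```python
import re

# Simplified pattern for illustration; the real validator's pattern may differ.
# Assumes hasEmail values carry a "mailto:" prefix.
MAILTO_RE = re.compile(r"^mailto:[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_contact_point_nonfederal(dataset):
    """Return a list of error strings for a dataset's contactPoint.

    Hypothetical helper mirroring the rule above: fn is required,
    hasEmail is optional but must be valid when present.
    """
    contact = dataset.get("contactPoint")
    if not isinstance(contact, dict):
        return ["contactPoint is missing or not an object"]

    errors = []
    fn = contact.get("fn")
    if not isinstance(fn, str) or not fn.strip():
        errors.append("contactPoint.fn is required")

    # hasEmail is only checked if the key was supplied at all.
    if "hasEmail" in contact:
        email = contact["hasEmail"]
        if not isinstance(email, str) or not MAILTO_RE.match(email):
            errors.append("contactPoint.hasEmail is present but not valid")
    return errors
```

Under these assumptions, a dataset with only `contactPoint.fn` passes, while one with a malformed `hasEmail` fails even though the field itself is optional.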

ksharpless commented 8 years ago

I think you're saying that we won't include metadata for datasets generated by non-federal entities. But what if those datasets are the results of federally funded research, generated by someone through a grant or contract? Would we not be permitted to include those even if the terms and conditions of the funding agreement allowed it?

acouch commented 8 years ago

agencies should not be including datasets from other entities in their data.json

Since this is the conclusion, there doesn't need to be any further guidance on harvested datasets.

I think it would be good to include this in the documentation if it isn't already. A sentence clarifying that in this section https://project-open-data.cio.gov/implementation-guide/#b-create-and-maintain-a-public-data-listing would make sense. If @rebeccawilliams or @philipashlock agree I could submit a PR to include that.

project-open-data / project-open-data.github.io

What to do if no contactPoint provided? #513
