open-data-standards / permitdata.org

:hammer: A website for the BLDS Data Specification
https://permitdata.org

Mutability and fields (e.g. created/updated) #61

Open bensheldon opened 9 years ago

bensheldon commented 9 years ago

The schema should define resource lifecycles and accompanying fields. It should be strict about which modifications trigger a new updated timestamp (hopefully all of them, but it should declare that behavior).

This is important for ETLing and syncing data.

mheadd commented 9 years ago

@bensheldon Apologies for the late reply on this.

Can you expand on this with a quick example? Thanks!

bensheldon commented 9 years ago

As a user syncing data between my own database and a BLDS data set

When a permit is added to the BLDS data set (not necessarily when it is accepted by the city), it should have created_at and updated_at touched

And when the data row is changed within the BLDS data set, it should have updated_at touched

From my experience with Open311, in which there is often both an underlying Ticket system and an intermediary/vendor/integrator serving the Open311 data, there was confusion about whether the timestamp fields represent the canonical Ticket system or the intermediary's data, which led to situations like:

A public Open311 record was modified, but this was not reflected in "updated_at" because the modification was the result of a change to a secondary dataset that was integrated by the intermediary, rather than the primary record changing. In this case, the intermediary interpreted "updated_at" to reflect only changes to the primary record, not any secondary records, even though the secondary change altered the data served by Open311.

An analogous situation here might be: a building permit was not changed, but a separate contractor form was amended, causing a change to a contractor1 field. IMO, this should trigger updated_at and this behavior should be part of the specification.
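To make the proposed behavior concrete, here is a minimal sketch of the "touch" semantics described above. It is not part of BLDS: the `PermitDataSet` class, the `permit_num` key, and the injectable `clock` are all hypothetical names chosen for illustration, and `contractor1` simply stands in for any field whose value comes from a secondary record.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict


def utc_now() -> datetime:
    """Default clock: current time in UTC."""
    return datetime.now(timezone.utc)


@dataclass
class PermitRow:
    permit_num: str           # hypothetical identifier field
    fields: Dict[str, str]    # published field values, e.g. "contractor1"
    created_at: datetime
    updated_at: datetime


class PermitDataSet:
    """Hypothetical in-memory stand-in for a published BLDS data set."""

    def __init__(self, clock: Callable[[], datetime] = utc_now):
        self._clock = clock
        self.rows: Dict[str, PermitRow] = {}

    def add(self, permit_num: str, fields: Dict[str, str]) -> PermitRow:
        # Adding a row touches both created_at and updated_at.
        now = self._clock()
        row = PermitRow(permit_num, dict(fields), created_at=now, updated_at=now)
        self.rows[permit_num] = row
        return row

    def set_field(self, permit_num: str, name: str, value: str) -> PermitRow:
        # Any change to the published row touches updated_at, regardless of
        # which upstream record (permit form, contractor form, ...) caused it.
        row = self.rows[permit_num]
        if row.fields.get(name) != value:
            row.fields[name] = value
            row.updated_at = self._clock()
        return row
```

Under these assumptions, amending a contractor form would reach the data set as `set_field(permit_num, "contractor1", new_value)` and bump `updated_at`, even though the permit itself was never edited. Writing an unchanged value leaves `updated_at` alone, which is one possible reading of "changed"; a spec could equally require a touch on every write, as long as it says so.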

mheadd commented 9 years ago

Ah, that makes sense. Any thoughts on this @axtheset?

mmartin78 commented 9 years ago

This is very useful, but perhaps should be optional?

bensheldon commented 9 years ago

The benefit of this behavior is ensuring data integrity and improving the efficiency of data syncing between producers and consumers.
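As a rough illustration of the syncing benefit, a consumer that can trust updated_at only needs to fetch rows changed since its last sync. The sketch below assumes records are plain dicts with hypothetical `permit_num` and `updated_at` keys; `changed_since` and `sync` are illustrative helpers, not anything defined by BLDS.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List, Optional

Record = Dict[str, Any]


def changed_since(records: Iterable[Record], since: Optional[datetime]) -> List[Record]:
    """Return records whose updated_at is strictly later than `since`.

    With since=None (first sync) every record is returned.
    """
    return [r for r in records if since is None or r["updated_at"] > since]


def sync(local: Dict[str, Record], remote: Iterable[Record],
         last_sync: Optional[datetime]) -> Optional[datetime]:
    """Copy changed remote records into `local`, keyed by permit_num.

    Returns the new high-water mark: the latest updated_at seen so far.
    """
    high_water = last_sync
    for record in changed_since(remote, last_sync):
        local[record["permit_num"]] = dict(record)
        if high_water is None or record["updated_at"] > high_water:
            high_water = record["updated_at"]
    return high_water
```

This only works if every change to the served data touches updated_at. If a secondary-record change alters a row without bumping the timestamp, `changed_since` silently skips it and the consumer has to fall back to re-downloading and diffing the whole data set.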

To speak again from my experience with the Open311 specification, I think that defining only a data schema (syntax), without defining the behaviors (semantics) of those fields, makes it very difficult to actually integrate systems that conform to the specification.

mmartin78 commented 9 years ago

I agree with defining the semantics; I just think it should be optional, because I bet most agencies don't capture this data at all today.

bensheldon commented 9 years ago

Maybe we should expand the discussion to canonical vs non-canonical fields. In asking for both timestamps and a semantics for timestamps, I don't have a preference for whether this represents canonical data (e.g. a datetime that's been stamped on the original form), or non-canonical data (the datetime that's stored in the intermediary database), other than to ask that the representation and semantics be defined as part of the spec.