votinginfoproject / vip-specification

The Voting Information Project XML specification.
http://vip-specification.readthedocs.io/en/release/

Add a changelog feature to the VIP specification #410

Open afsmythe opened 4 years ago

afsmythe commented 4 years ago

Because elections data changes constantly, the Voting Information Project often needs to make real-time updates to public-facing data as an interim fix when the source data for an election has changed. These immediate fixes are most commonly used when a PollingLocation has closed or been reassigned, but the concept applies more broadly to other elements within a VipObject, including ballot data. In these cases the data provider will update the source data to include the change, but there is a lag between when the change is known and when it can be collected, processed, and ingested into the data stores that serve the information publicly.

Presently the specification has no concept of a machine-readable changelog of updates. Instead, these updates are made manually in the form of "service tickets". As VIP data becomes more widely distributed, data changes will need to be handled in a structured, programmatic way, both to ensure trust in the data and to provide an easy-to-read changelog of data change requests. There are two types of data change request:

  • block-list; invalidates the record so that it is not returned within the public-facing data
  • op-tool; changes the record's data, and the changed record is returned publicly

Thanks to some input from Google we've identified a proposal to handle a changelog resource of the Project. I'm creating the Issue here to collect some input from other VIP stakeholders. Thanks for your time and any feedback you'd like to provide!

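The proposal to construct a changelog resource is laid out as:

  • Adding a new base class VipDelta to the specification, extendable with the change type enumerated as [block-list, op-tool]
  • Creating new classes (for example, a PollingLocationDelta) which will extend VipDelta and the existing base class relevant to the delta (in this case, PollingLocation)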

It is proposed that these VipDelta records be added to a separate file, apart from the main VipObject data feed file. The delta records will be aligned with the associated VipObject using the existing dual identification eid and nid.
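
To make the proposal more concrete, here is a minimal Python sketch of how a consumer might apply delta records from a separate file to an already-loaded feed. Everything about the delta file is an assumption made for illustration: the `VipDeltas` wrapper, the `changeType` attribute, and matching on the `id` attribute as a stand-in for the eid/nid pair, whose exact form is not spelled out here. The `Delta` suffix is used to check that a delta only touches a record of its own class.

```python
import xml.etree.ElementTree as ET

FEED = """<VipObject>
  <PollingLocation id="pl1"><Name>Old Gym</Name></PollingLocation>
  <PollingLocation id="pl2"><Name>Library</Name></PollingLocation>
</VipObject>"""

# Hypothetical delta file: wrapper element, attribute names and layout are assumptions.
DELTAS = """<VipDeltas>
  <PollingLocationDelta id="pl1" changeType="block-list"/>
  <PollingLocationDelta id="pl2" changeType="op-tool"><Name>Town Hall</Name></PollingLocationDelta>
</VipDeltas>"""

SUFFIX = "Delta"


def apply_deltas(feed_root, delta_root):
    """Apply block-list and op-tool deltas to a parsed feed, in place."""
    # Index top-level records by (class name, id) so that a delta can only
    # match a record of the class it was derived from.
    records = {(el.tag, el.get("id")): el for el in feed_root}
    for delta in delta_root:
        if not delta.tag.endswith(SUFFIX):
            raise ValueError(f"not a delta record: {delta.tag}")
        base_tag = delta.tag[: -len(SUFFIX)]
        target = records.get((base_tag, delta.get("id")))
        if target is None:
            continue  # no matching record; a real consumer would likely log this
        change = delta.get("changeType")
        if change == "block-list":
            # Suppress the record entirely.
            feed_root.remove(target)
        elif change == "op-tool":
            # Replace the record's data with the data carried by the delta.
            for child in list(target):
                target.remove(child)
            target.extend(list(delta))
        else:
            raise ValueError(f"unknown change type: {change!r}")
    return feed_root


feed = ET.fromstring(FEED)
apply_deltas(feed, ET.fromstring(DELTAS))
print([pl.findtext("Name") for pl in feed.findall("PollingLocation")])  # ['Town Hall']
```

In this sketch an op-tool delta replaces the whole record; a field-level merge would be another reasonable reading of the proposal.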

cjerdonek commented 4 years ago

Does this mean the proposal is to use multiple inheritance? Has any thought been given to a lighter-weight approach that doesn’t require introducing a new class for every class?

I don’t know all the requirements this is trying to meet, but I’m wondering if the requirements couldn’t be satisfied for example with a new attribute or two recognized by each object, or perhaps instead with a single “ChangeIndex” object that lists the ids of the objects to be deleted or updated (using the info elsewhere in the document).
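
As a rough illustration of the single-index alternative, the Python sketch below reads one hypothetical `ChangeIndex` element that lists the ids of objects to delete or update, and looks up the updated objects elsewhere in the same document. The `Delete` and `Update` child elements are invented for this example; no such layout is defined in the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: one ChangeIndex plus the replacement objects themselves.
DOC = """<VipObject>
  <ChangeIndex>
    <Delete>pl1</Delete>
    <Update>pl2</Update>
  </ChangeIndex>
  <PollingLocation id="pl2"><Name>Town Hall</Name></PollingLocation>
</VipObject>"""


def read_change_index(root):
    """Return (ids_to_delete, {id: replacement_element}) for one document."""
    index = root.find("ChangeIndex")
    if index is None:
        return set(), {}
    deletes = {el.text.strip() for el in index.findall("Delete") if el.text}
    # Objects carrying an id attribute are candidates for replacement data.
    by_id = {el.get("id"): el for el in root if el.get("id")}
    updates = {}
    for el in index.findall("Update"):
        obj_id = (el.text or "").strip()
        if obj_id not in by_id:
            raise KeyError(f"update listed for {obj_id!r} but no such object in document")
        updates[obj_id] = by_id[obj_id]
    return deletes, updates


deletes, updates = read_change_index(ET.fromstring(DOC))
print(deletes, {k: v.tag for k, v in updates.items()})
```

Because the replacement objects use the existing classes, the class of an updated record is carried by its element name and no per-class delta type is needed.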

On Thu, Aug 20, 2020 at 11:57 AM Franklin Smith notifications@github.com wrote:

Presently the specification has no concept of a machine-readable changelog of updates. Instead, these updates are made manually in the form of "service tickets". As VIP data becomes more widely distributed, data changes will need to be handled in a structured, programmatic way, both to ensure trust in the data and to provide an easy-to-read changelog of data change requests. There are two types of data change request:

  • block-list; invalidates the record so that it is not returned within the public facing data
  • op-tool; changes the record's data, and the changed record is returned publicly

The proposal to construct a changelog resource is laid out as:

  • Adding a new base class VipDelta to the specification, extendable with the change type enumerated as [block-list, op-tool]
  • Creating new classes (for example, a PollingLocationDelta) which will extend VipDelta and the existing base class relevant to the delta (in this case, PollingLocation)

JDziurlaj commented 4 years ago

Is this functionality for VIP data providers to state their changes, or for consumers to understand what has/is to change?

afsmythe commented 4 years ago

Thanks for the feedback! There's been some discussion around various other options outside of these threads, and I'll try to summarize those below.

Has any thought been given to a lighter-weight approach that doesn’t require introducing a new class for every class?

The core goal is to make real-time changes to VIP data machine readable. A block-list change (removing a record from public consumption) is simple: we can just list the ID of the record to suppress. An actual change to a record (an op-tool change, such as a polling location reassignment) is harder, because we need both to provide the new data and to assert that the new record is of the same class as the original record. Any additional proposals that achieve this are welcome.

I’m wondering if the requirements couldn’t be satisfied for example with a new attribute or two recognized by each object

Could you please elaborate? This sounds interesting.

Is this functionality for VIP data providers to state their changes, or for consumers to understand what has/is to change?

Yes and yes. Given the size of VIP feeds (especially considering a street segment point file of a few million records), the processing time for data providers to produce new feeds and for consumers to ingest them can be at least 2-3 hours in a best-case scenario. For a polling location reassignment or a relocated drop box, this delay would be untenable on election day or in the days leading up to it, because we run the risk of surfacing stale information to the public. Currently, the VIP teams at Democracy Works and Google make these data changes manually in real time. The goal of this specification change is to make them machine readable and easy to both provide and consume in as close to real time as possible.

JDziurlaj commented 4 years ago

The biggest issue I see (not knowing all the details) is the handling of xs:IDREFs. If the changes are to be provided in their own file, either all the dependent data (that isn't changing) must be in that file as well, or the change file won't validate. PollingLocation only has one xs:IDREF typed element, but other types will be more difficult to handle. A possible solution is to derive new types by extension (e.g. PollingLocationDelta) and widen the data type of xs:IDREF to xs:NCName. Then the Id elements will still need to look like IDREFs, but the references won't be enforced by validators. However, this also assumes the xs:IDREF names have been stored somewhere by the VIP data consumers.
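
The validation problem can be seen with a small Python sketch that mimics what an ID/IDREF check does on a standalone change file. It is not a schema validator. It assumes, for illustration only, that reference elements can be recognized by names ending in `Id` or `Ids` and that identifiers live in an `id` attribute.

```python
import xml.etree.ElementTree as ET

# A change file holding only the changed record; the HoursOpen object it
# points to stays behind in the main feed.
DELTA_FILE = """<VipObject>
  <PollingLocation id="pl2">
    <HoursOpenId>hours1</HoursOpenId>
    <Name>Town Hall</Name>
  </PollingLocation>
</VipObject>"""


def dangling_references(root):
    """Return the sorted reference values that match no id in this file."""
    defined = {el.get("id") for el in root.iter() if el.get("id")}
    referenced = set()
    for el in root.iter():
        if not el.text:
            continue
        if el.tag.endswith("Ids"):
            # Assumed convention: plural reference elements hold a
            # whitespace-separated list of ids.
            referenced.update(el.text.split())
        elif el.tag.endswith("Id"):
            referenced.add(el.text.strip())
    return sorted(referenced - defined)


print(dangling_references(ET.fromstring(DELTA_FILE)))  # ['hours1']
```

A strict xs:IDREF check would reject this file for the same reason. Widening the type to xs:NCName amounts to skipping this check and trusting the consumer to resolve `hours1` against ids it has stored from the main feed.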

I would note that the guidance we are giving for the NIST CDFs is to not use the ObjectId (your id) for anything other than processing the file, so at least for our use cases, we would expect consumers to NOT have ObjectIds in their system.

Another option is to derive your new types by restriction instead of extension, and disallow specifying xs:IDREF elements at all. Note that you can't remove required elements using derive-by-restriction.