votinginfoproject / vip-specification

The Voting Information Project XML specification.
http://vip-specification.readthedocs.io/en/release/

Remove duplication between SimpleAddressType and Contact types #112

Closed cjerdonek closed 9 years ago

cjerdonek commented 9 years ago

It looks like there is duplication between the SimpleAddressType and Contact types.

cjerdonek commented 9 years ago

It looks like SimpleAddressType is only used by PollingLocation, which also has HoursOpenId. So one possibility would be to change PollingLocation to use Contact instead.

Another possibility would be to have Contact adopt SimpleAddressType's more structured form of address, instead of its current unstructured form:

  <xs:element name="AddressLine" type="xs:string" minOccurs="0" maxOccurs="unbounded" />

jungshadow commented 9 years ago

@cjerdonek The latter option seems like the most flexible and the "best of both worlds."

cjerdonek commented 9 years ago

Okay, I will make a PR.

jungshadow commented 9 years ago

@cjerdonek Great!

cjerdonek commented 9 years ago

Created PR #114.

jdmgoogle commented 9 years ago

TIMEOUT! :)

I understand the desire to have a single address type in the spec, but unfortunately (for now) we need to have Contact use multiple lines (for 1622.2 compatibility) and VIP use the SimpleAddressType.

The TL;DR version is that 1622.2 will likely be the basis of an international election results reporting standard, one for countries for which the "city, state, zip" framework doesn't work, whereas VIP continues to need to support street segments for its core mission of polling place lookups.

My counter-proposal is to open a post-5.0 issue to locate and use one address standard which fits both of these use cases. So please drop #114, and I'll file a new issue as a placeholder for a long-term address standard.

cjerdonek commented 9 years ago

> My counter-proposal is to open a post-5.0 issue to locate and use one address standard which fits both of these use cases.

FYI, it's a bit aggressive to close this issue and reopen one that is basically the same. I think the more appropriate thing to do would have been to continue the discussion here and change the milestone from 5.0 to post-5.0. Why was this issue closed?

jdmgoogle commented 9 years ago

I suppose I could have kept this one open and renamed it. Primarily I wanted something different to reflect the different and larger scope of the issue. Removing duplication between the two objects is an action that will flow out of the address specification evaluation process, but the actual activity that needs to happen before that is to identify and evaluate different addressing standards.

cjerdonek commented 9 years ago

Okay, I can see how the scope of that issue is more general. I do think, though, that as a matter of courtesy you should discuss before immediately closing.

I'm going to reopen this issue and mark as "Up for Debate" because I do think there is something worth doing in the short term -- or at least discussing it.

jdmgoogle commented 9 years ago

I apologize for closing it preemptively and not giving you a chance to digest my comments. That was overly abrupt and I apologize.

However, on the underlying issue my original stance hasn't changed. Completely revamping address formatting in either 1622 or VIP is not feasible in time for 5.0, so whatever changes get made will have to be post-5.0.

cjerdonek commented 9 years ago

I'm not advocating for a complete revamp. My only suggestion here is that, since VIP is already collecting data only from US states, why not preserve the City / State / Zip breakdown in the VIP feeds, as opposed to dropping that breakdown? It would be trivial to convert to the 1622.2 format (simply concatenate the strings).
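For illustration, the concatenation mentioned above could look something like the sketch below. The `StructuredAddress` class and its field names are assumptions made for the example, not types taken from either spec.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StructuredAddress:
    """Hypothetical structured address; field names are illustrative only."""
    line1: str
    city: str
    state: str
    zip_code: str
    line2: Optional[str] = None


def to_address_lines(addr: StructuredAddress) -> List[str]:
    """Flatten a structured address into unstructured address lines."""
    lines = [addr.line1]
    if addr.line2:
        lines.append(addr.line2)
    # The City / State / Zip breakdown collapses into one trailing line.
    lines.append(f"{addr.city}, {addr.state} {addr.zip_code}")
    return lines
```

Going the other way (recovering City, State, and Zip from free-form lines) needs real address parsing, which is why the direction of the conversion matters here.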

jdmgoogle commented 9 years ago

This conversation seems to be sprawled out among several different threads (partially my fault). Let's decide on one place to have this conversation and close out the other ones.

cjerdonek commented 9 years ago

Let's continue the discussion here in the issue thread then. PR #114 can be closed or merged after reaching resolution here.

Continuing the PR #114 discussion after this comment:

> This is assuming that all the data about all contact information -- including data they get from campaigns, etc -- is already provided and stored natively in city, state, zip format.

Well, my point is that if it is available, it would be better to preserve that structure (and optionally drop it in downstream processing), as opposed to forcing it to be absent from the outset. There is a secondary issue, too, that EMSs look to VIP for what they should be doing if they aren't already. If the breakdown isn't present in the spec, then EMSs aren't as likely to add it.

jdmgoogle commented 9 years ago

Yes, but if it isn't available then they're completely out of luck, and we lose data. The point that "any address can be converted to an array of unstructured strings" argues for using a looser address format in all places where it isn't absolutely necessary.

A similar rationale -- "an unstructured string should not be completely removed in order to force adoption of structured data when said structured data may not be available" -- was the reason we have both structured hours and an unstructured hours string side-by-side in VIP objects. Except in this case having fields for both structured and unstructured addresses side-by-side in the same object doesn't make sense, and would not actually allow us to remove either one of the address types.

Thus -- if we're going to be consistent in our reasoning -- we should force structured data as the only option only where it is feasible to do so, and allow everything else to fall back on unstructured data.

cjerdonek commented 9 years ago

To address the "out of luck" issue, we could do something similar to what we do in cases like hours (HoursOpenId in addition to free-form text) and enumerations (allow falling back to an "other" free-form value). We could simply let the City, State, Zip info be added as a free-form address line if not present in structured form (in which case the structured elements would be blank). The precise way to implement the fall-back in the spec is less important to me than providing the option to include that data in a structured way.
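One way to picture that fall-back is sketched below, under the assumption of a record that carries free-form lines plus optional City / State / Zip fields. `ContactAddress` and its field names are hypothetical and do not come from the spec.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContactAddress:
    """Hypothetical address record; not the spec's actual Contact type."""
    address_lines: List[str] = field(default_factory=list)
    city: Optional[str] = None
    state: Optional[str] = None
    zip_code: Optional[str] = None


def has_structured_locality(addr: ContactAddress) -> bool:
    """True only when City, State, and Zip are all provided."""
    return all([addr.city, addr.state, addr.zip_code])


def display_lines(addr: ContactAddress) -> List[str]:
    """Render the address, appending the structured part when present."""
    lines = list(addr.address_lines)
    if has_structured_locality(addr):
        lines.append(f"{addr.city}, {addr.state} {addr.zip_code}")
    return lines
```

A producer without the breakdown would leave the structured fields blank and put everything in `address_lines`. Both kinds of record then render the same way for a consumer that only wants strings.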

jktomer commented 9 years ago

To be honest, structured addresses aren't all that useful anyway: noncompliance is rampant in existing feeds, leaving us to clean up every kind of shocking edge case imaginable (real example I recall from November 2014: the building name is line 1, room number is line 2, everything else is blank except the ZIP code which contains the entire mailing address).

The addresses we return to users are structured, but they are passed through the Google geocoding service and are at least a little bit reliable as a consequence. I wouldn't recommend any direct VIP feed consumer rely on structured addresses without access to a high-quality address normalization service (like Google geocoding, a USPS CASS-certified address corrector, or something equivalent). So having the addresses structured isn't an obvious win, to me.

The only exception is street segments themselves, which need a certain amount of structure in order to properly encapsulate address ranges. Even there, in practical terms, from Google's perspective the only things that need to be separate are the start and end house numbers; we pretty much just concatenate the rest together into a single string to normalize anyway.

cjerdonek commented 9 years ago

Just to be clear, the structured aspect we're talking about here is the City, State, Zip part (which is in SimpleAddressType), and not the street number portion.

I don't think we should use noncompliance as an argument for whether to include something, otherwise the spec would never improve. The solution to non-compliance is to help them comply (unless it shows an aspect of the spec is ill-defined, which I don't think is the case here).

If structured City, State, and Zip isn't actually useful, then it seems like we should be consistent and remove that from our other address types: SimpleAddressType and DetailAddressType. Otherwise, why have it only in those types?

jdmgoogle commented 9 years ago

> If structured City, State, and Zip isn't actually useful, then it seems like we should be consistent and remove that from our other address types: SimpleAddressType and DetailAddressType. Otherwise, why have it only in those types?

Because:

> Thus -- if we're going to be consistent in our reasoning -- we should force structured data as the only option only where it is feasible to do so, and allow everything else to fall back on unstructured data.

and

> The only exception is street segments themselves, which need a certain amount of structure in order to properly encapsulate address ranges.

Although the one part that @jktomer left out is wildcard street segments, which do need street names and/or city names broken out explicitly. Otherwise, yes, one needs a geocoding service to properly ingest the VIP address data.

jktomer commented 9 years ago

I'd be fine with removing most of the structure from SimpleAddressType, though I don't think it's worth delaying the 5.0 spec to do so.

And yes, @jdmgoogle is right, I forgot that wildcard segments do need more detail than standard ones.

cjerdonek commented 9 years ago

My point was that if it's not useful as @jktomer was saying, then why break it out at all, even if it's possible? If it's because there is some use, then we should be allowing it elsewhere.

jktomer commented 9 years ago

The specific use of address components in a street segment (as opposed to a point address for a polling location or whatever) is to identify the part we should use for a special comparison (range comparison for house numbers, string match after normalization for city name in wildcards) to match. That doesn't apply to, for example, polling location addresses, where we're being given a point address that we will never trust as-is and will always normalize.
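The two special comparisons described above (a range comparison on house numbers and a normalized string match on city name) can be sketched roughly as follows. This is not how any actual consumer implements it. The `StreetSegment` fields, the `"*"` wildcard marker, and the crude `normalize` helper are all assumptions made for the example.

```python
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Crude normalization: uppercase and collapse whitespace."""
    return " ".join(text.upper().split())


@dataclass
class StreetSegment:
    """Hypothetical street segment; field names are illustrative only."""
    street_name: str  # "*" stands in for a wildcard segment in this sketch
    city: str
    start_house_number: int
    end_house_number: int


def segment_matches(seg: StreetSegment, house_number: int,
                    street_name: str, city: str) -> bool:
    """Return True when a point address falls inside the segment."""
    # String match after normalization for the city name.
    if normalize(seg.city) != normalize(city):
        return False
    # Assumption: a wildcard segment matches any street in the city.
    if seg.street_name == "*":
        return True
    # Range comparison for house numbers on a standard segment.
    return (normalize(seg.street_name) == normalize(street_name)
            and seg.start_house_number <= house_number <= seg.end_house_number)
```

Only the fields used in a comparison need to be broken out. Everything else could travel as a single string, which is the point made above about polling location addresses.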

So yes, I'd be fine with moving to unstructured addresses for polling locations, election officials, candidates, and anything other than street segments, but I wouldn't block the 5.0 release on it.

jdmgoogle commented 9 years ago

To summarize @jktomer's point and to try and move forward on this, if we're going to change anything pre-5.0, my proposal is:

1) Leave the address in Contact as a set of unstructured strings;
2) Change the PollingLocation object to use 1+ strings as the address; and
3) Remove SimpleAddressType.

This is effectively option (1) that @cjerdonek suggested up at the top, except that it mandates using one or more strings in the address line of the PollingLocation. The reasoning behind that is that if a PollingLocation address looks like this:

  <xs:element name="AddressLine" type="xs:string" maxOccurs="unbounded" />

instead of this:

  <xs:element name="ContactId" type="xs:IDREF" />

it'll be easier to catch polling locations missing addresses with a simple schema validation instead of more complex logic. In fact, with the ContactId reference it would be possible for a PollingLocation to have no address information whatsoever and still validate against the schema. I'd rather avoid that.
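Since `minOccurs` defaults to 1 in XML Schema, the first declaration makes at least one AddressLine mandatory. The sketch below mimics that check in plain Python to show what the validator would flag. It is a stand-in, not a real XSD validator, and the sample feed (the `VipObject` root and the `id` values) is made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import List


def polling_locations_missing_address(feed_xml: str) -> List[str]:
    """Return ids of PollingLocation elements with no AddressLine child."""
    root = ET.fromstring(feed_xml)
    return [
        loc.get("id", "")
        for loc in root.iter("PollingLocation")
        if loc.find("AddressLine") is None
    ]


# Hypothetical feed fragment: pl1 carries inline address lines,
# pl2 only points at a Contact and has no address of its own.
SAMPLE_FEED = """
<VipObject>
  <PollingLocation id="pl1">
    <AddressLine>123 Main St</AddressLine>
    <AddressLine>Springfield, IL 62701</AddressLine>
  </PollingLocation>
  <PollingLocation id="pl2">
    <ContactId>con1</ContactId>
  </PollingLocation>
</VipObject>
"""
```

With inline address lines the check is local to each PollingLocation. With a ContactId reference, a checker would also have to resolve the reference and inspect the Contact it points to.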

cjerdonek commented 9 years ago

That sounds fine with me. Thanks for engaging with me on this, @jdmgoogle and @jktomer.

cjerdonek commented 9 years ago

I updated PR #114 to use @jdmgoogle's proposal.

cjerdonek commented 9 years ago

Closed by PR #114.