publiclab / wherewebreathe

wherewebreathe.org

should be able to infer city/state from zip code #34

Closed jywarren closed 9 years ago

shapironick commented 10 years ago

Also in terms of order, Zip Code (or post code outside of US) should be asked first.

This also relates to the question of Country (as it can be inferred in addition to city/state). We can easily find lists of US, Canadian and Mexican zip/post codes and auto-detect which country a code belongs to. Or we could have three zip/post fields, one for each of the countries, so we don't have to spend time working on the auto-detect feature.

So depending on which ZIP field they fill out, the country can be selected in the later country question.
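A minimal sketch of the auto-detect option, assuming detection is done on format alone. The patterns below are simplified assumptions rather than a validated postal database, and the function name is hypothetical. Note that US ZIP codes and Mexican postal codes are both five digits, so format alone cannot separate them; the function therefore returns every candidate country.

```python
import re
from typing import List

# Simplified format patterns (assumptions for illustration only).
# US ZIP and Mexican postal codes share the same five-digit shape.
PATTERNS = {
    "US": re.compile(r"\d{5}(-\d{4})?"),
    "CA": re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d"),
    "MX": re.compile(r"\d{5}"),
}

def candidate_countries(code: str) -> List[str]:
    """Return every country whose postal format matches the entered code."""
    normalized = code.strip().upper()
    return [country for country, pattern in PATTERNS.items()
            if pattern.fullmatch(normalized)]
```

Because a five-digit code matches two countries, a lookup against actual code lists (or the separate per-country fields suggested above) would still be needed to settle the country.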

(Keeping Mexico and Canada in the mix relates to some future funding we may apply for and to Hispanic populations, a major manufactured home constituency along the border, that we may work with in future. We could leave out Canada and Mexico as we've got other shorter-term issues to think about.)

mmnoo commented 10 years ago

Jeff, do you mean that we don't collect city/state/country, or that we infer those in order to autopopulate the other fields for usability reasons?

If the former, I think it is important to collect city/state/country from users for data integrity.

Some people will inevitably fill in a fake postal code because they don't know theirs and are feeling too lazy to look it up. Best to allow them to do that, but then still collect state/country, giving more weight to the latter if there is a conflict. Also, we don't want to have to throw out data because there is a typo and we can't assign it to a location.
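The weighting rule in the paragraph above could be sketched roughly as follows. `ZIP_TO_STATE` is a hypothetical lookup table with made-up sample rows, and the field names are assumptions; the point is only that the user's stated value wins and the conflict is recorded rather than the record being discarded.

```python
from typing import Dict, Optional

# Hypothetical lookup; a real table would come from a postal dataset.
ZIP_TO_STATE: Dict[str, str] = {"70112": "LA", "39501": "MS"}

def resolve_state(zip_code: str, stated_state: Optional[str]) -> dict:
    """Prefer the state the user typed; fall back to the one inferred from ZIP."""
    inferred = ZIP_TO_STATE.get(zip_code.strip())
    stated = stated_state.strip().upper() if stated_state else None
    state = stated or inferred
    # Flag disagreements so they can be reviewed instead of thrown out.
    conflict = bool(stated and inferred and stated != inferred)
    return {"state": state, "conflict": conflict, "zip_known": inferred is not None}
```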

How spatial is the epidemiological research going to be anyway? In terms of spatial distribution of symptoms, wouldn't what is being studied be affected more by the larger jurisdictional entities (at the city/region/state/country level) that set building policies?

So I guess a good question to ask in terms of the postal code, city, state, country fields is: how would you rank them in terms of importance for research purposes? What level of spatial precision will you be operating on? Is postal code relevant? Is asking for postal code collecting info that is potentially too personally identifiable (like in rural areas where postal codes sometimes contain only a few people)?

I will start working on data export today and this has me thinking about which fields we will be exporting. Will start a new issue for that conversation though.

shapironick commented 10 years ago

I agree on some redundant spatial information collecting. The zip will be helpful for the epidemiology, and the city, state etc will be helpful for the community building/advocacy (connecting people that are near each other or are operating within similar jurisdictions). Also the zip codes in many of these rural places are huge, so having city may be helpful for the epi too. I think collecting overlapping data in this regard will be helpful later on. And if users can just enter in the zip and get the rest inferred (but then also later validated) then we get the benefit of the multiple perspectives of each version of location without a high user input burden.

Agreed on not sharing zip/post codes, but sharing states + countries would be important. I'm not sure whether we need to share cities, or if that would even be helpful. But city, state and zip will all be helpful for eventually mapping out the distribution of these contaminated homes.

For VIN numbers, the majority of the digits will be helpful to share as they identify the maker, the model, the plant it was made in and the year. The last six numbers are just sequential numbers and could be potentially identifying if they were all shared, so I think we should not share the last four numbers.

HUD numbers are harder to track to individuals. Let's obscure the last three numbers for privacy. I would think one in a thousand is sufficient anonymity.
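As a rough sketch of the redaction described above: hiding the last three digits leaves 10^3 = 1,000 possible originals, which is where "one in a thousand" comes from. The function name and the sample identifiers in the usage note are made up for illustration, and the number of characters to hide is left as a parameter.

```python
def mask_trailing(identifier: str, hidden: int, fill: str = "X") -> str:
    """Replace the last `hidden` characters of an identifier with a filler."""
    cleaned = identifier.strip().upper()
    if hidden <= 0:
        return cleaned
    # Never keep a negative number of characters for very short inputs.
    keep = max(len(cleaned) - hidden, 0)
    return cleaned[:keep] + fill * (len(cleaned) - keep)
```

For example, `mask_trailing("ABC123456", 3)` gives `"ABC123XXX"`.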

Any thoughts on this, Jeff?

jywarren commented 10 years ago

OK, I see what you're saying about comparing city/zip etc. Makes sense and thanks for thinking it through.

jywarren commented 10 years ago

And regarding HUD/VIN, one in a thousand is not very secure -- what exactly do we infer from the HUD in those trailing numbers/letters? Who gets access to the whole thing?

shapironick commented 10 years ago

I don't know much about what we can infer from a HUD number and can't find anything other than what is up in the wiki as far as an explanation. I'm going to try to find out more when I'm back in the States. What level of anonymity do you think would be sufficiently secure? I see full VIN/HUD numbers as similar to email, only visible to the system admin.

jywarren commented 10 years ago

I guess I'm mainly wondering what we get from asking for a full VIN/HUD -- what will be inferred from it, total? Uniqueness? Geography? Age, model? Because if only system admins see it, there's not a strong reason to ask for it in the first place; it just increases the burden on our security systems. Like storing credit card info unnecessarily. Just to play devil's advocate here.

shapironick commented 10 years ago

I think the uniqueness aspect is important. Also, we have a huge number of VINs that we know are FEMA trailers, and eventually it would be great to have a function that tells users whether their number matches our FEMA trailer database. If it is a FEMA unit (most won't know for sure without our verification) they will have certain entitlements and will potentially be able to organize better. A full VIN/HUD is required for that. The 10th anniversary of Katrina is in a year, and I hate to be opportunistic, but I see that international media attention as a vital moment for our site, to be able to tell some of the afterlife of emergency housing units and how they cast light on the everyday corrosive experiences of life in manufactured housing. I know that's not something in the budget for now, but it's just strategic thinking for how we can get ordinary forgotten exposures in the news and get people involved who might be difficult to reach otherwise. I totally get the liability, but I think this might be an area that will have benefit for users. Also, finding identities from them is a lot harder than, say, from a license plate.

mmnoo commented 10 years ago

I am wondering if it makes sense to store VIN/HUD, but for export purposes translate it to a random number, kept in a translation table that records which VIN/HUD belongs to which users. The public would see only the random housing number, which links the data records that come from the same unit. That way we can open the non-sensitive bits of our data, and it will still contain info on which responses belong to which housing units (identified by a random number).

This is only a bit more secure though, and there is still an easy way for someone's privacy to be violated:

Say two people from the same household answer the survey. One of them is a geek and can pick out their own answers from the data download (likely their answers are the most recent, or they have entered something very unique for one of their responses). They can then figure out their housing unit's randomly assigned ID, and use it to find the health info of their cohabitants.

Maybe with that in mind, we can just leave out housing unit identifiers from the data export altogether? We could still collect VIN/HUD and run stats on the info, but that part of the data wouldn't be open, for privacy purposes, maybe?

TL;DR: maybe collect VIN/HUD, but neither it nor a housing ID is included in our data export, because it is an easy privacy breach even if the number is obscured completely
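For reference, the translation-table idea discussed in this comment could be sketched like this. The class and method names are hypothetical, and the table is held in memory only for illustration. As the comment points out, this does not stop someone who recognises their own row from finding the other rows with the same export ID.

```python
import secrets
from typing import Dict

class UnitPseudonymizer:
    """Maps each VIN/HUD to a random export ID kept in a private table."""

    def __init__(self) -> None:
        # Private translation table: VIN/HUD -> random ID. Never exported.
        self._table: Dict[str, str] = {}

    def export_id(self, vin_or_hud: str) -> str:
        key = vin_or_hud.strip().upper()
        if key not in self._table:
            # Random value, so the ID reveals nothing about the VIN/HUD itself.
            self._table[key] = secrets.token_hex(8)
        return self._table[key]
```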

shapironick commented 10 years ago

These are great ideas.

I'll follow Jeff's lead on this one.

In an ideal world only one user would be allowed per VIN/HUD and the survey would be taken individually for each member of the household but within one account (maybe this should be tackled when we talk about repeating questions?), so the first compromise wouldn't be possible.

jywarren commented 10 years ago

Unique ids should not be available in an anonymous data dump -- they make it very easy to reconstruct identity, esp. with timestamps. But the data dump will presumably not be anonymous? I need to get more involved in the data download discussion anyway, but it hinges a bit on the big privacy discussion we're having, so I'll hold off for now. But we should distinguish between an aggregate dump (where you don't get individual data or timestamps or unique ids at all, just the sums) and a complete bulk download. Melissa, if you're reading this, don't worry about it yet as we're still figuring it out.
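To make the aggregate/bulk distinction concrete, here is a small sketch of what an aggregate dump might compute. The `state` field name and the sample rows in the usage note are assumptions, not the project's actual schema; the output contains only sums, with no ids, timestamps, or individual rows.

```python
from collections import Counter
from typing import Dict, Iterable

def aggregate_by_state(responses: Iterable[dict]) -> Dict[str, int]:
    """Return only per-state response counts, dropping every other field."""
    return dict(Counter(row["state"] for row in responses))
```

Given three rows, two from `"LA"` and one from `"MS"`, this returns `{"LA": 2, "MS": 1}` regardless of what other fields the rows carry.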

We could hash the VIN/HUD so that nobody knows them, but they still act as unique ids. Hashing could be done in a non-reversible way, so the system can tell if 2 people enter the same # but we never actually store it.
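A minimal sketch of that idea, with one substitution named plainly: it uses a keyed hash (HMAC-SHA-256) rather than a bare hash, because a short, structured identifier can be recovered from an unkeyed hash by enumerating candidates. The function name is hypothetical, and `secret_key` is assumed to be a server-side secret stored apart from the data.

```python
import hashlib
import hmac

def unit_fingerprint(vin_or_hud: str, secret_key: bytes) -> str:
    """Return a non-reversible fingerprint; the raw VIN/HUD is never stored."""
    # Normalize so that spacing and case differences map to the same unit.
    normalized = "".join(vin_or_hud.split()).upper().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
```

Two people entering the same number get the same fingerprint, so the system can detect a shared unit without keeping the number itself.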

BUT if we want to verify if it's a FEMA unit, we could store all the FEMA codes on the client side in javascript, ask them to enter their full VIN/HUD -- it'd match or not, and we only store whether it did or not.
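The match-and-discard logic could look something like the sketch below. It is written in Python rather than the browser JavaScript proposed above, but the logic carries over. `FEMA_UNITS` holds made-up sample values standing in for the real FEMA list, and the function names are hypothetical.

```python
from typing import Set

# Made-up sample values; the real list would be the FEMA trailer database.
FEMA_UNITS: Set[str] = {"1ABCD23E4F5678901", "9ZYXW87V6U5432109"}

def is_fema_unit(vin_or_hud: str, known_units: Set[str] = FEMA_UNITS) -> bool:
    """Check the entered number against the known list, ignoring case and spaces."""
    return "".join(vin_or_hud.split()).upper() in known_units

def record_for_storage(vin_or_hud: str) -> dict:
    # Only the match result is kept; the identifier itself is discarded.
    return {"fema_match": is_fema_unit(vin_or_hud)}
```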

So unless there are other reasons to store VIN/HUD, perhaps we can skip it?

jywarren commented 9 years ago

OK, so, after in-person discussion, we're going to save the VIN/HUD, securely, and add an explanation.