Attribute review - Githubissues

lxbarth commented 11 years ago

The script currently only populates addr:housenumber and addr:street

[x] Import GIS_ID -> dcgis:gis_id? Previous imports included this.
[x] Import addr:city and ZIPCODE -> addr:postcode? Previous imports did not include this.
[x] Make sure we're fine to not import or populate dcgis:captureyear, dcgis:lot, dcgis:square, source=dcgis, dataset=buildings
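A minimal sketch of the tag mapping this checklist is about, for illustration only. GIS_ID and ZIPCODE are the source fields named above; ADDRNUM and STNAME are hypothetical field names, and the addr:city value is an assumption rather than anything the import script is confirmed to use.

```python
def address_tags(record, include_gis_id=True, include_city=True):
    """Map one source record (a dict) to OSM tags.

    ADDRNUM and STNAME are hypothetical field names; GIS_ID and ZIPCODE
    are the fields named in the checklist.
    """
    tags = {
        "addr:housenumber": str(record["ADDRNUM"]).strip(),
        "addr:street": record["STNAME"].strip(),
    }
    zipcode = record.get("ZIPCODE")
    if zipcode:
        tags["addr:postcode"] = str(zipcode).strip()
    if include_city:
        # Assumed value; the thread does not say what addr:city should hold.
        tags["addr:city"] = "Washington"
    if include_gis_id and record.get("GIS_ID"):
        tags["dcgis:gis_id"] = str(record["GIS_ID"])
    return tags
```

The two flags mirror the open questions in the checklist, so either tag can be switched off without touching the rest of the mapping.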

Example: building attributes from a previous import

Example: address attributes from a previous import

andinocl commented 11 years ago

@lxbarth is the thought that in keeping the dcgis_id, it's implied that the source is dcgis (and since we're combining two datasources it no longer makes sense to say "addresses" or "building" since we're conflating)? Lot and Square are tied to addresses, and can be messy for things like condos (where there's not a one-to-one relationship with the rooftop) and that makes sense to drop. Captureyear (or the gis pub date) could be useful in the future for checking against satellite imagery, or when running this script in maintenance mode in the future.

emacsen commented 11 years ago

I would forgo dcgis:gis_id or any dataset identifier. It was the best thinking at the time, but we've since realized that conflation with future datasets simply cannot rely on these attributes- for a variety of reasons. The solution when doing conflation is to use geometry, address or building name (the same as a mapper would use). It's more difficult but more reliable.

lxbarth commented 11 years ago

@emacsen -

but we've since realized that conflation with future datasets simply cannot rely on these attributes- for a variety of reasons.

Can you expand on this? Because to me that's not clear at all. The id's seem useful.

emacsen commented 11 years ago

Sure- and at this point, I should write this in a FAQ somewhere...

Over time, we've found that the hard part with doing imports in OSM isn't the initial import, but the conflation process. Going from no data to some data is easy, but going from some data to new data is hard.

So the obvious solution is to include some kind of upstream identifier. But what we've found, painfully, is that it doesn't work, and here's why (in no particular order):

1) In some cases, the identifiers change

This does not apply for DC, but in some localities, an object identifier will not be consistent across datasets, or will have some kind of historical issue, such as the problem TIGER has. This makes the conflation process even between the organization's own data hard to do. Including it in OSM only gives the illusion that this is useful data.

2) Conflation between OSM, or other datasets

Doing that second conflation means that you need to conflate objects from OSM. In the case of buildings, users may be tracing based on photography, from walking around, possibly even other imports.

None of these external sources will have the same locality identifier, so you will always need to use a conflation technique that does not rely on it, which ultimately means that you didn't really need it to start.

So if we don't need it, then we can use another technique to do the conflation. For buildings, I suggest geometry and address as a good start.
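As a rough sketch of what conflating on geometry and address could look like, assuming each building is a dict with "tags" and a list of (lon, lat) "coords". The data shape, the function names and the distance thresholds are all assumptions made for illustration, not part of any existing import script.

```python
import math


def normalize_address(tags):
    """Return a comparable (housenumber, street) pair, or None if incomplete."""
    number = tags.get("addr:housenumber", "").strip().lower()
    street = " ".join(tags.get("addr:street", "").lower().split())
    return (number, street) if number and street else None


def centroid(coords):
    """Plain vertex average (not area-weighted), good enough for a sketch."""
    lons = [c[0] for c in coords]
    lats = [c[1] for c in coords]
    return (sum(lons) / len(lons), sum(lats) / len(lats))


def distance_m(a, b):
    """Approximate distance in metres (equirectangular approximation)."""
    lon1, lat1 = a
    lon2, lat2 = b
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return 6371000 * math.hypot(x, y)


def match_building(candidate, existing, max_distance_m=15.0, address_distance_m=45.0):
    """Return the best-matching existing building, or None.

    A matching address allows a looser distance limit; address matches
    are preferred over pure proximity matches. Thresholds are assumed.
    """
    cand_addr = normalize_address(candidate["tags"])
    cand_center = centroid(candidate["coords"])
    best, best_key = None, None
    for building in existing:
        d = distance_m(cand_center, centroid(building["coords"]))
        same_addr = cand_addr is not None and cand_addr == normalize_address(building["tags"])
        if d > (address_distance_m if same_addr else max_distance_m):
            continue
        key = (not same_addr, d)  # address matches sort first, then nearest
        if best_key is None or key < best_key:
            best, best_key = building, key
    return best
```

Nothing here depends on an upstream identifier, so it treats hand-traced buildings and imported ones the same way. A real implementation would likely compare footprint overlap rather than centroids.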

3) It makes OSMers less likely to work on the data

This is one where I wish I had some statistics on my side, but my experience (talking to people, and my own personal experience) has told me that when data appears imported, people are less likely to fix it. In my own experience, this is because I worry I'm touching the wrong thing, or that if I combine two objects, it will confuse the import system, etc.

That means imported data is going to be less likely fixed, which negates the point of doing an import (rather than a mixin).

4) It removes the idea that OSM is a collection of datasets

This is stylistic, but we have a problem where users think that OSM is a collection of data (sort of like wikimedia is a collection of stuff) and so they do these imports.

Removing those tags (assuming we can do conflation via some other means) will result in the data "feeling natural", or as Jason once put it "Indistinguishable from a human mapper".

I'm happy to revisit this view, but this is a change in my opinion that's taken years to come to. I didn't think this way a year ago, and certainly didn't have this view when I was doing the DC imports...

pnorman commented 11 years ago

I included a unique identifier when I did the Surrey address import and now mildly regret it. I couldn't use it for the update for reason 2 from above, users had since mapped new addresses (and duplicate addresses) which didn't have IDs so I had to write conflation logic that ended up ignoring IDs.

Something else you'll find is that the IDs will get mangled in unpredictable ways. Serge found TIGER tags edited in very strange ways when writing bot-mode.

If you do want to do an association between OSM ID and DC ID, the right way to do it is probably to capture it from the initial upload, either by looking at the of the changeset and applying logic, or parsing the

lxbarth commented 11 years ago

dcgis id

I would include the id as an external reference point, just like we include addresses on buildings. I don't see the harm in this and I'm leaning towards including it.
I would not include the id for later conflation logic, I don't think that works as @pnorman and @emacsen laid out

I would love to discuss this on IM or voice.

Other identifiers

From discussion above I take it we're good to not include or populate:

dcgis:captureyear
dcgis:lot
dcgis:square
source=dcgis
dataset=buildings

ZIP

Any feedback on remaining:

[ ] Should we import addr:city and ZIPCODE -> addr:postcode? Previous imports did not include this.

emacsen commented 11 years ago

If I understand this comment, you're suggesting including the building id but not using it for conflation? If so, then can you help me understand why you want to include it?

For zipcode, I vote yes, because zip codes are really fuzzy and non-polygonal, so including them makes sense. addr:city is more "optional" but my take is that it doesn't hurt anything, and it makes parsing the address easier (e.g. if you get an extract of the buildings, then you don't have to do any secondary step to know the city). It also potentially reduces the problems associated with bad geocoding.

mikelmaron commented 11 years ago

Including dcgis:gis_id seems pretty reasonable to me, and useful for partial conflation. Assuming dcgis:captureyear is maintained by DC GIS later, we can simplify later conflation by working on just those building footprints in the data, and can more easily detect which buildings can simply be replaced with new data (they haven't changed in OSM) or need more human verification (they've changed in OSM in the meantime). It also allows for reporting community change back to DC GIS (which may not want it now, but if proven useful, could be interested later).
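One hedged way to picture the "replace vs. verify" split described above: OSM elements carry a version number and the name of the last editing user, so a building still at version 1 under the import account has not been edited directly since upload. The account name and the dict layout below are assumptions for illustration.

```python
def triage(elements, import_user):
    """Split element ids into (replaceable, needs_review).

    `elements` is a list of dicts with "id", "version" and "user" keys,
    an assumed layout that loosely follows OSM element metadata.
    """
    replaceable, needs_review = [], []
    for element in elements:
        untouched = element["version"] == 1 and element["user"] == import_user
        (replaceable if untouched else needs_review).append(element["id"])
    return replaceable, needs_review
```

Note the limitation: moving a way's nodes changes the node versions, not the way's version, so a check like this would also need to look at the nodes before treating a footprint as unchanged.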

1) In some cases, the identifiers change. This does not apply for DC

Yea, if the identifiers are known to change, they're useless as identifiers. If DC identifiers are proper identifiers, not including them on the basis of a guessed perception about mappers and identifiers isn't justified.

andinocl commented 11 years ago

+1 @mikelmaron From an extremely novice OSM perspective, and having approached the mapping from a solo/anti-community standpoint, my view on editing imported data was always to look at it in context. For DC, the capturedate in particular was useful for me in determining how "fresh" the data was that I was proposing to mess around with. In near SE, for example, it helped to identify areas where I needed to look at the GIS data more closely or take a single GIS building and update it because I knew the GIS data to be out of date. Absent an import that automatically discards existing buildings, I could see the capturedate being useful for one-off footprint updates. We don't know whether GIS_ID is going to be useful or not, but it might be -- the cost of importing it now is very low, the potential cost of not having it as a tool for conflation later is higher. For the rest (lot, square, feature code, etc), agree with dumping.

emacsen commented 11 years ago

not including them on the basis of a guessed perception about mappers and identifiers isn't justified.

My experience with mappers and imported data is fairly consistent. I've talked to mappers who tell me explicitly that they don't touch imported data because they're concerned about "messing it up", where they don't have that same perception about the rest of the map. I have encountered datasets where I also have that same problem, and I'm not inexperienced nor am I timid. My advice, then, is to avoid the potential for confusion. My question was if we aren't going to use the data for conflation- why include it at all...

Including dcgis:gis_id seems pretty reasonable to me, and useful for partial conflation

This only makes sense if you assume that users edit in completely predictable ways. My experience with TIGER shows that this isn't how users edit the map.

Let me give you a few examples of how the DC data could be edited in ways that would disrupt a conflation process- these would be consistent with the kinds of things I saw with TIGER:

One simple example of how conflation can be interfered with by well-meaning users is if they do a copy and paste. For example, someone might see a building that's similar to the one they're working with, or decide that a building is "closer to" the building they are working on, and instead of recreating the object, they move another building and modify it (so now the building has the right attributes but the "wrong" id). Or what if they copy and paste a building?

What if they combine two buildings, which building is "right"? And how would they know?

And what if they split one building into two buildings?

I know that these scenarios are less common with buildings than with roads, but they did happen. We also had users remove attributes altogether, or edit them accidentally.

In NYC there's a dataset of imported bike racks where I'm not sure which bike rack I'm looking at, based on ID. If there are two bike racks on a street, I'm not sure which one I'm looking at, and if there's two in the dataset and one on the street- which one should I remove? I'm never sure, so I never edit them.

Or, as one final example, as we had in DC, thousands of buildings were on top of one another- sharing the same geometry (and in fact, sharing the same nodes). They had the same building ids too. It was a mess, and I spent a long, long time fixing them, one by one. I don't even know if they're all fixed yet...
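Stacked buildings of the kind described above (ways sharing the same nodes) can be found mechanically. A small sketch, assuming `ways` maps a way id to its ordered list of node ids; that layout and the function names are illustrative only.

```python
from collections import defaultdict


def ring_key(node_refs):
    """Key that is equal for ways built from the same set of nodes.

    Using a set ignores start position and direction. It would also
    equate two different rings over identical nodes, which is assumed
    to be rare enough for a first pass.
    """
    return frozenset(node_refs)


def find_stacked_ways(ways):
    """Return sorted groups of way ids that share all their nodes."""
    groups = defaultdict(list)
    for way_id, node_refs in ways.items():
        groups[ring_key(node_refs)].append(way_id)
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)
```

Each returned group is a candidate for manual review; the sketch only reports duplicates and does not decide which copy to keep.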

Unless someone in DC is going to be consistently checking and curating, these errors will occur, either in DC or elsewhere. It's just inevitable based on how much data we're dealing with. But once it happens, then the conflation system will be "messed up" (that's the technical term).

My view is that knowing that this kind of thing happens, let's just avoid it causing problems by using more user-visible attributes for conflation (such as address and the actual geometries).

This is just a recommendation. I don't really have a stake in this. I used to think the upstream id was useful for conflation, then I dealt with TIGER, now I think it's not. But at the end of the day, I'm not going to be the one cleaning it up- I'm just giving my .02.

lxbarth commented 11 years ago

Per discussion last night on import-us:

14 Include dc:gis
15 Expand suffixes and quadrants.
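A minimal sketch of what expanding suffixes and quadrants might involve, assuming street names end in an optional quadrant preceded by an optional abbreviated suffix. The lookup tables are deliberately short and the function name is hypothetical; the actual script may handle this differently.

```python
# Illustrative, incomplete lookup tables.
SUFFIXES = {
    "ST": "Street",
    "AVE": "Avenue",
    "RD": "Road",
    "PL": "Place",
    "CT": "Court",
    "DR": "Drive",
    "BLVD": "Boulevard",
}
QUADRANTS = {"NW": "Northwest", "NE": "Northeast", "SW": "Southwest", "SE": "Southeast"}


def expand_street(name):
    """Expand a trailing quadrant and the suffix just before it."""
    words = name.split()
    quadrant = None
    if words and words[-1].upper() in QUADRANTS:
        quadrant = QUADRANTS[words.pop().upper()]
    if words:
        suffix = words[-1].upper().rstrip(".")
        if suffix in SUFFIXES:
            words[-1] = SUFFIXES[suffix]
    if quadrant:
        words.append(quadrant)
    return " ".join(words)
```

For example, `expand_street("14th St NW")` gives "14th Street Northwest". Only the last one or two words are touched, so a lettered street such as "E St NW" keeps its name.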

osmlab / dcbuildings

Attribute review #10
