mattyschell / cscl-subaddress-matched

Creative Commons Zero v1.0 Universal

Determine a strategy for ranged address points #4

Closed mattyschell closed 3 years ago

mattyschell commented 3 years ago

Ranged ("range"?) addresses do not produce unique records on the combination of address point id and melissa suite.

| "range address" | address point id | suite |
|---|---|---|
| 101-103 | 999 | Apt 1 |
| 101-103 | 999 | Apt 2 |
| 101-103 | 999 | Apt 1 |
| 101-103 | 999 | Apt 2 |

These records can be uniquely identified once the new house number column is populated.

| "range address" | address point id | suite | house number |
|---|---|---|---|
| 101-103 | 999 | Apt 1 | 101 |
| 101-103 | 999 | Apt 2 | 101 |
| 101-103 | 999 | Apt 1 | 103 |
| 101-103 | 999 | Apt 2 | 103 |
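
The uniqueness claim can be checked with a minimal Python sketch over the four sample rows. The dictionary field names here are illustrative assumptions, not actual CSCL or melissa column names, and `duplicate_keys` is a hypothetical helper.

```python
from collections import Counter

# Sample rows copied from the table above. Field names are assumed
# for illustration and are not real CSCL or melissa column names.
records = [
    {"address_point_id": 999, "suite": "Apt 1", "house_number": "101"},
    {"address_point_id": 999, "suite": "Apt 2", "house_number": "101"},
    {"address_point_id": 999, "suite": "Apt 1", "house_number": "103"},
    {"address_point_id": 999, "suite": "Apt 2", "house_number": "103"},
]


def duplicate_keys(rows, fields):
    """Return the key tuples that occur more than once for the given fields."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Keyed on (address point id, suite), both suites collide.
two_column_dupes = duplicate_keys(records, ("address_point_id", "suite"))

# Adding house number to the key removes every collision.
three_column_dupes = duplicate_keys(
    records, ("address_point_id", "suite", "house_number")
)
```

With these rows, the two-column key reports both suites as duplicated, while the three-column key reports none.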

This strategy is less helpful for existing subaddress records. Our tentative proposal for existing ranged addresses is to:

  1. Identify ranged addresses whose subaddress records share a duplicate address point id and suite.
  2. Delete only these duplicate-afflicted ranged address subaddress records.
  3. Add these records from the melissa delivery into CSCL subaddress with new sub_address_ids and a unique combination of address point, suite, and house number.
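
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: rows are in-memory dictionaries, the inputs are already restricted to ranged address points, and `plan_replacement` and all field names except `sub_address_id` are hypothetical. How new sub_address_ids get assigned is not specified in this issue, so the sketch leaves that out.

```python
from collections import defaultdict


def plan_replacement(existing, melissa):
    """Return (ids to delete, melissa rows to add) for ranged address points.

    existing: dicts with sub_address_id, address_point_id, suite (assumed shape)
    melissa:  dicts with address_point_id, suite, house_number (assumed shape)
    """
    # Step 1: find (address point id, suite) pairs shared by more than
    # one existing subaddress record.
    groups = defaultdict(list)
    for row in existing:
        key = (row["address_point_id"], row["suite"])
        groups[key].append(row["sub_address_id"])
    afflicted = {key for key, ids in groups.items() if len(ids) > 1}

    # Step 2: only the duplicate-afflicted records are marked for deletion.
    delete_ids = sorted(i for key in afflicted for i in groups[key])

    # Step 3: take replacements from the melissa delivery, keeping one row
    # per (address point id, suite, house number) combination.
    seen = set()
    additions = []
    for row in melissa:
        pair = (row["address_point_id"], row["suite"])
        full_key = pair + (row["house_number"],)
        if pair in afflicted and full_key not in seen:
            seen.add(full_key)
            additions.append(dict(row))  # new sub_address_id assigned elsewhere
    return delete_ids, additions
```

Records whose address point id and suite are already unique are left alone, which matches the "delete only these" wording in step 2.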

mattyschell commented 3 years ago

The term to use is "Ranged Address Point", not "ranged address".

mattyschell commented 3 years ago

Reviewing the metadata:

https://github.com/CityOfNewYork/nyc-geo-metadata/blob/master/Metadata/Metadata_AddressPoint.md#3-attribute-information

It looks like "hyphen type" on address point indicates whether or not an address point is a ranged address point. Are all of the following "hyphen type" values ranged address point types that should fall under the proposal in this issue?

| Value | Meaning | Count | Count with Subaddresses | Sample House Number | Sample House Number Range |
|---|---|---|---|---|---|
| Q | Queens Type | 338,129 | 60,314 | 69-023 | null |
| R | Building Range | 16,326 | 3,024 | 251 | 253 |
| U | Unit | 627 | 20 | 10-123 | null |
| X | Range of Queens Style | 2,284 | 288 | 150-012 | 150-014 |

mattyschell commented 3 years ago

The experts tell me that only R and X indicate ranged address points. The populated "House Number Range" column, with range there in the column name, is an important clue.
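
As a small sketch of that rule in Python: the tuples below reuse the sample values from the table earlier in this thread, and `RANGED_HYPHEN_TYPES` and `is_ranged_address_point` are hypothetical names, not part of CSCL.

```python
# Only these hyphen type codes indicate ranged address points,
# per the answer above.
RANGED_HYPHEN_TYPES = frozenset({"R", "X"})


def is_ranged_address_point(hyphen_type):
    """True when the hyphen type code marks a ranged address point."""
    return hyphen_type in RANGED_HYPHEN_TYPES


# (hyphen type, sample house number, sample house number range)
samples = [
    ("Q", "69-023", None),
    ("R", "251", "253"),
    ("U", "10-123", None),
    ("X", "150-012", "150-014"),
]

ranged = [s for s in samples if is_ranged_address_point(s[0])]

# Cross-check the clue: which sample rows have a populated range value?
range_populated = [s[0] for s in samples if s[2] is not None]
```

On these four sample rows, the hyphen type test and the populated house number range test select the same two types, R and X.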

Strategy proposal: Allow the processing of this repository to start with a pre-populated subaddress_delete list of subaddress IDs. Manually populate this list at any time with subaddress IDs that we wish to replace. We will delete these IDs from the input subaddress records and also pass these IDs through to the output so that they will be deleted in the target CSCL.

The strategy should work for any process where, for some reason, we wish to replace subaddress records with new values.
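
A minimal sketch of the delete-list idea, assuming in-memory rows keyed by `sub_address_id`. The function name `apply_delete_list` and the row shape are hypothetical, and the real processing in this repository may work against database tables instead.

```python
def apply_delete_list(input_subaddresses, subaddress_delete):
    """Drop listed IDs from the input and return them for pass-through.

    input_subaddresses: dicts that each carry a sub_address_id (assumed shape)
    subaddress_delete:  manually populated iterable of subaddress IDs
    """
    delete_ids = set(subaddress_delete)

    # Remove the listed records so later matching treats them as absent.
    kept = [
        row for row in input_subaddresses
        if row["sub_address_id"] not in delete_ids
    ]

    # Every listed ID is passed through to the output so that it is
    # also deleted in the target CSCL.
    return kept, sorted(delete_ids)


kept_rows, output_deletes = apply_delete_list(
    [{"sub_address_id": 10}, {"sub_address_id": 11}, {"sub_address_id": 12}],
    [11, 12],
)
```

Because the list is just a set of IDs, the same step would serve any other case where subaddress records need to be replaced, not only ranged address points.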