mattyschell / cscl-subaddress-matched


Determine strategy for duplicates of hyphen type Queens #10

Closed mattyschell closed 3 years ago

mattyschell commented 3 years ago

In issue #4 we force-replaced all subaddresses associated with ranged addresses. Our motivation is to uniquely ID all subaddress records using (sub_address_id, melissa_suite, usps_hnum).
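In constraint terms, the goal is roughly the unique key sketched below. This is a sketch only: the constraint name is made up, the columns follow the duplicate checks later in this thread (ap_id, melissa_suite, usps_hnum), and treatment of rows with a null usps_hnum varies by database, so it can't simply be enabled while duplicates remain.

alter table subaddress
    add constraint subaddress_business_key_uq   -- illustrative name only
    unique (ap_id, melissa_suite, usps_hnum)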

However, duplicate subaddress rows remain. They are associated with address points of hyphen type "Queens." What should we do with these remaining duplicates?

mattyschell commented 3 years ago

There's also some wackiness like this, which is just bad data that should be cleaned up. These duplicates are unrelated to the Queens-type address points described in this issue and can't be resolved using this repository.

select count(*) from subaddress where ap_id = 1011588 and melissa_suite = 'BLDG 110' and usps_hnum is null

600

lblastinger commented 3 years ago

At present, I believe there are about 646k records that have USPS_HNUM values because they were imported from the MELISSA Update. The remaining 2.4M have a null USPS_HNUM. For the records that do not have a 1-to-1 relationship between MELISSA and SUBADDRESS on Address Point + Suite, we anticipate that a ranged address point is involved that happens to carry the same Suite value for multiple addresses.

For example: ADDRESSPOINT.ADDRESSPOINTID = 12345 has ADDRESSPOINT.HYPHEN_TYPE = R and ADDRESSPOINT.HOUSE_NUMBER = "111-115".

The MELISSA table, on the other hand, has two records that refer to MELISSA.ADDRESSPOINTID = 12345 with MELISSA.SUITE = "APT A" (but MELISSA.HNUM = 111 for the first record and MELISSA.HNUM = 113 for the second).

These will have to be resolved manually by counting the number of relevant joins to each address point from the MELISSA.ADDRESSPOINTID and MELISSA.SUITE combinations. Anything with more than 1 join will need the USPS_HNUM filled in manually for the relevant records. In the example above:

SUBADDRESS.SUBADDRESSID = 4041, SUBADDRESS.ADDRESSPOINTID = 12345, SUBADDRESS.SUITE = APT A, SUBADDRESS.USPS_HNUM = 111 (<- manually fill this in)

SUBADDRESS.SUBADDRESSID = 4042, SUBADDRESS.ADDRESSPOINTID = 12345, SUBADDRESS.SUITE = APT A, SUBADDRESS.USPS_HNUM = 113 (<- manually fill this in)
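A rough sketch of that counting step, assuming Oracle-style SQL and that the MELISSA data sits in a table named melissa with the columns referenced above (the table name and the kount alias are assumptions):

select 
    m.addresspointid
   ,m.suite
   ,count(*) as kount   -- number of MELISSA records joining to this address point + suite
from 
    melissa m
group by 
    m.addresspointid
   ,m.suite
having count(*) > 1      -- anything greater than 1 needs a manual USPS_HNUM fill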

Hope this helps!

mattyschell commented 3 years ago

Very helpful, thanks!

We may want to again take the approach of bulk deleting and replacing any remaining ranged address points. This would let us avoid any manual work, an important tradeoff given resource levels at the moment.

We'll discuss and keep you posted.

lblastinger commented 3 years ago

Thanks Matt.

Also, the Python coder in me starts thinking that a loop could collect the unique MELISSA.USPS_HNUM values into a list for each combination of identical MELISSA.ADDRESSPOINTID and MELISSA.SUITE values.

Those listed values could then be arbitrarily applied, in ascending order, to each corresponding SUBADDRESS.USPS_HNUM record that matches the Suite and AddressPointID values.

That could be an automatic way to do it that wouldn't need to distinguish between 1-to-1 and 1-to-many relationships.
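For what it's worth, the same idea could be sketched set-based in SQL rather than as a Python loop. This is a sketch only: it assumes a melissa table with the columns named above, the sub_address_id column name comes from earlier in this thread, and the ROW_NUMBER pairing is an assumption about how the ascending assignment would be expressed.

with melissa_ranked as (
    select 
        addresspointid
       ,suite
       ,hnum
       ,row_number() over (partition by addresspointid, suite
                           order by hnum) as rn
    from 
        (select distinct addresspointid, suite, hnum from melissa)
),
subaddress_ranked as (
    select 
        sub_address_id
       ,ap_id
       ,melissa_suite
       ,row_number() over (partition by ap_id, melissa_suite
                           order by sub_address_id) as rn
    from 
        subaddress
)
select 
    s.sub_address_id
   ,m.hnum as proposed_usps_hnum   -- candidate value to place in SUBADDRESS.USPS_HNUM
from 
    subaddress_ranked s
join 
    melissa_ranked m
on 
    m.addresspointid = s.ap_id
and m.suite          = s.melissa_suite
and m.rn             = s.rn

As you say, the pairing is arbitrary, so this only guarantees uniqueness, not correctness.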

mattyschell commented 3 years ago

Some data points:

create table subaddress_dupes 
as 
select 
    ap_id as ap_id
   ,melissa_suite as melissa_suite
   ,usps_hnum as usps_hnum
   ,count(*) as kount
from 
    subaddress
group by 
    ap_id
   ,melissa_suite
   ,usps_hnum
having count(*) > 1

select a.hyphen_type
      ,count(a.hyphen_type)
from 
    addresspoint a
join 
   subaddress_dupes b
on 
    a.addresspointid = b.ap_id
group by 
    a.hyphen_type

HYPHEN_TYPE  COUNT(A.HYPHEN_TYPE)
N            10607
Q            4097
mattyschell commented 3 years ago

select a.hyphen_type
      ,count(a.hyphen_type)
from 
    addresspoint a
join 
   subaddress_dupes_q b
on 
    a.addresspointid = b.ap_id
group by 
    a.hyphen_type

HYPHEN_TYPE  COUNT(A.HYPHEN_TYPE)
Q            4097
mattyschell commented 3 years ago

We bulk deleted and replaced all subaddresses associated with hyphen type Q. We also deleted all duplicate subaddresses of type N.
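A rough sketch of what that deletion step could look like (illustrative only, assuming Oracle-style SQL; the replacement load and the hyphen type N duplicate cleanup are not shown):

delete from subaddress s
where exists (
    select 1 
    from addresspoint a
    where a.addresspointid = s.ap_id
    and   a.hyphen_type    = 'Q'
)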

This issue is only partially related to this repository: we expect to uniquely ID records with the composite business key (address point, suite, house number), and these duplicates violated that constraint.