google-code-export / sandy-disaster-recovery

Automatically exported from code.google.com/p/sandy-disaster-recovery

Duplicate Detection and Non-Destructive Merging #83

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
We need to detect duplicates and give people an opportunity to merge.  This is 
a very very high priority.  I was two-thirds done writing use cases explaining 
how to handle various cases and exceptions.  And then my computer rebooted 
itself and I lost everything. 

But I need something in here as a placeholder so here it is.

Original issue reported on code.google.com by v...@aarontitus.net on 15 Nov 2012 at 7:54

GoogleCodeExporter commented 9 years ago

Original comment by v...@aarontitus.net on 18 Nov 2012 at 5:31

GoogleCodeExporter commented 9 years ago
Merging happens at two stages:
1. Entry: A volunteer conducts an assessment, and the client already has a work 
order in the system. The system should detect the existing work order, and give 
an opportunity to use the existing work order instead of creating a new work 
order.
2. Audit: A volunteer finds two work orders that are obviously duplicates, and 
wants to merge them without destroying any relevant information.

Merge Feature- for duplicate work orders already in the system, combine and 
keep both work order numbers.  If not the same, then consider changing the 
lat/long by 100 feet so that the icons aren’t on top of one another.  Just an 
idea- not required.

Original comment by v...@aarontitus.net on 14 Jan 2013 at 3:02
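
One way the audit-stage merge described above could stay non-destructive is sketched below. This is only an illustration, assuming work orders are plain dictionaries with a `case_number` key; the function name `merge_work_orders` and the `merged_case_numbers` and `conflicts` keys are hypothetical and not taken from the codebase.

```python
def merge_work_orders(primary, duplicate):
    """Merge `duplicate` into a copy of `primary` without discarding data.

    Hypothetical record layout: plain dicts with a "case_number" key.
    """
    merged = dict(primary)
    # Keep both work order numbers, as the merge feature asks.
    merged["merged_case_numbers"] = [primary["case_number"], duplicate["case_number"]]
    conflicts = {}
    for field, value in duplicate.items():
        if field == "case_number" or value in (None, ""):
            continue
        current = merged.get(field)
        if current in (None, ""):
            # Fill gaps in the primary record from the duplicate.
            merged[field] = value
        elif current != value:
            # Never overwrite: keep the differing value for a person to review.
            conflicts[field] = value
    merged["conflicts"] = conflicts
    return merged
```

Neither input is modified, so the original records remain available if the merge turns out to be wrong.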

GoogleCodeExporter commented 9 years ago
The duplicate detection engine should also work during the Import process 
(Issue 133).

Original comment by v...@aarontitus.net on 18 Mar 2013 at 12:19

GoogleCodeExporter commented 9 years ago
For ENTRY duplicate detection, I think there may be a lightweight pre-duplicate 
detection validation we could implement by applying the same AJAX we use in the 
map search feature (Issue 63).  For example, if I type, "Scot" in the name, 
address, city or phone number, then the following results may automatically 
appear under the entry.  By selecting one of them, I would be taken directly to 
the "Edit" page:
<Scott Smith, 123 Main Street, Union City NJ 01234>
<Jim Jones, 476 Scotch Plains Ave, Rockaway, NY 12345>
<Fria Lima, 9377 Home St, Scottsdale, CT 23905>

This would require that the map AJAX be loaded each time someone enters an 
assessment form.  We may want to find a way to cache that once per session for 
the assessment form.  But it seems like a quick way to minimize entry 
duplicates.

NOTE: This is NOT system-driven duplicate detection, and this process will not 
detect imported duplicates.

Original comment by v...@aarontitus.net on 18 Mar 2013 at 12:30
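
The lookup behind such entry suggestions could be as simple as a case-insensitive substring match over the four fields mentioned. The sketch below assumes the cached search terms are available as a list of dictionaries; `suggest_matches` and the field names are hypothetical, and the real map search may work differently.

```python
def suggest_matches(query, sites, limit=5):
    """Return sites whose name, address, city or phone contains `query`.

    Case-insensitive substring match; `sites` is a list of dicts
    (an assumed shape, standing in for the cached search terms).
    """
    needle = query.strip().lower()
    if not needle:
        return []
    fields = ("name", "address", "city", "phone")
    hits = []
    for site in sites:
        if any(needle in str(site.get(f, "")).lower() for f in fields):
            hits.append(site)
            if len(hits) == limit:
                break
    return hits
```

With the three example records above, typing "Scot" would return all of them, because it matches a name, a street and a city respectively.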

GoogleCodeExporter commented 9 years ago
Agreed - this is possible if the site data 'terms' used in the search 
autocomplete are made available on the form.

The search terms can be cached to local storage in a cross-browser-compatible way 
and then queried as you suggest. I will try this.

Original comment by cpw...@gmail.com on 18 Mar 2013 at 4:52

GoogleCodeExporter commented 9 years ago
A possible behaviour for #4 is now live on the testbed (v155).

(There is no caching to local storage - the map page doesn't do any itself - 
added as issue 219).

Original comment by cpw...@gmail.com on 18 Mar 2013 at 11:12

GoogleCodeExporter commented 9 years ago
I say "possible" because I can think of a few different ways of doing this, but 
it's not obvious what would be best wrt UI - so see what you think.

Original comment by cpw...@gmail.com on 18 Mar 2013 at 11:13

GoogleCodeExporter commented 9 years ago
Having played with #4 more, the behaviour of filling fields is annoying at 
best! It needs to get out of the way if the user isn't interested.

I will disable it and try to improve it.

Original comment by cpw...@gmail.com on 19 Mar 2013 at 3:40

GoogleCodeExporter commented 9 years ago
Backend-powered duplicate detection is now live on the testbed (v156), using 
double metaphones [1], for the form only.

Try entering misspelled(ish) names and addresses, and also note the new "Ignore 
similar matches" checkbox.

Not currently integrated with importing CSVs.

[1] http://en.wikipedia.org/wiki/Metaphone

Original comment by cpw...@gmail.com on 19 Mar 2013 at 5:51

GoogleCodeExporter commented 9 years ago
Entry suggestions (#4) are temporarily disabled - split out to issue 220 to 
keep this one about the backend.

Original comment by cpw...@gmail.com on 19 Mar 2013 at 5:53

GoogleCodeExporter commented 9 years ago

Original comment by cpw...@gmail.com on 19 Mar 2013 at 5:54

GoogleCodeExporter commented 9 years ago
Similarity matching algorithm as of v156 (from doc of find_similar()):

    Two sites are similar if at least one of:
    (i) Their normalised phone numbers are the same.
    (ii) Their name and address metaphones and digits in address all match.

Note: current similarity algorithm *does not* check city field due to use of 
synonyms in the field (e.g. "New York" vs "New York City" vs "NYC")

Original comment by cpw...@gmail.com on 19 Mar 2013 at 6:24
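
The v156 rule can be sketched as follows. This is not the actual find_similar() code: `SiteKeys`, `normalise_phone` and `is_similar_v156` are hypothetical names, the metaphone keys are assumed to be precomputed elsewhere, and treating an empty phone number as "no match" is an assumption.

```python
import re
from dataclasses import dataclass


def normalise_phone(raw):
    """Keep digits only, so "(201) 555-0100" and "201.555.0100" compare equal."""
    return re.sub(r"\D", "", raw or "")


@dataclass
class SiteKeys:
    # Hypothetical container for the precomputed comparison keys.
    name_metaphone: str
    address_metaphone: str
    address_digits: str
    phone_normalised: str


def is_similar_v156(a, b):
    """Rule (i) OR rule (ii) from the v156 description; city is not compared."""
    # Assumption: two missing phone numbers do not count as a match.
    same_phone = bool(a.phone_normalised) and a.phone_normalised == b.phone_normalised
    same_name_and_address = (
        a.name_metaphone == b.name_metaphone
        and a.address_metaphone == b.address_metaphone
        and a.address_digits == b.address_digits
    )
    return same_phone or same_name_and_address
```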

GoogleCodeExporter commented 9 years ago
Duplicate detection during importing CSVs is live on the testbed (v156).

Login as admin and see e.g. 
https://sandy-helping-hands.appspot.com/admin-import-csv/active?id=280002

Original comment by cpw...@gmail.com on 19 Mar 2013 at 6:39

GoogleCodeExporter commented 9 years ago
Can two sites be similar if their lat/lon combinations are within ~25 feet of 
one another?  How hard would that be to do?

Original comment by v...@aarontitus.net on 19 Mar 2013 at 11:00

GoogleCodeExporter commented 9 years ago
Proximity searches as a kind of geospatial query are possible, e.g. as per 
https://developers.google.com/maps/articles/geospatial

The hack way of doing it is to quantize the co-ordinates to a grid and search 
on grid co-ordinates, which is strictly less accurate.

Either way, how should it fit in with the algorithm? Should it take the place 
of the address-related similarity matching in the algorithm in #12?

Original comment by cpw...@gmail.com on 20 Mar 2013 at 10:15
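
The grid quantization idea can be illustrated with a short sketch. The names `grid_cell` and `neighbouring_cells`, the 8 metre cell size and the reference latitude are all assumptions made for the example, not details from the codebase.

```python
import math

METRES_PER_DEGREE = 111_320.0  # approximate length of one degree of latitude


def grid_cell(lat, lon, cell_metres=8.0, ref_lat=40.5):
    """Quantize a coordinate to integer (row, col) grid indices.

    `ref_lat` fixes the east-west cell width; 40.5 is an assumed value for the
    New York / New Jersey area, so cells are only roughly square near it.
    """
    lat_step = cell_metres / METRES_PER_DEGREE
    lon_step = cell_metres / (METRES_PER_DEGREE * math.cos(math.radians(ref_lat)))
    return (math.floor(lat / lat_step), math.floor(lon / lon_step))


def neighbouring_cells(cell):
    """The cell itself plus its eight neighbours."""
    row, col = cell
    return [(row + dr, col + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
```

Searching the neighbouring cells as well as the site's own cell catches pairs that straddle a cell boundary. Sites found this way are only candidates: they can be well over one cell width apart, so an exact distance check would still be needed, which is where the loss of accuracy comes from.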

GoogleCodeExporter commented 9 years ago
It seems to me that geospatial proximity searches will be the most accurate.  
Sometimes you might have a case worker as the primary contact, or a neighbor 
with a working phone.

I think that the geospatial query should be first in the algorithm, and that 
address-related similarity should be secondary. I still would like the address 
and phone number check, just in case Google geocodes incorrectly.

Does that answer your question?

Original comment by v...@aarontitus.net on 20 Mar 2013 at 3:59

GoogleCodeExporter commented 9 years ago
Yes, but I think using geocoded co-ords will be all-or-nothing - almost every 
address geocodes to somewhere (and if it doesn't, the work order gets 
rejected in the import csv case). So there's no need to consider the metaphonic 
similarity of address if it doesn't geocode.

Regarding similar names, maybe drop/ignore this? What are the cases?

Original comment by cpw...@gmail.com on 20 Mar 2013 at 11:03

GoogleCodeExporter commented 9 years ago
Yes, while virtually all addresses will geocode, my experience has been that 
they will not always geocode correctly.  This is often due to user error.  
There have been several instances when the geocoordinates were set to 0,0 (no 
idea how this happened- assume User error).  There are other times when the 
user will enter information incorrectly, e.g. "Rolly Rd." instead of "Raleigh 
Rd."  In these instances, Google will make an educated guess.  Most of the time 
Google guesses right. Sometimes it says, "I don't know," but sometimes it 
guesses wrong, and the user blindly accepts Google's guess.
Keeping the name and phone number check as secondary reminders would be very 
helpful, especially when information is entered by a live person.

Original comment by v...@aarontitus.net on 20 Mar 2013 at 11:54

GoogleCodeExporter commented 9 years ago
New algorithm, using geospatial proximity search, is:

    Two sites are similar if at least one of:
    (i) Their geocoded co-ords are within 8 metres of each other.
    (ii) Their name metaphone and normalised phone numbers match.

(geo index under batch construction now)

Original comment by cpw...@gmail.com on 25 Mar 2013 at 3:59
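
For reference, rule (i) amounts to a distance test between two latitude/longitude pairs. The sketch below uses the standard haversine formula; the production code relies on a geo index instead, so `distance_metres` and `within_radius` are hypothetical helpers that only show the arithmetic.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def distance_metres(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, by the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def within_radius(site_a, site_b, radius_m=8.0):
    """True if two (lat, lon) pairs are no more than `radius_m` apart."""
    return distance_metres(*site_a, *site_b) <= radius_m
```

At this scale a flat-earth approximation would give practically the same answer; haversine is used here only because it is well known and has no special cases.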

GoogleCodeExporter commented 9 years ago
Live now on production - version 159.

Original comment by cpw...@gmail.com on 25 Mar 2013 at 4:08

GoogleCodeExporter commented 9 years ago
Cool.  Comments:
* Change the error message to, "It looks like [Name] ([Address]) is already in 
the system as [Work Order Number]. Is it a duplicate? [link: When clicked, an 
InfoBox appears]View[/link].
Yes: [link]Edit existing record[/link].
No: This is not a duplicate. [link]Continue[/link]."

Original comment by v...@aarontitus.net on 25 Mar 2013 at 4:24

GoogleCodeExporter commented 9 years ago
Also, I think 8 meters is too much. In tightly-packed homes (e.g. on Breezy 
Point), some lots aren't much more than about 5 meters across. I'd like to bump 
the sensitivity down to about 4 meters, if that's OK.

Original comment by v...@aarontitus.net on 25 Mar 2013 at 4:27

GoogleCodeExporter commented 9 years ago
Unexpected bug: In the infobox, the following items appear:
City metaphone: NPTN-. Name metaphone: PPTSNKTS. Address metaphone: RFRFKRT. 
Phone normalised: 7328046041.

Is there any way we can hide that information from the InfoBox?

Original comment by v...@aarontitus.net on 25 Mar 2013 at 4:29

GoogleCodeExporter commented 9 years ago
At present, Infoboxes are only available for display on the map view, not the 
form view.

As a stop gap, how about opening a new tab on click of the error message?

Original comment by cpw...@gmail.com on 25 Mar 2013 at 4:53

GoogleCodeExporter commented 9 years ago
Sure. That's an OK stop gap for the Infobox issue.
I worry about someone who has interviewed a person over the phone for 8 
minutes, clicks enter, only to find that it's a duplicate, then when they edit, 
all of their notes from the past 8 minutes are lost.

At least opening up the view/ edit would allow them to save their work.

Original comment by v...@aarontitus.net on 25 Mar 2013 at 4:59

GoogleCodeExporter commented 9 years ago
Understood re losing data.

New warning message, 4 metre radius, infobox alteration now live on testbed.

Original comment by cpw...@gmail.com on 25 Mar 2013 at 5:12

GoogleCodeExporter commented 9 years ago
On sandbox, please check A3423 and A1861.  They are duplicates, but were not 
flagged as duplicates.
(107 Wakewood Rd)
Any ideas?

Original comment by v...@aarontitus.net on 25 Mar 2013 at 5:33

GoogleCodeExporter commented 9 years ago
The saved co-ordinates are different, so there's 1.5 km between them:

https://sandy-helping-hands.appspot.com/edit?case=A1861
https://sandy-helping-hands.appspot.com/edit?case=A3423

=> 

http://www.wolframalpha.com/input/?i=distance+between+%2840.2091219%2C+-74.0386271%29+and+%2840.196577%2C+-74.044195%29&a=*C.distance-_*GeoQueryType-

Original comment by cpw...@gmail.com on 25 Mar 2013 at 5:47

GoogleCodeExporter commented 9 years ago
A1861 could have used old geocoding data, at the point it was geocoded?

It geocodes to the same as A3423 when rerun now (by clicking into an address 
field).

Original comment by cpw...@gmail.com on 25 Mar 2013 at 5:50

GoogleCodeExporter commented 9 years ago
That's strange; especially since the two appeared on the map at exactly the 
same location.  Perhaps the system does not actually use the lat/lon fields to 
map? Perhaps the Google Maps API just reads the address de novo each time.  I 
dunno... it's a strange behavior.

Here's something else interesting-  I allowed Google maps to re-geocode A1861.  
As expected, it geocoded the same as A3423.  I saved it, and it did not flag as 
a duplicate (probably because I am editing, rather than creating a new record).

It would be good to check for duplicates on save, as well. Allow the save to 
occur, and provide links to each of the duplicates.  Then you can (optionally) 
go edit them and change the status to "Closed, duplicate"

It's a good way to clean up the database as we move forward.

Original comment by v...@aarontitus.net on 25 Mar 2013 at 6:28

GoogleCodeExporter commented 9 years ago
I see - it's usefully different behaviour. Ok, will make a change to edit-saves.

Original comment by cpw...@gmail.com on 25 Mar 2013 at 7:04

GoogleCodeExporter commented 9 years ago
Currently, after saving an edit, the map page is shown with the infobox of 
the work order just saved.

Three options then:
(i) Show suspected duplicates on this infobox,
(ii) Show suspected duplicates on *all* infoboxes (simpler; also more 
powerful?),
(iii) introduce new UI

Original comment by cpw...@gmail.com on 25 Mar 2013 at 7:16

GoogleCodeExporter commented 9 years ago
(i) seems a little crowded, but theoretically possible.
(ii) I can't exactly visualize how this would work, but I like simpler and more 
powerful.
(iii) Seems unnecessary.

Original comment by v...@aarontitus.net on 25 Mar 2013 at 7:33

GoogleCodeExporter commented 9 years ago
Changed "To ignore this warning, select the checkbox at the bottom of this 
form." -> "Ignore & Save" button.

Deployed to testbed.

Original comment by cpw...@gmail.com on 25 Mar 2013 at 8:01

GoogleCodeExporter commented 9 years ago
Much better. I like the Ignore & Save button.

Original comment by v...@aarontitus.net on 25 Mar 2013 at 8:13

GoogleCodeExporter commented 9 years ago
Important caveat to Comment 19 algorithm:

If {the address line (not State) contains any one of the following 
case-insensitive complete words (e.g. having a line break or space before and after): 
["#", "Suite", "Ste", "Apartment", "Apt", "Unit", "Department", "Dept", "Room", 
"Rm", "Floor", "Fl", "Bldg", "Building", "Basement", "Bsmt", "Front", "Frnt", 
"Lobby", "Lbby", "Lot", "Lower", "Lowr", "Office", "Ofc", "Penthouse", "Pent", 
"PH", "Rear", "Side", "Slip", "Space", "Trailer", "Trlr", "Upper", "Uppr"]
Then assume that it's an apartment and therefore {Two sites are similar if 
{Their name metaphone and normalised phone numbers match.}}

Else {Two sites are similar if at least one of:(
    (i) Their geocoded co-ords are within 8 metres of each other.
    (ii) Their name metaphone and normalised phone numbers match.)}

Let me know if this makes sense.

Original comment by v...@aarontitus.net on 26 Mar 2013 at 1:27
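
The keyword test proposed above could look roughly like this. The word list is copied from the comment; the name `implies_apartment` is hypothetical, and stripping a trailing "." or "," from each word (so that "Apt." counts as "Apt") is an assumption that goes slightly beyond what was asked.

```python
UNIT_WORDS = frozenset(w.lower() for w in [
    "#", "Suite", "Ste", "Apartment", "Apt", "Unit", "Department", "Dept",
    "Room", "Rm", "Floor", "Fl", "Bldg", "Building", "Basement", "Bsmt",
    "Front", "Frnt", "Lobby", "Lbby", "Lot", "Lower", "Lowr", "Office", "Ofc",
    "Penthouse", "Pent", "PH", "Rear", "Side", "Slip", "Space", "Trailer",
    "Trlr", "Upper", "Uppr",
])


def implies_apartment(address_line):
    """True if any whitespace-separated word is one of the unit keywords."""
    for token in address_line.split():
        # Treat "Apt." and "Apt," like "Apt"; this stripping is an assumption.
        if token.strip(".,").lower() in UNIT_WORDS:
            return True
    return False
```

Two consequences of matching complete words only: "#4" written without a space is not detected, and a street name such as "Front St" is flagged as an apartment.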

GoogleCodeExporter commented 9 years ago
If an address value implies an apartment, don't check the geocoded co-ords - 
ignore the address and look at name and phone number only?

Original comment by cpw...@gmail.com on 26 Mar 2013 at 11:27

GoogleCodeExporter commented 9 years ago
Correct.  On further reflection, 1. Ignore geocoordinates. 2. Compare address 
text for exact match (if easy). 3. Look at name and phone.

Is #2 feasible/ easy? If not, don't worry about it; stick with 1 and 3.

Original comment by v...@aarontitus.net on 26 Mar 2013 at 2:05

GoogleCodeExporter commented 9 years ago
Address comparison is already in the codebase (from the previous design) but 
disabled.

Comparing is done like the name comparison plus a check of the digits - e.g. 
"15 park ave" is the same as "15 parcav" but not the same as "16 park ave".

Would exact match be better? I'm not sure of the cases.

Original comment by cpw...@gmail.com on 26 Mar 2013 at 2:42
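
To make the "sounds plus digits" comparison concrete, here is a rough sketch. The codebase uses double metaphones; `crude_sound_key` below is NOT Double Metaphone, only a much simpler stand-in written for this example, and all three function names are hypothetical.

```python
import re


def crude_sound_key(text):
    """Very rough phonetic key. NOT Double Metaphone; a stand-in for illustration.

    Keeps letters only, folds C and Q to K, drops vowels (and Y), then
    collapses adjacent repeated consonants.
    """
    letters = re.sub(r"[^a-z]", "", text.lower())
    letters = letters.replace("c", "k").replace("q", "k")
    key = []
    for ch in letters:
        if ch in "aeiouy":
            continue
        if key and key[-1] == ch:
            continue
        key.append(ch)
    return "".join(key).upper()


def address_digits(text):
    """All digits in the address, in order, e.g. "15 Park Ave" -> "15"."""
    return re.sub(r"\D", "", text)


def addresses_match(a, b, sound_key=crude_sound_key):
    """Same sound key and same digits, mirroring the comparison described above."""
    return sound_key(a) == sound_key(b) and address_digits(a) == address_digits(b)
```

Passing a real Double Metaphone function as `sound_key` would bring this closer to what the thread describes; the digit check is what keeps "15 park ave" and "16 park ave" apart.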

GoogleCodeExporter commented 9 years ago
I think comparing the address sounds ("park ave" and "parcav") plus the digits 
makes the most sense.
If that's straightforward, we can do that and close this issue.

Original comment by v...@aarontitus.net on 26 Mar 2013 at 4:01

GoogleCodeExporter commented 9 years ago
This change is being deployed now. The new algorithm is:

    Two sites are similar if at least one of:
    (i) The addresses imply an apartment and the addresses and names have
        matching metaphones.
    (ii) The addresses do not imply an apartment and the geocoded co-ords
         are within 4 metres of each other.
    (iii) Their name metaphone and normalised phone numbers match.

Original comment by cpw...@gmail.com on 2 Apr 2013 at 2:12
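
Put together, the three rules might be expressed as below. This is a sketch under several assumptions rather than the deployed code: the `Site` fields are hypothetical and assumed to be precomputed, rule (i) is read as requiring both addresses to imply an apartment, and an empty phone number never counts as a match.

```python
import math
from dataclasses import dataclass


@dataclass
class Site:
    # Hypothetical record; the key fields are assumed to be precomputed.
    name_key: str       # name metaphone
    address_key: str    # address metaphone plus digits
    phone: str          # normalised, digits only
    lat: float
    lon: float
    is_apartment: bool  # address contains a unit keyword


def metres_between(a, b):
    """Haversine great-circle distance between two sites."""
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(b.lon - a.lon) / 2) ** 2)
    return 2 * 6_371_000.0 * math.asin(math.sqrt(h))


def is_similar(a, b, radius_m=4.0):
    same_name = a.name_key == b.name_key
    if a.is_apartment and b.is_apartment:
        # (i) apartments: compare address and name, not co-ordinates
        if same_name and a.address_key == b.address_key:
            return True
    elif not a.is_apartment and not b.is_apartment:
        # (ii) non-apartments: geocoded co-ords within the radius
        if metres_between(a, b) <= radius_m:
            return True
    # (iii) name and phone match, whatever the address says
    return same_name and bool(a.phone) and a.phone == b.phone
```

When only one of the two addresses implies an apartment, this sketch falls through to rule (iii) alone; the thread does not say how that mixed case should be handled.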

GoogleCodeExporter commented 9 years ago
Deployed to live/production.

Original comment by cpw...@gmail.com on 2 Apr 2013 at 2:41

GoogleCodeExporter commented 9 years ago
Awesome!  Thanks!

Original comment by v...@aarontitus.net on 2 Apr 2013 at 2:56

GoogleCodeExporter commented 9 years ago
If this is done, go ahead and mark it completed.  I'm pretty sure it's done.

Original comment by v...@aarontitus.net on 13 Apr 2013 at 12:31