leerssej / google-refine

Automatically exported from code.google.com/p/google-refine

Compare two lists of companies to find duplicates #211

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Enhancement request:

A common problem when importing a list of companies into a database is that 
some companies are rejected by a duplicate-detection algorithm, while others 
are let through because their names are close to an existing entry but not an 
exact duplicate.

It would be nice to be able to compare two lists of company names (1 - master 
list from the database, 2 - list to be imported) and show the potential 
duplicates based on more criteria.

Types of common duplicates are: 
- Not having the same suffix (Inc., Ltd, etc.)
- Incorrect spelling (similar to the Soundex algorithm, but probably taking 
into account longer names)
- Some of the company names are abbreviated and others are not (e.g. IBM, 
I.B.M., International Business Machines). I think Refine can probably already 
do this.
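
The first two duplicate types listed above (differing suffixes and small spelling errors) can be illustrated outside Refine with a short Python sketch. This is not part of Refine and not a proposed implementation: the suffix list, the function names `normalize` and `potential_duplicates`, and the 0.85 threshold are all assumptions chosen for illustration, and the similarity measure is the standard library's `difflib.SequenceMatcher` rather than Soundex.

```python
import re
from difflib import SequenceMatcher

# Hypothetical suffix list for illustration; extend it for real data.
SUFFIXES = {"inc", "ltd", "llc", "corp", "co", "plc", "gmbh"}


def normalize(name):
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    # Remove periods first so "I.B.M." becomes "ibm" and "Inc." becomes "inc".
    cleaned = name.lower().replace(".", "")
    tokens = re.sub(r"[^a-z0-9]+", " ", cleaned).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def potential_duplicates(master, incoming, threshold=0.85):
    """Return (incoming, master, score) triples whose normalized names are similar."""
    matches = []
    for new_name in incoming:
        for known in master:
            score = SequenceMatcher(None, normalize(new_name), normalize(known)).ratio()
            if score >= threshold:  # threshold is an assumed cut-off, tune per data set
                matches.append((new_name, known, round(score, 2)))
    return matches
```

Calling `potential_duplicates(master_names, import_names)` compares every pair, so it is only practical for small lists. It also does not link an abbreviation such as IBM to its spelled-out form; that case needs a separate step.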

Thanks,

Rob.

Original issue reported on code.google.com by robertg...@gmail.com on 15 Nov 2010 at 3:14

GoogleCodeExporter commented 8 years ago
This might be achievable by having the master list in a separate Refine project 
from the import list, and using the Cross function 
http://code.google.com/p/google-refine/wiki/GRELOtherFunctions
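
As a rough illustration of the idea, the following Python sketch mimics an exact-match lookup from the import list into the master list, which is my understanding of what the Cross function does between two projects. The names `build_index` and `cross` and the row layout (dictionaries with a `name` key) are assumptions for the example, not Refine's API.

```python
from collections import defaultdict


def build_index(master_rows, key):
    """Map each key value in the master list to the rows that carry it."""
    index = defaultdict(list)
    for row in master_rows:
        index[row[key]].append(row)
    return index


def cross(value, index):
    """Return the master rows whose key equals value (empty list if none)."""
    return index.get(value, [])
```

For example, after `index = build_index(master_rows, "name")`, a call to `cross("Acme", index)` returns every master row named exactly "Acme". Because the match is exact, near-duplicates are only found if both lists are normalized the same way first.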

Anything more complex would require a reconciliation service between projects; 
see issue 176 http://code.google.com/p/google-refine/issues/detail?id=176

Original comment by iainsproat on 15 Nov 2010 at 3:21

GoogleCodeExporter commented 8 years ago
Also, a quick way to generate an abbreviation is

forEach(value.replace(/\W/, " ").replace(/\s+/, " ").split(" "), v, 
v[0].toUppercase()).join(".")
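
For readers who want to try the same idea outside Refine, here is a rough Python equivalent of the expression above. It is a sketch under the assumption that the intent is "first letter of each word, uppercased, joined with periods"; the function name `abbreviate` is made up for the example.

```python
import re


def abbreviate(name):
    """Join the uppercased first letter of each word with periods."""
    # Treat every run of non-word characters as a single separator.
    words = re.sub(r"\W+", " ", name).split()
    return ".".join(word[0].upper() for word in words)
```

With this sketch, `abbreviate("International Business Machines")` gives `I.B.M` (no trailing period), which can then be compared against a list of known abbreviations once their punctuation is stripped.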

Original comment by dfhu...@gmail.com on 15 Nov 2010 at 5:37

GoogleCodeExporter commented 8 years ago
Comparing entities from different sources to match records that may refer to 
the same entity is a branch of computer science with a 40-year research 
history, normally referred to as "record linkage".

GR is the easiest tool I have seen for doing record linkage, but even more 
advanced tools exist: 
http://en.wikipedia.org/wiki/Record_linkage#External_links

Original comment by haraldgr...@gmail.com on 22 Dec 2010 at 9:44