What is it?
Record linkage is the task of finding records that represent the same real-world entity. It is a crucial part of data integration, where data from heterogeneous sources is combined to provide a unified view. With the advent of big data, however, new challenges arise, which motivates the idea of big data integration: data coming from many sources (structured, semi-structured, and unstructured), together with the dynamic nature of those sources, makes the task considerably harder. This paper highlights trends in the field and the techniques used to handle big data with respect to one of its dimensions, namely volume, covering the state of the art as well as giving directions for future work.
How is it great compared to the related works?
To begin, traditional methods require a complete scan of the database to compute similarities between records. This amounts to a Cartesian product of similarity checks, which is O(n²). For big datasets this makes entity resolution expensive, exhaustive, and error-prone. A much better method partitions the data into blocks, where each block contains records that are likely to be similar. To achieve this, a blocking key is used, often created by concatenating prefixes of the chosen attributes. Blocking methods include Q-gram indexing, standard blocking, and the sorted neighborhood method, which various researchers point out as one of the best.

Next is the MapReduce programming model, a shared-nothing model that exploits parallelism across worker nodes. MapReduce is similar to the partitioning approach above, except that the data in each block is represented as key-value pairs: the Map function partitions the data into blocks, whereas the Reduce function collects and sorts the final output.

The last technique combines both ideas into a block-based MapReduce model. Here the Map function reads the input in parallel and separates it into smaller blocks of similar records, and the reducer then collects the blocks and compares the records within each one using one or more similarity measures, as the sketch below illustrates.
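To make the blocking idea concrete, here is a minimal Python sketch of key-based blocking in a map/reduce-style flow. The attribute names (first_name, last_name, zip), the prefix lengths, the token-set Jaccard similarity, and the 0.5 threshold are illustrative assumptions of mine, not details taken from the paper.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    # Blocking key: concatenation of prefixes of the chosen attributes
    # (illustrative attributes, not from the paper).
    return record["first_name"][:3].lower() + record["zip"][:2]

def map_phase(records):
    # "Map": emit (blocking_key, record) pairs so likely matches land
    # in the same block.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks

def reduce_phase(blocks, similarity, threshold):
    # "Reduce": compare records only within each block instead of over
    # the full Cartesian product of the dataset.
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if similarity(a, b) >= threshold:
                matches.append((a, b))
    return matches

def jaccard(a, b):
    # Token-set Jaccard similarity over all attribute values.
    ta = set(" ".join(a.values()).lower().split())
    tb = set(" ".join(b.values()).lower().split())
    return len(ta & tb) / len(ta | tb)

records = [
    {"first_name": "jonathan", "last_name": "smith", "zip": "90210"},
    {"first_name": "jon",      "last_name": "smith", "zip": "90210"},
    {"first_name": "maria",    "last_name": "lopez", "zip": "10001"},
]
print(reduce_phase(map_phase(records), jaccard, threshold=0.5))
```

Note that a record that does not share a blocking key with its true match will never be compared, which is why the choice of blocking attributes (and, as the paper suggests, learning them) matters.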
What are the key technical differentiators?
In this paper, the key technical differentiators discussed are BlockSplit and PairRange. Whereas BlockSplit reduces the search space of record linkage while evenly distributing the workload in the MapReduce model, PairRange uses a load-balancing-based approach for MapReduce entity resolution.
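As a rough illustration of the load-balancing idea behind PairRange (my own sketch, not code from the paper), the snippet below assumes block sizes are already known from a preliminary counting step and assigns near-equal contiguous ranges of globally enumerated comparison pairs to reduce tasks; function names are hypothetical.

```python
def pairs_in_block(size):
    # Number of pairwise comparisons inside a block of `size` records.
    return size * (size - 1) // 2

def assign_pair_ranges(block_sizes, num_reducers):
    # Enumerate all comparison pairs across blocks and split them into
    # near-equal contiguous ranges, one range per reduce task, so a
    # single oversized block cannot overload one reducer.
    total = sum(pairs_in_block(s) for s in block_sizes)
    per_reducer = -(-total // num_reducers)  # ceiling division
    assignments = []  # (block_id, pair_index_within_block, reducer)
    global_idx = 0
    for block_id, size in enumerate(block_sizes):
        for local_idx in range(pairs_in_block(size)):
            reducer = min(global_idx // per_reducer, num_reducers - 1)
            assignments.append((block_id, local_idx, reducer))
            global_idx += 1
    return assignments

# One large block (10 pairs) and two small ones (1 and 3 pairs) spread
# over 3 reducers: the large block's comparisons get split across reducers.
print(assign_pair_ranges([5, 2, 3], num_reducers=3))
```

In an actual MapReduce job the map tasks would derive these indices from the block size distribution rather than materializing a list, but the partitioning principle is the same.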
How did they validate the advantages?
As this is a survey paper, no experiments were conducted; validation of these techniques is left to the next paper reading.
Are there any discussions around the proposal?
Among the key technical differentiators discussed above, load balancing is an area that still lacks sufficient research. Using machine learning to select the best attributes for blocking is identified as future work in this respect.
What are the next papers to read?
The next paper reading will explore one of the future directions identified in this one: how machine learning is being leveraged to perform entity resolution.
Paper title: Record Linkage Approaches in Big Data: A State Of Art Study
Authors: Randa M. Abd El-Ghafar, Mervat H. Gheith and Ali H. El-Bastawissy
Topic tags: [Big Data], [Big Data Integration], [blocking], [entity matching], [entity resolution], [Hadoop], [machine learning], [MapReduce], [Record Linkage]
landing page of paper