shuijian-xu / hive


Data Quality #41

Open shuijian-xu opened 4 years ago

shuijian-xu commented 4 years ago

Data can be:

  1. Missing (or partially missing)

  2. Erroneous

  3. Out of date

  4. Inconsistent

  5. Misleading

shuijian-xu commented 4 years ago

We have to have some defined process for dealing with records where values are missing. There are two main approaches (both are sketched after this list):

  1. Use a default value.

  2. Reject the record.
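A rough HiveQL sketch of the two approaches in a single load step; the staging and warehouse table names and the columns are hypothetical:

```sql
-- Accept rows, applying a default where one makes sense for the column.
INSERT INTO TABLE warehouse.customer
SELECT customer_id,
       COALESCE(contact_method, 'unknown') AS contact_method, -- default a text field
       salary
FROM   staging.customer
WHERE  salary IS NOT NULL;          -- no sensible default exists for salary

-- Sideline the rest for whatever rejection handling we choose.
INSERT INTO TABLE staging.customer_rejects
SELECT *
FROM   staging.customer
WHERE  salary IS NULL;
```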

shuijian-xu commented 4 years ago

Using default values means that the rejection of records can be avoided. They can be used quite effectively on enumerated or text fields and on foreign key fields, where we can simply use the term “unknown” as our default. In the case of foreign keys, we then need to add a record to the lookup table with a primary key of “UN” and a corresponding description.
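As a minimal sketch, assuming a hobby lookup table whose two columns are a code and a description (all names hypothetical):

```sql
-- One-off: add the catch-all master record so the default key always resolves.
INSERT INTO TABLE warehouse.hobby_lookup
VALUES ('UN', 'Unknown');

-- Load step: substitute the default key wherever the source value is missing,
-- so referential integrity to the lookup table is preserved.
INSERT INTO TABLE warehouse.customer_hobby
SELECT customer_id,
       COALESCE(hobby_code, 'UN') AS hobby_code
FROM   staging.customer_hobby;
```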

shuijian-xu commented 4 years ago

If we have, say, 50 different hobbies, it is easy to imagine adding a further “unknown” category. But for a continuous column such as the earlier salary example, how can we possibly define a default value other than null? We cannot apply a text value of “unknown” to a numeric field. The only way to deal with this effectively is to prevent such records from getting into the warehouse. So we reject them.

Rejecting records means that they are not allowed into the warehouse in their current state. There are three main approaches to the rejection of records:

  1. Permanent rejection. In this situation, the record is simply thrown away and never gets into the warehouse. The obvious problem with this solution is that it trades simplicity of operation against accuracy of information. Where the volumes are tiny in proportion to the whole this might be acceptable, but extreme care should be exercised. This type of solution can be applied only to behavioral records, such as sales records, telephone calls, etc. It should not be applied to records relating to circumstances. For instance, if we were to reject a customer record permanently because the address was missing, then every subsequent sale to that customer would also have to be rejected, because there would be no customer master record to match it up to.

  2. Reject for fixing and resubmission. With this approach, the offending records are sidelined into a separate file that is subjected to some remedial processing. Once the problems have been fixed, the records are reentered into the warehouse. The simplest way to do this is to merge the corrected records with the next batch collected from the operational system, so that they are automatically revalidated as before. This sounds like a neat solution, and it is, but beware. First, it means we have to build a set of processes, almost a subsystem, to handle the rejections, allow them to be modified somehow, and place them back in the queue. This subsystem needs some kind of audit control so that any changes made to data can be traced, as there is a clear security implication here. The correction mechanism might be a nice windows-style graphical user interface (GUI) where each record is processed sequentially by one or more operators. Second, the rejection is sometimes caused by an enhancement or upgrade to the source system that ever so slightly changes the records received by the data warehouse. When this happens, two things occur: we do not see the odd rejection, we see thousands or millions of rejections; and, if the format of the records has changed, they are unlikely to fit into the neat screen layouts we designed to correct them. The reject file then contains some true records and a whole pile of garbage. This situation really requires a restore and a rerun, and restores and reruns are problematic in data warehousing. We will cover this subject later in the chapter.

  3. Reject with automatic resubmission. This is a variation on the previous type of rejection. Sometimes, simply resubmitting the offending record is sufficient. An example occurs frequently in mobile telecommunication applications. The customer Lucie Jones goes into her local telephone store, orders a new service, and 15 minutes later leaves the store with her new possession. Almost immediately the telephone is activated for use and Lucie is able to make calls. That evening, all the calls Lucie has made will be processed through the mediation system, on to billing, and then transferred to the data warehouse. However, Lucie's circumstances (customer details), name, address, billing information, etc., are unlikely to find their way to the data warehouse in time to match up to her telephone calls. The missing data, in this example, are Lucie's customer details. The result is that Lucie's calls will be rejected by the warehouse, as it has no customer with which to match them up. But we don't want them passed to some manual remedial process for correction. What we need is for the calls to be recycled and re-presented to the validation process until Lucie's master customer record turns up in the data warehouse. When this happens, the calls will be accepted without further intervention. Clearly there has to be a limit to the number of cycles, so that records that will never become valid can be dealt with in another way.
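A minimal sketch of the recycle-and-resubmit pattern, assuming the rejected calls sit in a recycle table that carries a cycle counter (all names, and the cycle limit of 30, are hypothetical):

```sql
-- Accept the calls whose customer master record has now arrived.
INSERT INTO TABLE warehouse.call_fact
SELECT r.call_id, r.customer_id, r.call_ts, r.duration
FROM   staging.call_recycle r
JOIN   warehouse.customer c
  ON   r.customer_id = c.customer_id;

-- Re-present everything else, bumping the cycle count; records that exhaust
-- the limit would be routed to a dead-letter table for manual handling (not shown).
INSERT OVERWRITE TABLE staging.call_recycle
SELECT r.call_id, r.customer_id, r.call_ts, r.duration,
       r.recycle_count + 1 AS recycle_count
FROM   staging.call_recycle r
LEFT JOIN warehouse.customer c
  ON   r.customer_id = c.customer_id
WHERE  c.customer_id IS NULL
  AND  r.recycle_count < 30;
```

Hive computes the query results before the overwrite takes effect, so the recycle table can be read and rewritten in one statement.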

shuijian-xu commented 4 years ago

Erroneous Data.

Whereas missing data is easy to spot, data that is present but that is simply wrong can sometimes be harder to detect and deal with. Again, there are several types of erroneous data:

  1. Values out of valid range. These are usually easy to recognize. This type of error occurs when, say, the sex of a customer, which should have a domain of “F” or “M” contains a different value, such as “X.” It can also apply to numerical values such as age. Valid ages might be defined as being between 18 and 100 years. Where an age falls outside of this range, it is deemed to be in error.

  2. Referential errors. This is any error that violates some referential integrity constraint. A record is presented with a foreign key that does not match up to any master record. This is similar to the problem we described in the previous section. This could occur if, say, a customer record were to be deleted and subsequent transaction records have nothing to match up to. This kind of problem should not be encountered if the guidelines on deletions, which were described in Chapter 6, are followed. However, this does not mean that referential integrity errors will not occur; they will. How they are dealt with is a matter of internal policy, but it is strongly recommended that records containing referential errors are not allowed into the warehouse because, once in, they are hard to find and will be a major cause of inconsistency in results.

The best way to deal with this type of referential integrity error is to reject the records for remedial processing and resubmission (see the validation sketch after this list).

  3. Valid errors. These are errors in the data that are almost impossible to detect: values that have simply been entered incorrectly, but that are still valid. For instance, a customer's age being recorded as 26 when it should be 62. Typically, errors like this are made in the operational systems that supply the data warehouse and it is in the operational systems that the corrections should be made. In many cases, corrections are ad hoc, relying on sharp-eyed operators or account managers who know their customers and can sometimes, therefore, spot the errors.

We can sometimes take a more proactive approach by, periodically, sending data sheets to our customers that contain their latest details, together with an incentive for them to return the information with any corrections.
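A sketch of the first two checks in one validation pass, rejecting the offenders rather than letting them into the warehouse (all names, and the 18-to-100 age rule, are taken as assumptions):

```sql
INSERT INTO TABLE staging.customer_rejects
SELECT s.*
FROM   staging.customer s
LEFT JOIN warehouse.region r
  ON   s.region_code = r.region_code
WHERE  s.sex NOT IN ('F', 'M')         -- value outside its valid domain
   OR  s.age NOT BETWEEN 18 AND 100    -- value outside its valid range
   OR  r.region_code IS NULL;          -- foreign key with no master record
```

Valid errors, by contrast, pass every such check, which is why they must be caught upstream in the operational systems.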

shuijian-xu commented 4 years ago

Out-of-Date Data. Out-of-date information in a data warehouse is generally the result of synchronizing customers' changing circumstances between the operational systems and the warehouse too infrequently. An example is where a customer's change of address has been recorded in the operational system but not yet in the data warehouse. We should try to receive changes from the operational systems into the warehouse as frequently as possible.
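The practical lever here is the extract frequency. As a sketch, a frequent incremental pull of changed circumstances might look like this (the timestamp column and the last_load_ts variable are hypothetical):

```sql
-- Pull only the customer rows that changed since the previous load; running
-- this more often shrinks the window in which the warehouse is out of date.
INSERT INTO TABLE warehouse.customer_updates
SELECT *
FROM   staging.customer_extract
WHERE  last_updated_ts > '${hiveconf:last_load_ts}';
```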

shuijian-xu commented 4 years ago

Inconsistent Data. Inconsistent data in the data warehouse is a result of dependencies, or causal changes, of which the operational systems may be ignorant. For instance, we may be maintaining a derived regional segment based on customer addresses. This segmentation is held only within the data warehouse and is not known by the operational system. When a customer moves from one address to another, eventually the data warehouse has to be informed about this and will, routinely, make the appropriate adjustment to the customer's circumstances. The regional segment, and any other address-dependent segments, must also be updated. This is a change that is solely under the control of the data warehouse and its management and systems. This can be quite a headache because the segments are dynamic. New ones appear and old ones disappear. The causal effect of changes to circumstances, on the various segments that are dependent on them, is a hard nut to crack.
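As a sketch of the warehouse-side adjustment, assuming the segment is derived from postcode through a mapping table maintained inside the warehouse (all names hypothetical):

```sql
-- Recompute the derived regional segment after address changes have landed.
-- A full recompute is the simplest way to keep every address-dependent
-- segment consistent; the mapping itself can change as segments come and go.
INSERT OVERWRITE TABLE warehouse.customer_segment
SELECT c.customer_id,
       m.region_segment
FROM   warehouse.customer c
JOIN   warehouse.postcode_to_segment m
  ON   c.postcode = m.postcode;
```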

shuijian-xu commented 4 years ago

Misleading Data. Misleading data can occur when apparent retrospective changes are made and the representation of time is not correct. An example is where a change of address occurs and, due to the incorrect application of retrospection, the new address is erroneously applied to historical behavioral data. Oddly enough, the reverse situation also applies. When describing so-called valid errors, the example was where the customer's age was recorded as 26 but should have been 62. If we had classified the age of the customer as having true retrospection, then any correction might have been implemented such that, at a particular point in time, the customer's age jumped from 26 years to 62 years. In this case, it is the “proper” use of retrospection that causes the data to be misleading.
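One common way to get the representation of time right is an effective-dated address history, so that behavioral data joins to the circumstances in force at the time. A sketch, assuming an ACID (transactional) Hive table and with all names and values hypothetical:

```sql
-- Record a genuine change of address: close the current row, open a new one.
-- (UPDATE requires the table to be a Hive transactional table.)
UPDATE warehouse.customer_address
SET    effective_to = current_date()
WHERE  customer_id = 42 AND effective_to IS NULL;

INSERT INTO TABLE warehouse.customer_address
VALUES (42, '12 New Street', current_date(), NULL);

-- Historical behavior then picks up the address in force at call time,
-- so the new address is never applied retrospectively.
SELECT f.call_id, a.address
FROM   warehouse.call_fact f
JOIN   warehouse.customer_address a
  ON   f.customer_id = a.customer_id
WHERE  f.call_ts >= a.effective_from
  AND (f.call_ts < a.effective_to OR a.effective_to IS NULL);
```

A genuine correction, such as the 26-versus-62 age, should instead overwrite the wrong value across its whole history rather than be recorded as a change at a point in time.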