shuijian-xu / hive


Problems Involving Time? #89

Open shuijian-xu opened 5 years ago

shuijian-xu commented 5 years ago

Where the dimensions in a dimensional model are large (some organizations have several million customers), the capture of the data followed by the transfer to the data warehouse environment and subsequent comparison is a process that can be very time-consuming. Consequently, most organizations place limits on the frequency with which this process can be executed. At best, the frequency is weekly. The processing can then take place over the weekend when the systems are relatively quiet and the extra processing required to perform this exercise can be absorbed without too much of an impact on other processing. Many organizations permit only monthly updates to the dimensional data, and some are even less frequent than that.

The problem is that the only transaction time available, against which the changes can be recorded, is the date upon which the change was discovered (i.e., the file comparison date). So, for example, let us assume that the frequency of comparison is monthly and the changes are captured at the end of the month. If a customer changes address, and geographic region, at the beginning of the month, then any facts recorded for the customer during the month will be credited permanently to the old, incorrect region.
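A minimal sketch of the effect described above, using made-up dates, region names, and sales amounts (all hypothetical). It compares the region the warehouse assigns when the discovery date is used as the transaction time with the region the customer was actually in.

```python
from datetime import date

# Hypothetical scenario: the customer really moved on 2 March, but the
# monthly file comparison only discovered the change on 31 March.
actual_change = date(2024, 3, 2)
discovered = date(2024, 3, 31)


def region_as_recorded(sale_date):
    """Region the warehouse assigns when the discovery date is the transaction time."""
    return "North" if sale_date < discovered else "South"


def region_in_reality(sale_date):
    """Region the customer actually belonged to on the sale date."""
    return "North" if sale_date < actual_change else "South"


# Hypothetical facts: (sale date, amount)
sales = [
    (date(2024, 3, 1), 100),
    (date(2024, 3, 10), 250),
    (date(2024, 3, 25), 400),
    (date(2024, 4, 2), 50),
]

# Facts whose recorded region disagrees with the real one
misattributed = [
    (day, amount)
    for day, amount in sales
    if region_as_recorded(day) != region_in_reality(day)
]
misattributed_total = sum(amount for _, amount in misattributed)

print(misattributed_total)
```

In this sketch the two sales made between the real move and the comparison date stay credited to the old region, and nothing in the warehouse indicates that they should be corrected.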

It is possible that, during a single month, more than one change will occur to the same attribute. If the data is collected by the file comparison method, the only values that will be captured are those that are in existence at the time of capture. All intermediate changes will be missed completely.
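The loss of intermediate values can be illustrated with a small sketch of a snapshot comparison. The function name `compare_snapshots`, the customer key, and the region values are all assumptions made for illustration, not part of any particular tool.

```python
def compare_snapshots(previous, current):
    """Return {key: (old_value, new_value)} for keys whose value differs."""
    changes = {}
    for key, new_value in current.items():
        old_value = previous.get(key)  # None if the key is new
        if old_value != new_value:
            changes[key] = (old_value, new_value)
    return changes


# Hypothetical history of customer 42's region in the source system during
# one month: value at the start, a mid-month value, and the value at the end.
source_history = ["North", "East", "South"]

# The file comparison only ever sees the two extracts taken a month apart.
start_snapshot = {42: source_history[0]}
end_snapshot = {42: source_history[-1]}

detected = compare_snapshots(start_snapshot, end_snapshot)
print(detected)
```

The comparison reports a single change from "North" to "South"; the mid-month value "East" never reaches the warehouse.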

shuijian-xu commented 5 years ago

Identifying and capturing the temporal requirements. The first problem is to identify the temporal requirements. There is currently no established method for doing this, and present data warehouse modeling techniques provide no real support for it.

shuijian-xu commented 5 years ago

Capture of dimensional updates. What happens when a relationship changes (e.g., a salesperson moves from one department to another)? What happens when a relationship no longer exists (e.g., a salesperson leaves the company)? How does the warehouse handle changes in attribute values (e.g., a product was blue, now it is red)? Is there a need to be able to report on its sales performance when it was red or blue, as well as for the product throughout the whole of its lifecycle?
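One way to make the last question answerable is to keep a validity period on each version of a dimension row, an approach often called a Type 2 slowly changing dimension. The comments here do not prescribe it; the sketch below is only an illustration, and the field names (`valid_from`, `valid_to`), the product ID, and the sales figures are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical versioned dimension: one row per period in which a colour held.
product_versions = [
    {"product_id": 7, "colour": "blue",
     "valid_from": date(2024, 1, 1), "valid_to": date(2024, 6, 30)},
    {"product_id": 7, "colour": "red",
     "valid_from": date(2024, 7, 1), "valid_to": date.max},
]

# Hypothetical facts: (product_id, sale date, units sold)
sales = [
    (7, date(2024, 2, 15), 10),
    (7, date(2024, 5, 1), 5),
    (7, date(2024, 8, 20), 20),
]


def colour_on(product_id, day):
    """Return the colour that was valid for the product on the given day."""
    for version in product_versions:
        if (version["product_id"] == product_id
                and version["valid_from"] <= day <= version["valid_to"]):
            return version["colour"]
    return None  # no version covers this day


# Sales split by the colour the product had at the time of each sale
units_by_colour = defaultdict(int)
for product_id, day, units in sales:
    units_by_colour[colour_on(product_id, day)] += units

# Sales over the whole lifecycle, regardless of colour
lifetime_units = sum(units for _, _, units in sales)

print(dict(units_by_colour), lifetime_units)
```

The same structure would also represent a relationship that ends (a closed `valid_to` with no successor row), although it is only as accurate as the dates on which changes are captured.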

shuijian-xu commented 5 years ago

The timeliness of capture. It now seems clear that absolute accuracy in a data warehouse is not a practical objective. There is a need to be able to assess the level of inaccuracy so that a degree of confidence can be applied to the results obtained from queries.

shuijian-xu commented 5 years ago

Synchronization of changes. When an attribute changes, a mechanism is required for identifying dependent attributes that might also need to be changed. The absence of synchronization affects the credibility of the results.

shuijian-xu commented 5 years ago

The biggest set of problems lies in the capture and accurate representation of historical information. The problem is most difficult when changes occur in the lifespan of dimensions or in the relationships within dimensional hierarchies, and when attributes change their values and those changes must be faithfully reflected through history.

shuijian-xu commented 5 years ago

One of the main problems is that the business requirements with respect to time have not been systematically captured at the conceptual level. This is largely because we are unfamiliar with temporal semantics, having not, so far, encountered temporal applications.