shuijian-xu / hive

0 stars 0 forks source link

slowly changing dimensions can cause the database to have to generate hundreds or thousands of extraneous new records to be inserted when dealing with simple hierarchies that exist in all organizations. How? #100

Open shuijian-xu opened 4 years ago

shuijian-xu commented 4 years ago

Sales_Area(Sales_Area_Key, Sales Area Code, Manager key, Sales Area Name)

Manager(Manager_Key, Manager Code, Manager Name)

Customer(Customer_Key, Customer code, Customer Name, Customer Address, Sales Area key, Hobby Code, Date Joined)

The second solution to slowly changing dimensions is called Type 2. Type 2 is a more complex solution than Type 1 and does attempt to faithfully record historical values of attributes by providing a form of version control.

We will assume that we have implemented the type two solution to slowly changing dimensions.

If sales area SW was to experience a change of managers from M9 to M12, then a new sales area record would be inserted with an incremented counter, together with the new manager_code. So if the previous record was (1, SW, M9, “South West”), the new record with its new key, assumed to be 5, would contain (5, SW, M12, “South West”).

However, that is not the end of the change. Each of the customers, from the “SW” sales area still have their foreign key relationship references pointing to the original sales area record containing the reference to the old manager (surrogate key 1). Therefore, we also have to create an entire set of new records for the customers, with each of their sales area key values set to “5”. In this case, there are 11,498 new records to be created. It is not valid to simply update the foreign keys with the new value, because the old historical link will be lost.

Where there are complex hierarchies involving more levels and more rows, it is not difficult to imagine very large volumes of inserts being generated. For example, in a four-level hierarchy where the relationship is just 1:100 in each level, a single change at the top level will cause over a million new records to be inserted. A relationship of 1:100 is not inordinately high when there are many data warehouses in existence with several million customers in the customer dimension alone.

The number of extraneous insertions generated by this approach could cause the dimension tables to grow at a rate that, in time, becomes a threat to performance.