shuijian-xu / hive

0 stars 0 forks source link

how do we actually design the data warehouse database? #67

Open shuijian-xu opened 4 years ago

shuijian-xu commented 4 years ago

The dimensional data model used to describe the data requirements was the star schema. Previously I said that the dimensional model was well supported by the relational model. We can create a relational schema that directly reflects the star schema.

The center of the star schema becomes a relational table, as does each of the dimensions of analysis. The center table is called the fact table because it contains all the facts (i.e., the sales) over which most of the queries will be executed. The dimensions become, simply, dimension tables.

The star schema could be interpreted as several tables (the dimensions) each having a one-to-many relationship with the fact table.

shuijian-xu commented 4 years ago

Entity Descriptions:

  1. Sales(CustomerCode, WineCode, OrderTimeCode, AreaCode, Quantity, ItemCost)

  2. Customer(CustomerCode, CustomerName, CustomerAddress )

  3. Wine(WineCode, Name, Vintage, ABV, PricePerBottle, PricePerCase)

  4. Area(AreaCode, AreaDescription)

  5. Time(TimeStamp, Date, PeriodNumber, QuarterNumber, Year)

Before we continue, it is worth explaining a few things about the diagram.

  1. There is no need to record the relationships and their descriptions. In a star schema there is an implicit relationship between the facts and each of the dimensions.

  2. The logical identifier of the facts is a composite identifier comprising all the identifiers of the dimensions.

  3. Notice the introduction of a time dimension table. The reasoning behind this is connected with the previous point about access to the fact table being driven from the dimension tables.

There are some special characteristics about the kind of data held in the fact table. We have described the primary-key attributes, but what about the non-primary-key attributes: Quantity and Item_Cost? These are the real facts in the fact table.

You can visualize lots and lots of rows each containing the composite key, a quantity, and item cost.

The point is, individual rows in themselves are not very meaningful in a system that is designed to answer strategic questions. In order to get meaningful information about trends, etc., the database needs to be able, as we've said before, to answer questions about sales over time. For instance, to obtain the report analyzing sales by product, the individual sales for a particular product have to be added together to form a total for the product. This means that most queries will need to access hundreds, thousands, or even millions of rows.

What this means, in effect, is that the data in the fact table or, more precisely, the facts in the fact table (Quantity and ItemCost) will almost always have some kind of set function applied to them.

Which of the standard column functions do you think will be the most heavily used in SQL queries against the fact table?

The answer is: the sum function. Almost all queries will use the summation function.

shuijian-xu commented 4 years ago

Let's look once more at those famous five questions and see whether our data warehouse can answer them.

Note that questions posed by the users of the data warehouse must usually be “interpreted” before they can be translated into an SQL query.

Question 1: Which product lines are increasing in popularity and which are decreasing?

Select Name,PeriodNumber, sum(Quantity) From Sales S,Wine W,Time T Where S.WineCode = W.WineCode And S.OrderTimeCode = T.TimeCode And T.PeriodNumber Between 012001 and 122002 Group by Name,PeriodNumber Order by Name,PeriodNumber

This query will result in a list of products, each line will show the name, the period, and total quantity sold for that period. If the result set were to be placed into a graph, the products that were increasing and decreasing in popularity would be clearly visible.

Question 2: Which product lines are seasonal?

Select Name, QuarterNumber, sum(Quantity), sum(ItemCost) From Sales S, Wine W, Time T Where S.WineCode = W.WineCode And S.OrderTimeCode = T.TimeCode And T.QuarterNumber Between "Q12001" and "Q42002" Group by Name, QuarterNumber Order by Name, QuarterNumber

Question 3: Which customers place the same orders on a regular basis?

This query shows each customer and the name of the wine, where the customer has ordered the same wine more than once:

Select CustomerName, WineName, Count() as "Total Orders" from Sales S, Customer C, Wine W Where S.CustomerCode = C.CustomerCode And S.WineCode = W.WineCode Group by CustomerName, WineName Having Count() > 1 Order By CustomerName, WineName

Question 4: Are some products more popular in different parts of the country?

This query shows, for each wine, both the number of orders and the total number of bottles ordered by area.

Select WineName, AreaDescription, Count(*) "Total Orders," Sum(Quantity) "Total Bottles" From Sales S, Wine W, Area A, Time T Where S.WineCode = W.WineCode And S.AreaCode = A.AreaCode And S.OrderTimeCode = T.TimeCode And T.PeriodNumber Between 012001 and 122002 Group by WineName,AreaDescription Order by WineName,AreaDescription

Question 5: Do customers tend to purchase a particular class of product?

This query presents us with a problem. There is no reference to the class of wine in the data warehouse. Information relating to classes does exist in the original EAR model. So it seems that the star schema is incomplete.

What we have to do is extend the Schema

Of course the Class information has to undergo the extraction and integration processing before it can be inserted into the database.

A foreign key constraint must be included in the Wine table to refer to the Class table.

The query can now be coded:

Select CustomerName, ClassName, Sum(Quantity) "TotalBottles" From Sales S,Wine W, Customer Cu, Class Cl, Time T Where S.WineCode = W.WineCode And S.CustomerCode = Cu.CustomerCode And W.ClassCode = Cl.ClassCode And S.OrderTimeCode = T.TimeCode And T.PeriodNumber Between 012001 and 122002 Group by CustomerName, ClassName Having Sum(Quantity) > 2 * (Select AVG(Quantity) From Sales S,Wine W, Class C, Time T Where S.WineCode = W.WineCode And W.ClassCode = C.ClassCode And S.OrderTimeCode = T.TimeCode And T.PeriodNumber Between 012001 and 122002) Order by CustomerName, ClassName

The query lists all customers and classes of wines where the customer has ordered that class of wine at more than twice the quantity as the average for all classes of wine. There are other ways that the query could be phrased. It is always a good idea to ask the directors precisely how they would define their questions in business terms before translating the question into an SQL query.

There are any number of ways the directors can question the data warehouse in order to answer their strategic business questions. We have shown that the data warehouse supports those types of questions in a way in which the operational applications could never hope to do.

The queries show very clearly that the arithmetic functions such as AVG() and particularly SUM() are used in just about every case.

Therefore, a golden rule with respect to fact tables can be defined:

The nonkey columns in the fact table must be summable.

Data attributes such as Quantity and ItemCost are summable, whereas text columns such as descriptions are not summable.

Unfortunately, it is not as straightforward as it seems. Care must be taken to ensure that the summation is meaningful. In some attributes the summation is meaningful only across certain dimensions. For instance, ItemCost can be summed by product, customer, area, and time with meaningful results. Quantity sold can be summed by product but might be regarded as meaningless across other dimensions.

Although this problem applies to the Wine Club, it is much more easily explained in a different organization such as a supermarket. While it is reasonable to sum sales revenue across products (e.g., the revenue from sales of apples added to the revenue from sales of oranges and other fresh fruit each contribute toward the sum of revenue for fresh fruit), adding the quantity of apples sold to the quantity of oranges sold produces a meaningless result.

Attributes that are summable across some dimensions, but not all dimensions, are referred to as semisummable attributes. Clearly they have a valuable role to play in a data warehouse, but their usage must be restricted to avoid the generation of invalid results.

shuijian-xu commented 4 years ago

So have we now completed the data warehouse design? Well not quite.

Remember that the fact table may grow to more than 62 million rows over time. There is the possibility, therefore, that a query might have to trawl through every single row of the fact table in order to answer a particular question. In fact, it is very likely that many queries will require a large percentage of the rows, if not the whole table, to be taken into account.

How long will it take to do that? The answer is - quite a long time. Some queries are quite complex, involving multiple join paths, and this will seriously increase the time taken for the result set to be presented back to the user, perhaps to several hours. The problem is exacerbated when several people are using the system at the same time, each with a complex query to run.

In principle, there is no need for rapid responses to strategic queries, as they are very different from the kind of day-to-day queries that are executed while someone is hanging on the end of the telephone waiting for a response. In fact, it could be argued that, previously, the answer was impossible to obtain, so even if the query took several days to execute, it would still be worth it.

That doesn't mean we shouldn't do what we can as designers to try to speed things up as much as possible.

Indexes might help, but in a great deal of cases the queries will need to access more than half the data, and indexes are much less efficient in those cases than a full sequential scan of the tables.

No, the answer lies in summaries.

Remember we said that almost all queries would be summing large numbers of rows together and returning a result set with a smaller number of rows. Well if we can predict, to some degree, the types of queries the users will mostly be executing, we can prepare some summarized fact tables so that the users can access those if they happen to satisfy the requirements of the query. Where the aggregates don't supply the required data, then the user can still access the detail.

If we question the users closely enough we should be able to come up with a set, maybe half a dozen or so, of summarized fact tables. The star schema and the snowflake principles still apply, but the result is that we have several fact tables instead of just one.

It should be emphasized that this is a physical design consideration only. Its only purpose is to improve the performance of the queries.

Some examples of summarization for the Wine Club might be:

  1. Customers by wine for each month
  2. Customers by wine for each quarter
  3. Wine by area for each month
  4. Wine by area for each quarter

Notice that the above examples are summarizing over time. There are other summaries, and you may like to try to think of some, but summarizing over time is a very common practice in data warehouses.

shuijian-xu commented 4 years ago

One technique that is very useful to people using the data warehouse is the ability to drill down from one summary level to a lower, more detailed level. For instance, you might observe that a certain range of products was doing particularly well or particularly badly. By drilling down to individual products, you can see whether the whole range or maybe just one isolated product is affected. Conversely, the ability to drill up would enable you to make sure, if you found one product performing badly, that the whole range is not affected.

This ability to drill down and drill up are powerful reporting capabilities provided by a data warehouse where summarization is used.

The usage of the data warehouse must be monitored to ensure that the summaries are being used by the queries that are exercising the database. If it is found that they are not being used, then they should be dropped and replaced by others that are of more use.

shuijian-xu commented 4 years ago

Summary Navigation

The introduction of summaries raises some questions:

How do users, especially noncomputer professionals, know which summaries are available and how to take advantage of them?

How do we monitor which summaries are, in fact, being used?

One solution is to use a summary navigation tool. A summary navigator is an additional layer of software, usually a third-party product, that sits between the user interface (the presentation layer) and the database.

The summary navigator receives the SQL query from the user and examines it to establish which columns are required and the level of summarization needed.