shuijian-xu / hive

0 stars 0 forks source link

The need to monitor churn and to establish the reasons for it tends to create a need for queries that return results in the form of a time series. The following exemplifies the type of questions we would like to express #40

Open shuijian-xu opened 5 years ago

shuijian-xu commented 5 years ago
  1. How many customers did we lose during the last quarter of 2000, compared to 1999 and 1998? The result of such a query would be a time series containing three periods and a number attached to each period. This is an example of a temporal selection query.

  2. Of the customers who were lost, how many had been customers continuously for at least one year? The loss of long-standing customers might be considered to be as a result of worsening service. The result from this question is also a time series, but the query contains an examination of durations of time. So this query is a state duration query.

  3. How many of these customers experienced a change of administration because they moved in the year that they left? Perhaps they are unhappy with the level of service provided by the new area. This question is concerned with the existence of the relationship between the customer and the sales area. It is also an example of a transition detection query.

  4. How many price changes did they experience in the year that they left? Perhaps they were unhappy with the number of price rises imposed. This is similar to question 3 in that it is a transition detection query, but it is applied to the value of an attribute, in this case the selling price of a bottle of wine, instead of a relationship.

shuijian-xu commented 5 years ago

These requirements cannot be satisfied using a single Boolean attribute to describe the existence of a dimension, as there is a requirement to make comparisons between dates. Neither can the queries be expressed using a single date attribute for the reason previously explained. It seems clear that the expression of such queries requires a composite existence attribute that is, logically, a period comprising a start time and an end time.

It has been shown that row timestamping can provide a solution in many cases, but not all, and the resulting queries are complex to write. A simpler solution is sought.

shuijian-xu commented 5 years ago

The approach to be adopted will be to use a separate existence attribute, in the form of a composite start time and end time, for each dimension, relationship, and attribute where retrospection has been defined to be true. It is assumed that these existence attributes will be held in separate tables. So the existence attribute for customers' existence will be held in a table called “CustomerExist” and the existence attribute for the period during which a customer lives in a sales area will be called “CustomerSalesAreaExist.”

Using the composite existence attribute, the first query can be expressed as in Query Listing 6.2.

Code Listing 6.2. Count of customers who have left during Q4 year on year.

select 'Q4 2000' as quarter, count() from CustomerExist ce where ce.ExistenceEnd between '2000/10/01' and '2000/12/31' union select 'Q4 1999' as quarter, count() from CustomerExist ce where ce.ExistenceEnd between '1999/10/01' and '1999/12/31' union select 'Q4 1998' as quarter, count(*) from CustomerExist ce where ce.ExistenceEnd between '1998/10/01' and '1998/12/31'

The query in Listing 6.2 could have been answered using row timestamping only if the end timestamp was updated to show that the customer was no longer active. The distinction is made between the existence attribute and the row timestamp because the existence attribute is a single-purpose attribute that purely records the existence of the customer. The row timestamp is, as has been stated, a multipurpose attribute that records other types of changes as well. In order to express the query using row timestamps, it would have to be written as a correlated subquery to ensure that only the latest record for the customer was evaluated. This means that discontinuous existences could not be detected.

The second query can be expressed as in Query Listing 6.3.

Code Listing 6.3. Count of long-standing customers lost.

select 'Q4 2000' as quarter, count() from CustomerExist ce where ce.ExistenceEnd between '2000/10/01' and '2000/12/31' and (ce.ExistenceEnd - ce.ExistenceStart) > 365 union select 'Q4 1999' as quarter, count() from CustomerExist ce where ce.ExistenceEnd between '1999/10/01' and '1999/12/31' and (ce.ExistenceEnd - ce.ExistenceStart) > 365 union select 'Q4 1998' as quarter, count(*) from CustomerExist ce where ce.ExistenceEnd between '1998/10/01' and '1998/12/31' and (ce.ExistenceEnd - ce.ExistenceStart) > 365

In order to express the third query, it is assumed that there is a separate existence attribute for the relationship between the customer and the sales area. This is shown in Query Listing 6.4.

Code Listing 6.4. Lost customers who moved.

select 'Q4 2000' as quarter, count() from CustomerExist ce, CustomerSalesAreaExist csa where ce.ExistenceEnd between '2000/10/01' and '2000/12/31' and (ce.ExistenceEnd - ce.ExistenceStart) > 365 and ce.CustomerCode = csa.CustomerCode and csa.ExistenceStart between '2000/01/01' and '2000/12/31' union select 'Q4 1999' as quarter, count() from CustomerExist ce, CustomerSalesAreaExist csa where ce.ExistenceEnd between '1999/10/01' and '1999/12/31' and (ce.ExistenceEnd - ce.ExistenceStart) > 365 and ce.CustomerCode = csa.CustomerCode and csa.ExistenceStart between '1999/01/01' and '1999/12/31' union select 'Q4 1998' as quarter, count(*) from CustomerExist ce, CustomerSalesAreaExist csa where ce.ExistenceEnd between '1998/10/01' and '1998/12/31' and (ce.ExistenceEnd - ce.ExistenceStart) > 365 and ce.CustomerCode = csa.CustomerCode and csa.ExistenceStart between '1998/01/01' and '1998/12/31'

The query in Listing 6.4 is an example of a combined state duration and transition detection query.

As with the other queries, in order to express the fourth query, it is assumed that there is a separate existence attribute for the bottle price (Listing 6.5).

Code Listing 6.5. Lost customers affected by price increases.

select 'Q4 2000' as quarter, ce.CustomerCode, count(distinct spe.WineCode) from CustomerExist ce, SalesPriceExist spe, Sales s, Time t where ce.ExistenceEnd between '2000/10/01' and '2000/12/31' and (ce.ExistenceEnd - ce.ExistenceStart) > 365 and ce.CustomerCode = s.CustomerCode and s.WineCode = spe.WineCode and s.TimeCode = t.TimeCode and t.year = 1998 and spe.ExistenceStart between '2000/01/01' and '2000/12/31' group by quarter, ce.CustomerCode having count(distinct spe.WineCode) > 5 union select 'Q4 1999' as quarter, ce.CustomerCode, count(distinct spe.WineCode) from CustomerExist ce, SalesPriceExist spe, Sales s, Time t where ce.ExistenceEnd between '1999/10/01' and '1999/12/31' and (ce.ExistenceEnd - ce.ExistenceStart) > 365 and ce.CustomerCode = s.CustomerCode and s.WineCode = spe.WineCode and s.TimeCode = t.TimeCode and t.year = 1997 and spe.ExistenceStart between '1999/01/01' and '1999/12/31' group by quarter, ce.CustomerCode having count(distinct spe.WineCode) > 5 union select 'Q4 1998' as quarter, ce.CustomerCode, count(distinct spe.WineCode) from CustomerExist ce, SalesPriceExist spe, Sales s, Time t where ce.ExistenceEnd between '1998/10/01' and '1998/12/31' and (ce.ExistenceEnd - ce.ExistenceStart) > 365 and ce.CustomerCode = s.CustomerCode and s.WineCode = spe.WineCode and s.TimeCode = t.TimeCode and t.year = 1996 and spe.ExistenceStart between '1998/01/01' and '1998/12/31' group by quarter, ce.CustomerCode having count(distinct spe.WineCode) > 5

The query in Query Listing 6.5 shows customers who left the club in the last quarter of the year and who had experienced more than five price changes during the year. This is another example of a combined state duration and transition detection query.

In dealing with the issue of churn, this approach of trying to detect patterns of behavior of customers is typical.

shuijian-xu commented 5 years ago

Referring to the queries expressed in Listings 6.2–6.5, the first query (Listing 6.2) is: How many customers did we lose during the last quarter of 2000, compared to 1999 and 1998? Using the time dimension, it can be expressed as shown in Listing 6.6.

Code Listing 6.6. Count of customers lost during Q4, using the time dimension.

select t.Quarter, count(*) from CustomerExist ce, Time t where ce.ExistenceEnd = t.TimeCode and t.Quarter in ('Q42000', 'Q41999', 'Q41998') group by t.Quarter

The query from Listing 6.3 is: Of the customers who were lost, how many had been customers continuously for at least one year? This can be expressed as follows:

Code Listing 6.7. Count of long-standing customers lost, using the time dimension.

select t.Quarter, count(*) from CustomerExist ce, Time t where ce.ExistenceEnd = t.TimeCode and t.Quarter in ('Q42000', 'Q41999', 'Q41998') and (ce.ExistenceEnd - ce.ExistenceStart) > 365 group by t.Quarter

The third query, from Listing 6.4, is: How many of the customers experienced a change of administration because they moved in the year that they left? Using the same assumptions as before, the query can be expressed as in Query Listing 6.8.

Code Listing 6.8. Lost customers who moved, using the time dimension.

select t1.Quarter, count(*) from CustomerExist ce, CustomerSalesAreaExist csa, Time t1, Time t2 where ce.ExistenceEnd = t1.TimeCode and t1.Quarter in ('Q42000', 'Q41999', 'Q41998') and (ce.ExistenceEnd - ce.ExistenceStart) > 365 and ce.CustomerCode = csa.CustomerCode and csa.ExistenceStart = t2.TimeCode and t2.Year = t1.Year group by t1.Quarter

Finally, the fourth query, from Query Listing 6.5, is: How many price changes did they experience in the year that they left? This can be expressed as shown in Query Listing 6.9.

Code Listing 6.9. Lost customers affected by price increases, using the time dimension.

select t1.Quarter, ce.CustomerCode, count(distinct spe.WineCode) from CustomerExist ce, SalesPriceExist spe, Sales s, Time t1, Time t2, Time t3 where ce.ExistenceEnd = t1.TimeCode and t1.Quarter in ('Q42000', 'Q41999', 'Q41998') and (ce.ExistenceEnd - ce.ExistenceStart) > 365 and ce.CustomerCode = s.CustomerCode and s.WineCode = spe.WineCode and spe.ExistenceStart = t2.TimeCode and s.TimeCode = t3.TimeCode and t2.Year = t1.Year and t3.Year = t2.Year group by t1.Quarter, ce.CustomerCode having count(distinct spe.WineCode) > 5

Thus, we can conclude that allowing the time dimension to be joined to other dimensions, when existence attributes are used, enables a simpler expression of some temporal queries