sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License

Types of insights. #199

Open dylanzxc opened 4 years ago

dylanzxc commented 4 years ago

Hi, I have summarized the insights from Power BI and papers that we can probably include in dataprep. I think we can have a discussion on this.

  1. Category outliers (top/bottom): Highlights cases where one or two categories have much larger values than other categories.
  2. Change points in a time series: Highlights when there are significant changes in trends in a time series of data.
  3. Correlation:
  4. Cross-measure Correlation: Reports cross-measure analysis results regarding the remarkable correlation between two measures.
  5. Scatterplot Clustering (2D clustering): A scatterplot is generated from two measures broken down by a specific dimension. Clustering on the scatterplot complements cross-measure correlation by addressing cases where the data distribution over the 2-dimensional scatterplot is complicated.
  6. Low Variance (Evenness): Detects cases where data points for a dimension are not far from the mean (low variance), in other words, a relatively flat histogram.
  7. Majority (Major factors): Finds cases where a majority of a total value can be attributed to a single factor when broken down by another dimension.
  8. Overall trends in time series: Detects upward or downward trends in time series data.
  9. Seasonality in time series: Finds periodic patterns in time-series data, such as weekly, monthly, or yearly seasonality.
  10. Steady share: Highlights cases where there is a parent-child correlation between the share of a child value in relation to the overall value of the parent across a continuous variable. The steady share insight applies to the context of a measure, a dimension, and another date/time dimension.
  11. Time series outliers: For data across a time series, detects when there are specific dates or times with values significantly different than the other date/time values.
  12. Outstanding N(Top N): N members of a dimension have much larger values than other members of that dimension. (Outstanding No. 1, Outstanding Top 2, Outstanding Last)
  13. Attribution: Among a comparison group with non-negative aggregation results, Attribution shows the fact that the leading value dominates the group.
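As a rough illustration of the simplest categorical insights in the list above (Majority/Attribution and Category outliers), here is a minimal sketch in plain Python. The function names, the 50% share threshold, and the "3 times the median" rule are assumptions chosen for illustration; they are not part of dataprep or of the Microsoft specification.

```python
import statistics
from typing import Dict, List, Optional, Tuple


def majority_insight(
    totals: Dict[str, float], threshold: float = 0.5
) -> Optional[Tuple[str, float]]:
    """Return (category, share) if one category exceeds `threshold` of the total.

    Assumes non-negative aggregation results, as the Attribution insight requires.
    """
    if not totals or any(v < 0 for v in totals.values()):
        return None
    grand_total = sum(totals.values())
    if grand_total == 0:
        return None
    top = max(totals, key=totals.get)
    share = totals[top] / grand_total
    return (top, share) if share > threshold else None


def category_outliers(totals: Dict[str, float], ratio: float = 3.0) -> List[str]:
    """Return categories whose value is at least `ratio` times the median value.

    The ratio rule is a hypothetical heuristic, not a statistical test.
    """
    if not totals:
        return []
    median = statistics.median(totals.values())
    if median <= 0:
        return []
    return sorted(name for name, value in totals.items() if value >= ratio * median)


sales = {"A": 120.0, "B": 15.0, "C": 10.0, "D": 5.0}
print(majority_insight(sales))
print(category_outliers(sales))
```

Both checks operate on already-aggregated values, so they could reuse whatever counts or sums a bar chart has computed.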

The following two links are very helpful, and they contain examples of most of the insights I mentioned above:
https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Insight-Types-Specification.pdf
https://docs.microsoft.com/en-us/power-bi/consumer/end-user-insight-types
@jnwang @jinglinpeng @dovahcrow @brandonlockhart @eutialia @Sanjana12111994 @Waterpine

jnwang commented 4 years ago

Good job @dylanzxc . This is a good starting point. I would like to hear your thoughts on which insights we should implement for dataprep, and whether there are any other insights that we should include.

To answer the first question, you can put each insight into the following categories. Below I show an example of Category outliers. In this way, you can identify which insights are more useful and common to us.

To answer the second question, you can ask yourself what you are looking for when seeing, e.g., a histogram. You can check i) dispersion (whether the data is concentrated around the mean), ii) skewness, and iii) whether the data follows a certain distribution. Then, you may add insights like high dispersion, high skewness, fit to a normal distribution, etc.
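The three histogram checks mentioned above can be sketched with the standard library alone. This is only an illustration: the function name and returned keys are hypothetical, dispersion is measured here by the coefficient of variation, and the normality check uses the Jarque-Bera statistic, which is one of several possible tests and is only reliable for reasonably large samples.

```python
import math
import statistics
from typing import Dict, Sequence


def distribution_summary(values: Sequence[float]) -> Dict[str, float]:
    """Compute dispersion, skewness, and an approximate normality p-value."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population central moments of order 2, 3, and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; moments are undefined")
    std = math.sqrt(m2)
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    # Jarque-Bera statistic; asymptotically chi-square with 2 degrees of freedom,
    # whose survival function is exp(-x / 2).
    jarque_bera = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return {
        "std": std,
        "cv": std / abs(mean) if mean != 0 else math.inf,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "jarque_bera": jarque_bera,
        "normality_p": math.exp(-jarque_bera / 2.0),
    }


symmetric = distribution_summary([1, 2, 3, 4, 5])
right_skewed = distribution_summary([1, 1, 1, 1, 10])
print(symmetric["skewness"], right_skewed["skewness"])
```

An insight such as "high skewness" would then be a threshold on one of these values; the thresholds themselves would still need to be chosen.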

dylanzxc commented 4 years ago

@jnwang Thank you, Professor. You gave me a very good outline to consider this problem. Based on the existing plots we have in dataprep, I think we can include the following insights.

Categorical Data Insights:

  1. Category Outliers: Add the insight with bar chart in plot(df), plot(df,x)

  2. Majority/Attribution (share >=50): Outstanding 1, Outstanding 2, Outstanding last. Add these insights with bar chart and pie chart in plot(df), plot(df,x)

  3. Evenness: Add the insight with bar chart or pie chart in plot(df), plot(df,x)

Numerical Data Insights:

  1. Outstanding 1, Outstanding 2, Outstanding last (for both positive and negative aggregation results): Add these insights with histogram in plot(df), plot(df,x)

A related question is whether we are going to count "outstanding" on each bin or on every single value. If it is on each bin, then it is easy to implement and very similar to the categorical case. However, if it is based on every single value, then we need a hypothesis test algorithm (there is an existing paper to learn from).
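The per-bin option could look roughly like the sketch below: bin the values, then treat the bin counts exactly like categorical totals. The helper names and the "at least twice the runner-up" rule are assumptions for illustration only; this simple ratio heuristic stands in for a proper hypothesis test.

```python
from typing import List, Optional, Sequence


def bin_counts(values: Sequence[float], n_bins: int) -> List[int]:
    """Count values in `n_bins` equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def outstanding_top1(counts: Sequence[int], ratio: float = 2.0) -> Optional[int]:
    """Return the index of the leading bin if it is at least `ratio` times the runner-up."""
    if len(counts) < 2:
        return None
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    first, second = order[0], order[1]
    if counts[first] > 0 and counts[first] >= ratio * counts[second]:
        return first
    return None


values = [0] * 8 + [3, 5, 7, 8]
counts = bin_counts(values, n_bins=4)
print(counts, outstanding_top1(counts))
```

Because the input is just a list of counts, the same `outstanding_top1` check would apply unchanged to category counts from a bar chart.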

  2. Low Variance: Add the insight with the histogram in plot(df), plot(df,x)
  3. Normal distribution: Add the insight with the qq plot in plot(df,x)
  4. Skewness: Add the insight with the kde plot in plot(df,x)
  5. Dispersion: Add the insight with the kde plot in plot(df,x)

Datetime Data Insights:

  1. Change point in a time series
  2. Overall trend in time series
  3. Seasonality in time series
  4. Time series outliers: Add insight with line chart in plot(df), plot(df,x)
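Two of the datetime insights above have simple baseline versions that can be sketched without extra dependencies: an overall trend as the least-squares slope over the observation index, and time series outliers via the modified z-score based on the median absolute deviation. The function names and the 3.5 cutoff are assumptions for illustration; the outlier check ignores trend and seasonality, so a rigorous implementation would need more than this.

```python
import statistics
from typing import List, Sequence


def linear_trend_slope(series: Sequence[float]) -> float:
    """Least-squares slope of the series against its index 0, 1, ..., n - 1."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(series)
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(series))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def mad_outliers(series: Sequence[float], cutoff: float = 3.5) -> List[int]:
    """Return indices whose modified z-score exceeds `cutoff` in absolute value."""
    median = statistics.median(series)
    mad = statistics.median([abs(y - median) for y in series])
    if mad == 0:
        return []  # more than half the points are identical; the score is undefined
    return [
        i for i, y in enumerate(series)
        if abs(0.6745 * (y - median) / mad) > cutoff
    ]


daily = [10, 11, 12, 13, 50, 15, 16]
print(linear_trend_slope([1, 2, 3, 4, 5]))
print(mad_outliers(daily))
```

A positive or negative slope alone does not establish a significant trend; some test or threshold on top of it would still be needed.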

Correlation Insights:

  1. Correlation: Add the insight with the heat map in plot_correlation(df), plot_correlation(df,x)
  2. Scatterplot Clustering: Linghao added a scatterplot to the report. We can use that and generate the insight beside it.

Plot_missing insights

  1. Outstanding N: Add insight with heatmap in plot_missing(df)
  2. Steady Share (we don't have other graphs containing shares): Add insight with hist and bar charts in plot_missing(df,x)

For question 2, are there any other insights that we should include?

I think besides adding more insights, we can discover more dimensions of the datasets, in other words, add more groupby() calls and show the above insights on the new dimensions or cross dimensions. For example, for categorical data, we can group by each category, and for datetime, we can group by month, week, or more specific intervals to yield insights. I think this kind of approach is very common and useful in business analysis areas such as user habit analysis.
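The idea of deriving new dimensions from a datetime column can be sketched as follows. On a pandas DataFrame this would normally be a groupby() on a derived month or weekday key; the version below uses only the standard library to stay self-contained, and the function names and sample records are hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

Record = Tuple[date, float]


def totals_by_month(records: Iterable[Record]) -> Dict[Tuple[int, int], float]:
    """Sum amounts per (year, month)."""
    totals: Dict[Tuple[int, int], float] = defaultdict(float)
    for day, amount in records:
        totals[(day.year, day.month)] += amount
    return dict(totals)


def totals_by_weekday(records: Iterable[Record]) -> Dict[int, float]:
    """Sum amounts per weekday, where 0 is Monday and 6 is Sunday."""
    totals: Dict[int, float] = defaultdict(float)
    for day, amount in records:
        totals[day.weekday()] += amount
    return dict(totals)


records = [
    (date(2020, 1, 6), 10.0),
    (date(2020, 1, 13), 5.0),
    (date(2020, 2, 3), 7.0),
    (date(2020, 2, 4), 3.0),
]
print(totals_by_month(records))
print(totals_by_weekday(records))
```

Each grouped result is a plain mapping from group to total, so the categorical insights (majority, outstanding, evenness) could be evaluated on it directly.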

My plan for this task is to start with the easier insights, using existing graphs and computation results, while doing research at the same time for more ideas. One potential difficulty is that, to generate some of the above insights (e.g., time-series insights), we need an algorithm that is reliable and rigorous from a statistical point of view, such as the one in the following screenshot. This could take some time. [image]

I'd like to hear your advice :) @dovahcrow @jinglinpeng

jinglinpeng commented 4 years ago

Good job @dylanzxc . At a high level, there are two places to show insights:

  1. For each figure in the current functions (plot, plot_correlation, and plot_missing), we show the insights associated with that figure.

  2. We have a place to show the ranked top-k insights of the whole dataset. Still, each insight is associated with a figure.

I think the first part is currently clear. For the second part, it is unclear 1) what new insights we want to add, and 2) whether we need a separate function such as insight(df) or can put it directly into plot(df, insight=True). I agree with your plan. Let us work on the first part as a starting point.