Power BI Aggregation Feature

Aggregation feature allow Power BI to create a cached data inside Power BI Service, where the aggregated data are imported to the cloud and your report can take advantage of this cached data and load your visuals faster because the query will hit first this cached data before to try to submit a query to your data source (direct query).

In order to achieve this you need to know about:

Composite Model (Import + Direct Query in the same model)
How to create a custom SQL query in order to aggregate your data
Using data flows to import your custom query

Reasons to use Aggregation feature:

Large data (100M+ rows)
Speed up the refresh
Reduce model size
Future Proof

Step by step

In order to achive the aggregation feature here's the step by step we need to apply to understand and set the aggregation feature:

Define the grain level
Query the aggregation grain level and import to your model
The fact table and also de dimensions should be using direct query
Apply the composite model (Import for the aggregated data and direct query for the fact/dimensions)
Define your aggregated table
Use DAX studio to test if you're hitting the aggregated data

AndreMaciel66 commented 3 years ago

Working with large datasets from CDP using Power BI

This article explain how to consume large data from Cargill Data Platform using Power BI.

Defining and measure your raw data

In this example, we are going to read the customer profitability table from customers

What is the size of table?

To know the size of your table we have several ways, the simplest in this case is to make a count.

SELECT 
    COUNT(*) AS rows_quantity
FROM
    schema_name.customer_profit

The result of this example is:

rows_quantity
343.165.452

We have already done some tests using dataflows and using our best gateway (closest to Hadoop servers, using Amazon AWS) and unfortunately it is not possible to read, it literally takes over hours until a time-out is returned.

Aggregate your data in dataflows

To get around this problem we will use dataflows to aggregate our data, and simplify and be able to read fewer records for a given analysis.

Therefore, the next query shows the following dimensions and a sum measure.

SELECT account_description, 
       billing_date, 
       bpc_account_parent_4_description, 
       customer_segment_description, 
       l2_code_description, 
       plant_country, 
       Sum(usd_value) AS sum_usd 
FROM   schema.table
WHERE  1 = 1 
       AND bpc_account_parent_4_description = 'Gross Profit'
       AND customer_segment_description IS NOT NULL 
       AND customer_segment_description <> '' 
       AND customer_segment_description <> 'Non Classified' 
       AND customer_segment_description <> 'X' 
GROUP  BY account_description, 
          billing_date, 
          bpc_account_parent_4_description, 
          customer_segment_description, 
          l2_code_description, 
          plant_country

The above query returns a total of 4,250,566 rows, which means only 1% of our total rows.

New entity from a blank query

Open your PBI Workspace
Click in New >> Dataflow
Click in Define new Tables >> Add New Table
On the data sources list select Blank Query

let
  Source = Odbc.Query("host=hostnamem;port=00000;driver={Cloudera ODBC Driver for Impala};usesystemtruststore=1;ssl=1;checkcertrevocation=0;authmech=1","")
in
  Source

Select the Data Source Gateway - Give preference to the AWS gateway
There will be an important message saying "The evaluation was canceled because it contained a native database query for the data source with kind "Odbc" and path "host=Peanut-impala.cargill.com;port=21050;driver={Cloudera ODBC Driver for Impala};usesystemtruststore=1;ssl=1;checkcertrevocation=0;authmech=1". Native queries may be unsafe and alter the database. Click Continue to allow all native queries to run." you can click in continue to accept the native query execution through Power BI Service
Name your query, remember, let's use meaningful
Save your dataflow, again, let's use meaningful name:
Refresh your dataflows and Set the schedule, The refresh time in this example was just 1:19 (one minute and nineteen seconds)

Composite Model

Allows a report to have two or more data connections from different source groups, such as one or more DirectQuery connections and an import connection, two or more DirectQuery connections, or any combination thereof.

Get data from your table (Raw data) using Impala connector + Direct Query
Get data from Dataflows in your Power BI Desktop
Check your model to make sure you'll have the aggregation table as import and your raw data as Direct Query

Applying Aggregation

In order to apply your aggregation, you'll need to map your fields and measures.

In your model view click on the ... from your Aggregated table and Manage Aggregations
Map your field attributes and measures

Tips

You can use DAX Studio to check if you're hitting the aggregation

greendatalab / codash

Power BI - How to apply Aggregation Feature in Power BI - Documentation #19