greendatalab / codash

codash view (front end)
0 stars 0 forks source link

Power BI - How to apply Aggregation Feature in Power BI - Documentation #19

Open diego91964 opened 3 years ago

diego91964 commented 3 years ago
AndreMaciel66 commented 3 years ago

Power BI Aggregation Feature

Aggregation feature allow Power BI to create a cached data inside Power BI Service, where the aggregated data are imported to the cloud and your report can take advantage of this cached data and load your visuals faster because the query will hit first this cached data before to try to submit a query to your data source (direct query).

In order to achieve this you need to know about:

Reasons to use Aggregation feature:

Step by step

In order to achive the aggregation feature here's the step by step we need to apply to understand and set the aggregation feature:

  1. Define the grain level
  2. Query the aggregation grain level and import to your model
  3. The fact table and also de dimensions should be using direct query
  4. Apply the composite model (Import for the aggregated data and direct query for the fact/dimensions)
  5. Define your aggregated table
  6. Use DAX studio to test if you're hitting the aggregated data
AndreMaciel66 commented 3 years ago

Working with large datasets from CDP using Power BI

This article explain how to consume large data from Cargill Data Platform using Power BI.

Defining and measure your raw data

In this example, we are going to read the customer profitability table from customers

What is the size of table?

To know the size of your table we have several ways, the simplest in this case is to make a count.

SELECT 
    COUNT(*) AS rows_quantity
FROM
    schema_name.customer_profit

The result of this example is:

rows_quantity
343.165.452

We have already done some tests using dataflows and using our best gateway (closest to Hadoop servers, using Amazon AWS) and unfortunately it is not possible to read, it literally takes over hours until a time-out is returned.

Aggregate your data in dataflows

To get around this problem we will use dataflows to aggregate our data, and simplify and be able to read fewer records for a given analysis.

Therefore, the next query shows the following dimensions and a sum measure.

SELECT account_description, 
       billing_date, 
       bpc_account_parent_4_description, 
       customer_segment_description, 
       l2_code_description, 
       plant_country, 
       Sum(usd_value) AS sum_usd 
FROM   schema.table
WHERE  1 = 1 
       AND bpc_account_parent_4_description = 'Gross Profit'
       AND customer_segment_description IS NOT NULL 
       AND customer_segment_description <> '' 
       AND customer_segment_description <> 'Non Classified' 
       AND customer_segment_description <> 'X' 
GROUP  BY account_description, 
          billing_date, 
          bpc_account_parent_4_description, 
          customer_segment_description, 
          l2_code_description, 
          plant_country

The above query returns a total of 4,250,566 rows, which means only 1% of our total rows.

New entity from a blank query

  1. Open your PBI Workspace
  2. Click in New >> Dataflow
  3. Click in Define new Tables >> Add New Table
  4. On the data sources list select Blank Query
let
  Source = Odbc.Query("host=hostnamem;port=00000;driver={Cloudera ODBC Driver for Impala};usesystemtruststore=1;ssl=1;checkcertrevocation=0;authmech=1","")
in
  Source
  1. Select the Data Source Gateway - Give preference to the AWS gateway
  2. There will be an important message saying "The evaluation was canceled because it contained a native database query for the data source with kind "Odbc" and path "host=Peanut-impala.cargill.com;port=21050;driver={Cloudera ODBC Driver for Impala};usesystemtruststore=1;ssl=1;checkcertrevocation=0;authmech=1". Native queries may be unsafe and alter the database. Click Continue to allow all native queries to run." you can click in continue to accept the native query execution through Power BI Service
  3. Name your query, remember, let's use meaningful
  4. Save your dataflow, again, let's use meaningful name:
  5. Refresh your dataflows and Set the schedule, The refresh time in this example was just 1:19 (one minute and nineteen seconds)

Composite Model

Allows a report to have two or more data connections from different source groups, such as one or more DirectQuery connections and an import connection, two or more DirectQuery connections, or any combination thereof.

  1. Get data from your table (Raw data) using Impala connector + Direct Query
  2. Get data from Dataflows in your Power BI Desktop
  3. Check your model to make sure you'll have the aggregation table as import and your raw data as Direct Query

Applying Aggregation

In order to apply your aggregation, you'll need to map your fields and measures.

  1. In your model view click on the ... from your Aggregated table and Manage Aggregations
  2. Map your field attributes and measures

Tips