status-im / insights-bi


Extracting, transforming and loading financial data (including from Iplicit) for use in dashboards #13

Open AdamDynamic opened 7 months ago

AdamDynamic commented 7 months ago

Background

Datasets

Three datasets are relevant to the request:

  1. Actuals data - historic financial data, as incurred by the organisation over time. Held in Iplicit and Xero (the latter for periods prior to 31 December 2023)
  2. Budget data - A dataset of costs agreed by the organisation for a certain Program or Project at a certain point in time, extending over an agreed period (e.g. agreed in Q4 2024 and applied to a certain Program for the period 01 January 2025 - 31 December 2025). Held in Iplicit for 2024.
  3. Forecast data - Periodically produced by the Finance team to estimate what the Actuals data will be over a period in the future. Note that this dataset may be produced using Monte Carlo simulations of the various inputs, creating datasets of N possible outcomes. Location and format are to be confirmed (see below)

Request

  1. That a data taxonomy be agreed so that the datasets described above, and any existing datasets included in the MI database, are standardised using this taxonomy per this ticket raised previously (note: I no longer have access to that repo so I'm not able to determine the status of the issue)
  2. That actuals data and budget data held in Iplicit be extracted, standardised into the agreed format, and populated into the reporting database
  3. That the extraction process for actuals and budget data runs automatically, on times and dates that can be easily configured by the Finance team
  4. That a format for forecast datasets be agreed with the Finance team (e.g. a *.csv file populated with data standardised per the point above, and with certain fields and data types)
  5. That the Finance team be provided with the tools and knowledge required to upload (and if necessary, delete) forecast data from the reporting database

Timing

Assumptions

  1. The existing MI database that is used and maintained by the Insights team should be used to hold the actuals, budget and forecast data

References

Fergulati commented 6 months ago

Hello hello!

Wanted to give an update on the current status. Lalo, please feel free to hop in with any details I may be missing.

The current taxonomy that we have is here:

I was also auto-booted from the other repo so I cannot see the previous taxonomy. But I'm sure that with some quick fixes in Airflow we can do quick transformations.

Currently, we are ingesting the actuals and budget data from Iplicit via an API endpoint they provide for a function called an "Enquiry." This Enquiry gives us all the actuals and budget data, and lets us identify the program/project we need via the fields parentdepartment and department.

We are transforming this data via Airflow and Airbyte. Airflow allows us to automate the frequency with which we pull the data from Iplicit. Airbyte allows us to apply scripts to transform that data. From there we push the Actuals, Budget, and soon the Forecast datasets into Superset to build atop and visualize. From what I see in Iplicit, we may even be able to add forecasts directly into the Budget data via a prebuilt Enquiry that includes forecast numbers; this is worth looking into.
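To give a flavour of the transformation step, here is a minimal sketch of the kind of standardisation query that could run in Airbyte or a dbt model. The raw enquiry column names and the target taxonomy names are illustrative assumptions, not the agreed taxonomy:

```sql
-- Minimal sketch: raw Iplicit enquiry rows -> standardised taxonomy columns.
-- All table and column names below are illustrative assumptions.
SELECT
    parentdepartment AS program_id,   -- Program, per the fields mentioned above
    department       AS project_id,   -- Project
    accountcode      AS account_code, -- cost type (travel, subscriptions, ...)
    transactiondate  AS period,       -- assumed date field on the enquiry
    amount           AS amount_usd    -- FX conversion to USD could also happen here
FROM raw_iplicit.budget_vs_actual_enquiry;
```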

Moving forward, I'll get access to that repo again to ensure that the taxonomy we have aligns with what we wanted, while Lalo and Johannes work on adding the forecast data.

AdamDynamic commented 6 months ago

Thanks @Fergulati. Some questions / comments:

> The current taxonomy that we have is here:

Are these parameters native to Iplicit? Or are they applied to all data in the lake (including non-financial data)?

If the former, which parameters are being dropped / aggregated when the actuals and / or budget data are pulled into the data lake (or are we copying the data on a line-by-line basis)? If we are indeed copying every record across to the data lake, are you sufficiently confident that the infrastructure / reporting will scale as the volume of transactions increases? (i.e. users won't have to wait 30 seconds for a report to load because it's reading 2 million rows of data every time...)

In either instance, what are the valid values for each of the fields? e.g. if I want to compare the marketing spend for Codex in Q1 2024 to the change in their number of twitter followers in the same period, what parameter would I need to filter both data sets on (department? parentdepartment?) and what is the standardised code / tag for the Codex project?

> Airbyte allows us to apply scripts to transform that data

> That a format for forecast datasets be agreed with the Finance team (e.g. a *.csv file populated with data standardised per the point above, and with certain fields and data types)

Regarding Airbyte and the original request quoted above, how can we figure out what subset of the parameters listed above needs to be included in the forecast *.csv file in order for the forecast data, once transformed by Airbyte and imported into the data lake, to be usable / useful?

To pick an example, I'm assuming we don't need to include an invoicedate field in the forecast (the forecast won't be that granular), but we would need to include account codes (accountcode?) to distinguish different cost types (travel, subscriptions, etc.).
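To make that concrete, here is a hedged sketch of the minimal schema such a *.csv could map to. Every table name, field name and type below is a suggestion for discussion, not an agreed format:

```sql
-- Hypothetical minimal schema for uploaded forecast data;
-- names and types are illustrative assumptions only.
CREATE TABLE raw_finance.forecast (
    period           DATE           NOT NULL, -- month being forecast (no invoicedate needed)
    parentdepartment INTEGER        NOT NULL, -- Program, matching the Iplicit field
    department       INTEGER        NOT NULL, -- Project, matching the Iplicit field
    accountcode      INTEGER        NOT NULL, -- cost type (travel, subscriptions, etc.)
    amount_usd       NUMERIC(18, 2) NOT NULL, -- forecast cost, converted to USD
    scenario         TEXT                     -- e.g. Monte Carlo run identifier
);
```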

JoseAnaya28 commented 6 months ago

Hey @AdamDynamic

The data is fetched through an API call that executes an Enquiry. Enquiries within Iplicit are pre-built reports that you can run to generate results in real time.

The schema is then a result of the enquiry report we are using. In this case we are using the budget vs actual enquiry, which you can run yourself on the Analytics tab within Iplicit. The result is the table mentioned above by @Fergulati. This is the raw table we are receiving from the API call, so nothing has been dropped or aggregated; it's a direct copy of the enquiry.

To run the enquiry we need to select a budget. To test the connection we've been using "forecast 3". The connector we've built has the capacity to ingest multiple enquiries if needed.

Enquiry tab in Iplicit:

*(screenshot)*

Superset explorer: *(screenshot)*

The data architecture is designed so that if we start having performance issues, we can scale the system up. We are also only fetching the data relevant to the enquiry, so it's not copying all the records within Iplicit.

Team and project names are always matched to BBHR organizational data. The naming convention of teams and projects is then managed and standardized through BBHR for all dashboards.

Regarding the structure of the forecast *.csv file, you can find my proposal here. Let me know if you have any requests or changes.

Finally, this is available in Superset already. We can start building a dashboard using the test data from the sandbox. https://superset.bi.status.im/superset/dashboard/51/

AdamDynamic commented 6 months ago

Thanks Lalo. Some initial questions / comments:

  1. Is the data held in a database? Or is it being pulled from the Iplicit API in real time? If the latter, how can this data set be combined with other datasets in the data lake? For example, if I wanted to display a bar chart of operating costs over 12 months for a specific Program, where the first 4 months of plotted data were actuals data from Iplicit, and the last 8 months were forecast data (uploaded via a *.csv file, or otherwise), how should I do that?

  2. How can fields be added, filtered or omitted on the Iplicit side, before they are pulled through to dashboards? i.e. I don't want to expose the descriptions in Superset, and some fields won't be relevant / useful. We will also need to simplify how FX is presented (i.e. convert all costs / revenue to USD).

  3. The department, parentdepartment and accountcode fields are numeric; how should the user map these to their corresponding text descriptions? i.e. which department is Codex, Nomos, etc.?

> e.g. if I want to compare the marketing spend for Codex in Q1 2024 to the change in their number of twitter followers in the same period, what parameter would I need to filter both data sets on (department? parentdepartment?) and what is the standardised code / tag for the Codex project?

> Team and project names are always matched to BBHR organizational data. The naming convention of teams and projects is then managed and standardized through BBHR for all dashboards.

  4. I understand that we ingest social media metrics into the data lake. To which specific field in BambooHR is e.g. the Status App twitter account mapped in the data lake? On what field of each data set would I need to JOIN the data in order to compute e.g. total operating cost per new twitter follower, split by month, from 01 January 2024 to 31 May 2024?

  5. Minor point, but when I modify the table, the scroll bars seem to constantly flash on and off.

*(screenshots)*

JoseAnaya28 commented 6 months ago

  1. The connector extracts the latest data from Iplicit at scheduled intervals and stores it in our data warehouse. The data remains current until the connector's next execution. Various data ingestion methods are also supported, such as incremental updates and full refreshes. This process applies to both actuals and forecast data, regardless of how we upload it. You can query, transform, and visualize these datasets using Superset, or apply custom transformations through a dbt model. Additionally, because all datasets are stored together in the data warehouse, you can easily combine data from different sources. In essence, you will have access to two tables in the data warehouse. We can then write a query that retrieves four months of data from Iplicit and eight months of forecast data, and filters by a specific program (see the first sketch after this list).

A basic query combining both datasets, given that they have the same structure, would look something like this:

```sql
SELECT * FROM raw_iplicit.actuals
UNION ALL
SELECT * FROM raw_finance.forecast
```

  2. There are two ways. You can modify the enquiry to only include the fields you need, or we can configure the connector with the required fields. For example, if we didn't need AccountId and AccountCode, we can tell the connector not to fetch those fields:
*(screenshot)*

3 and 4. The ID values for department and parentdepartment match the ID values for the BBHR project and program. You can join these two and get the text fields. Social tables also have project and program fields, on which you could join actuals data to twitter data (see the second sketch after this list).

  5. Yes, this was just to show that the data is already available within Superset. It'll look a lot better.
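To make question 1 concrete, here is a sketch of the specific example you gave: four months of actuals plus eight months of forecast, filtered to one Program. The month, parentdepartment and amount column names are assumptions pending the agreed taxonomy:

```sql
-- Illustrative only: month, parentdepartment and amount are assumed column names.
SELECT month, SUM(amount) AS operating_cost
FROM (
    SELECT month, parentdepartment, amount
    FROM raw_iplicit.actuals
    WHERE month BETWEEN '2024-01-01' AND '2024-04-30' -- first 4 months: actuals
    UNION ALL
    SELECT month, parentdepartment, amount
    FROM raw_finance.forecast
    WHERE month BETWEEN '2024-05-01' AND '2024-12-31' -- last 8 months: forecast
) AS combined
WHERE parentdepartment = 123 -- hypothetical Program ID
GROUP BY month
ORDER BY month;
```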
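And for questions 3 and 4, a join along those lines might look like the sketch below; the BBHR and social table and column names are illustrative assumptions:

```sql
-- Hypothetical table and column names throughout; only department/parentdepartment
-- come from this thread. Resolves numeric IDs to names and joins costs to followers.
WITH costs AS (
    SELECT parentdepartment, month, SUM(amount) AS operating_cost
    FROM raw_iplicit.actuals
    GROUP BY parentdepartment, month
),
followers AS (
    SELECT program_id, month, SUM(new_followers) AS new_followers
    FROM raw_social.twitter
    GROUP BY program_id, month
)
SELECT p.program_name,
       c.month,
       c.operating_cost,
       f.new_followers,
       c.operating_cost / NULLIF(f.new_followers, 0) AS cost_per_new_follower
FROM costs c
JOIN raw_bbhr.programs p ON p.program_id = c.parentdepartment
JOIN followers f ON f.program_id = c.parentdepartment AND f.month = c.month
WHERE c.month BETWEEN '2024-01-01' AND '2024-05-31'
ORDER BY c.month;
```

Pre-aggregating costs and followers separately before joining avoids double-counting followers in months that have many cost lines.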

Happy to jump on a call if you have any questions or to show you how it works.