AdamDynamic opened 7 months ago
Hello hello!
Wanted to give an update on the current status. Lalo, please feel free to hop in with any details I may be missing.
The current taxonomy that we have is here:
I was also auto-booted from the other repo, so I cannot see the previous taxonomy. But I'm sure that with some quick fixes in Airflow we can do quick transformations.
Currently, we are ingesting the actuals and budget data from Iplicit via an API endpoint they provide for a function called an "Enquiry." This Enquiry returns all the actual and budget data, and identifies the relevant program/project via the `parentdepartment` and `department` fields.
We are transforming this data via Airflow and Airbyte. Airflow allows us to automate the frequency with which we pull the data from Iplicit, and Airbyte allows us to apply scripts to transform that data. From there we push the Actuals, Budget, and soon the Forecast datasets into Superset to build atop and visualize. From what I see in Iplicit, we may even be able to add forecasts directly into the Budget data via a prebuilt Enquiry that includes forecast numbers; this is worth looking into.
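As a rough illustration of what the transformation step does, here is a minimal Python sketch. The input field names (`parentdepartment`, `department`, `accountcode`) come from the Enquiry described above; the output key names and the `normalize_enquiry_rows` helper are hypothetical, not the actual Airbyte script.

```python
# Sketch of the transform applied to raw Enquiry rows before loading.
# Input keys come from the Iplicit "budget vs actual" Enquiry; the
# output taxonomy keys ("program", "project") are assumed names only.

def normalize_enquiry_rows(rows):
    """Map raw budget-vs-actual Enquiry records onto the lake taxonomy."""
    normalized = []
    for row in rows:
        normalized.append({
            "program": row["parentdepartment"],  # top-level grouping
            "project": row["department"],        # team / project code
            "accountcode": row["accountcode"],   # cost type (travel, etc.)
            "amount": float(row["amount"]),      # amounts arrive as strings
        })
    return normalized

raw = [{"parentdepartment": 10, "department": 42,
        "accountcode": 6100, "amount": "125.50"}]
print(normalize_enquiry_rows(raw))
```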
Moving forward, I'll get access to that repo again to ensure that the taxonomy we have aligns with what we wanted, while Lalo and Johannes work to add forecast data.
Thanks @Fergulati. Some questions / comments:
> The current taxonomy that we have is here:
Are these parameters those that are native to Iplicit? Or are these parameters those that are applied to all data in the lake (including non-financial data)?
If the former, which parameters are being dropped / aggregated when the actuals and / or budget data is being pulled into the data lake (or are we copying the data on a line-by-line basis?) If indeed we are copying every record across to the data lake, are you sufficiently confident that the infrastructure / reporting will scale as the volume of transactions increases? (i.e. users won't have to wait 30 seconds for a report to load because it's reading 2 million rows of data every time...)
In either instance, what are the valid values for each of the fields? e.g. if I want to compare the marketing spend for Codex in Q1 2024 to the change in their number of twitter followers in the same period, what parameter would I need to filter both data sets on (`department`? `parentdepartment`?) and what is the standardised code / tag for the Codex project?
> Airbyte allows us to apply scripts to transform that data

> - That a format for forecast datasets be agreed with the Finance team (e.g. a *.csv file populated with data standardised per the point above, and with certain fields and data types)
Regarding Airbyte and the original request quoted above, how can we figure out what subset of the parameters listed above need to be included in the forecast `*.csv` file in order for the forecast data, once transformed by Airbyte and imported into the data lake, to be useable / useful?
To pick an example, I'm assuming we don't need to include an `invoicedate` field in the forecast (the forecast won't be that granular), but we would need to include account codes (`accountcode`?) to distinguish different cost types (travel, subscriptions, etc.).
Hey @AdamDynamic
The data is fetched through an API call that executes an Enquiry. Enquiries within Iplicit are pre-built reports that you can run to generate real-time reports.
The schema is then a result of the enquiry report we are using. In this case we are using the `budget vs actual` enquiry, which you can run yourself on the Analytics tab within Iplicit. The result is the table mentioned above by @Fergulati. This is the raw table we are receiving from the API call, so nothing has been dropped or aggregated; it's a direct copy of the enquiry.
To run the enquiry we need to select a budget. To test the connection we've been using "forecast 3". The connector we've built has the capacity to ingest multiple enquiries if needed.
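To make the shape of such an API call concrete, here is a hedged sketch of how a connector might build the request for one Enquiry run. The host, path, query parameter and auth scheme are all placeholders, not the real Iplicit API, which the connector configuration defines.

```python
import urllib.request
from urllib.parse import urlencode

# Placeholder host -- NOT the real Iplicit API endpoint.
BASE_URL = "https://api.iplicit.example"

def build_enquiry_request(enquiry_name, budget, token):
    """Build (but do not send) an HTTP request for one Enquiry run.

    The path layout, `budget` query parameter, and bearer-token auth
    are assumptions for illustration only.
    """
    query = urlencode({"budget": budget})  # URL-encodes e.g. "forecast 3"
    url = f"{BASE_URL}/enquiries/{enquiry_name}?{query}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
    )

req = build_enquiry_request("budget-vs-actual", "forecast 3", "TOKEN")
print(req.full_url)
```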
Enquiry tab in Iplicit:
Superset explorer:
The data architecture is designed so that if we start having performance issues, we can scale the system up. We are also only fetching the data relevant to the enquiry, so we are not copying all the records within Iplicit.
Team and project names are always matched to BBHR organizational data. The naming convention of teams and projects is then managed and standardized through BBHR for all dashboards.
Regarding the structure of the forecast *.csv file, you can find my proposal here. Let me know if you have any requests or changes.
Finally, this is available in superset already. We can start building a dashboard using the test data from the sandbox. https://superset.bi.status.im/superset/dashboard/51/
Thanks Lalo. Some initial questions / comments:
Is the data held in a database? Or is it being pulled from the Iplicit API in real time? If the latter, how can this data set be combined with other datasets in the data lake? For example, if I wanted to display a bar chart of operating costs over 12 months for a specific Program, where the first 4 months of plotted data were actuals data from Iplicit, and the last 8 months were forecast data (uploaded via a `*.csv` file, or otherwise), how should I do that?
How can fields be added, filtered or omitted on the Iplicit side, before they are pulled through to dashboards? i.e. I don't want to expose the descriptions in Superset, and some fields won't be relevant / useful. We will also need to simplify how FX is presented (i.e. convert all costs / revenue to USD).
The `department`, `parentdepartment` and `accountcode` fields are numeric; how should the user map these to their corresponding text descriptions? i.e. which department is Codex, Nomos, etc?
> e.g. if I want to compare the marketing spend for Codex in Q1 2024 to the change in their number of twitter followers in the same period, what parameter would I need to filter both data sets on (department? parentdepartment?) and what is the standardised code / tag for the Codex project?

> Team and projects names are always matched to BBHR organizational data. The naming convention of teams and projects is then managed and standardized through BBHR for all dashboards.
I understand that we ingest social media metrics into the data lake. Against which specific field in BambooHR is e.g. the Status App twitter account mapped in the data lake? On what field of each data set would I need to `JOIN` the data in order to compute e.g. total operating cost per new twitter follower, split by month, from 01 January 2024 to 31 May 2024?
Minor point, but when I modify the table, the scroll bars seem to constantly flash on and off.
A basic query combining both datasets given that they have the same structure would look something like this:
```sql
SELECT * FROM raw_iplicit.actuals
UNION ALL
SELECT * FROM raw_finance.forecast
```
3 and 4. The ID values for `department` and `parentdepartment` match the ID values for BBHR `project` and `program`. You can join these two and get the text fields. Social tables also have `project` and `program` fields where you could join actuals data to twitter data.
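To show what those joins could look like end to end, here is a toy sqlite sketch with made-up data. The table names (`actuals`, `bbhr_org`, `social_metrics`) and row values are assumptions based on this thread, not the real lake schema; only the column names follow the discussion above.

```python
import sqlite3

# Toy reproduction of the joins described above: department /
# parentdepartment IDs resolve to BBHR project / program names,
# and the same project ID links actuals to twitter metrics.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE actuals (department INT, parentdepartment INT,
                      month TEXT, amount REAL);
CREATE TABLE bbhr_org (project INT, program INT,
                       project_name TEXT, program_name TEXT);
CREATE TABLE social_metrics (project INT, month TEXT, new_followers INT);

INSERT INTO actuals VALUES (42, 10, '2024-01', 5000.0);
INSERT INTO bbhr_org VALUES (42, 10, 'Codex', 'Research');
INSERT INTO social_metrics VALUES (42, '2024-01', 250);
""")

# Operating cost per new twitter follower, per project and month.
rows = con.execute("""
SELECT b.project_name, a.month,
       SUM(a.amount) / SUM(s.new_followers) AS cost_per_follower
FROM actuals a
JOIN bbhr_org b ON a.department = b.project
               AND a.parentdepartment = b.program
JOIN social_metrics s ON s.project = a.department
                     AND s.month = a.month
GROUP BY b.project_name, a.month
""").fetchall()
print(rows)  # -> [('Codex', '2024-01', 20.0)]
```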
Happy to jump on a call if you have any questions or to show you how it works.
Background
Datasets
Three datasets are relevant to the request:
Request
- That a format for forecast datasets be agreed with the Finance team (e.g. a *.csv file populated with data standardised per the point above, and with certain fields and data types)

Timing
Assumptions
References