[Hubs] Support for large datasets

flanakin commented 1 year ago

📝 Scenario

As a FinOps practitioner, I need to ingest data into a queryable data store in order to report on data at scale beyond $5M/mo

💎 Solution

Support large datasets (e.g., 500 GB/mo) with up to 7 years of historical data that refreshes when changed. To do this, add an option to ingest data into Azure Data Explorer and update reporting to use that database.

📋 Tasks

### Required tasks
- [x] Decide on data store: SQL, ADX, Synapse
- [ ] #300
- [ ] #301
- [ ] #376
- [ ] Update to ingest FOCUS 1.0
- [ ] De-duplicate data that gets re-exported by Cost Management
- [x] Confirm the ADX SKU
- [ ] Update Power BI reports
- [ ] #670
- [ ] #671
- [ ] Update CreateUiDefinition.json
- [ ] Create pipeline to start/shutdown the ADX cluster based on settings.json config
- [ ] Auto-start ADX before and shutdown after ingestion
- [ ] Run CM exports on a custom schedule
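
One way to picture the de-duplication task above: when Cost Management re-exports data for a period that was already exported, only the newest export for that period should be ingested. The sketch below is a minimal illustration, not the toolkit's implementation. The `ExportFile` type and its `scope`, `month`, `exported_at`, and `path` fields are hypothetical names, and it assumes each re-export fully supersedes earlier exports for the same scope and month.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ExportFile:
    # All fields are hypothetical names chosen for this illustration.
    scope: str             # billing scope the export belongs to
    month: str             # billing month covered, e.g. "2024-05"
    exported_at: datetime  # when the export run produced this file
    path: str              # location of the parquet file in storage


def latest_exports(files):
    """Keep only the most recent export for each (scope, month) pair."""
    latest = {}
    for f in files:
        key = (f.scope, f.month)
        # Replace the stored file only if this one is strictly newer.
        if key not in latest or f.exported_at > latest[key].exported_at:
            latest[key] = f
    # Sort by scope, then month, so the ingestion order is deterministic.
    return sorted(latest.values(), key=lambda f: (f.scope, f.month))
```

If re-exports can be partial rather than full replacements, this approach would not be enough, and de-duplication would need to happen at the row level instead.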

### Stretch goals
- [ ] Backfill all data in storage during setup
- [ ] Implement retention policies for parquet data
- [ ] Should we archive parquet data after ingestion?
- [ ] #377 
- [ ] #667 
- [ ] #668

ℹ️ Additional context

An internal analysis of the optimal data store for the largest datasets found that Azure Data Explorer was the best option, balancing cost, performance, and scale.

🙋‍♀️ Ask for the community

We could use your help:

- Please vote this issue up (👍) to prioritize it.
- Leave comments to help us solidify the vision.

flanakin commented 9 months ago

Closing this since we're tracking releases in a new way now and this is outdated.

t-esslinger commented 9 months ago

Hello @flanakin, is this feature still in your backlog? We would be highly interested in also being able to handle larger datasets more easily.

flanakin commented 4 months ago

@t-esslinger Sorry for missing the comment. Yes, this is still in the backlog. We're making progress slowly. I'm reopening this issue to track everything needed.

microsoft / finops-toolkit