The Enterprise Scale AI Factory is a plug-and-play solution that automates the provisioning, deployment, and management of AI projects on Azure, using a template-based way of working.
1) Marry multiple best practices & accelerators: It reuses multiple existing Microsoft accelerators and landing zone architectures, as well as best practices such as CAF & WAF, and provides an end-to-end experience including Dev, Test, and Prod environments.
2) Private networking: Private endpoints for all services, such as Azure Machine Learning, a private AKS cluster, a private Container Registry, Storage, Azure Data Factory, Monitoring, etc.
3) Plug-and-play: Dynamically create infrastructure resources per team, including networking and RBAC.
4) Template way of working & Project way of working: The AI Factory is project-based (cost control, privacy, scalability per project) and provides multiple templates besides the infrastructure template: a DataLake template, DataOps templates, and MLOps templates, with selectable project types.
   - Same MLOps: whether data scientists choose to work from Azure Databricks or Azure Machine Learning, the same MLOps template is used.
   - Common way of working, common toolbox, a flexible one: a toolbox with a LAMBDA architecture, with tools such as Azure Data Factory, Azure Databricks, Azure Machine Learning, Event Hubs, and AKS.
5) Enterprise scale & security & battle tested: Used by customers and partners with MLOps since 2019 (see LINKS) to accelerate the development and delivery of AI solutions, with common tooling and multiple married best practices. Private networking (private endpoints) by default.

LINKS:
- AI Factory: setup in 60h, End-to-End pipelines for use case (How-to)
- AI Factory: Technical BLOG
- Microsoft: AI Factory documentation (CAF/MLOps): Machine learning operations - Cloud Adoption Framework | Microsoft Learn
The documentation is organized around ROLES, via doc series:
Doc series | Role | Focus | Details |
---|---|---|---|
10-19 | CoreTeam | Governance | Setup of AI Factory. Governance. Infrastructure, networking. Permissions |
20-29 | CoreTeam | Usage | User onboarding & AI Factory usage. DataOps for the CoreTeam's data ingestion team |
30-39 | ProjectTeam | Usage | Dashboard, Available Tools & Services, DataOps, MLOps, Access options to the private AIFactory |
40-49 | All | FAQ | Various frequently asked questions. Please look here before contacting an ESML AIFactory mentor. |
It is also organized via the four components of the ESML AIFactory:
Component | In section | Focus in section | Role | Doc series |
---|---|---|---|---|
1) Infra:AIFactory | Y | - | CoreTeam | 10-19 |
2) Datalake template | Y | - | All | 20-29,30-39 |
3) Templates for: DataOps, MLOps, *LLMOps | Y | - | All | 20-29, 30-39 |
4) Accelerators: ESML SDK (Python, PySpark), RAG Chatbot, etc | Y | - | ProjectTeam | 30-39 |
CAF/AI Factory: https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/ready/azure-best-practices/ai-machine-learning-mlops#mlops-at-organizational-scale-ai-factories
Microsoft Intelligent Data Platform: https://techcommunity.microsoft.com/t5/azure-data-blog/microsoft-and-databricks-deepen-partnership-for-modern-cloud/ba-p/3640280
Modern data architecture with Azure Databricks and Azure Machine Learning: https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/azure-databricks-modern-analytics-architecture
Datalake design: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices
Datamesh: https://martinfowler.com/articles/data-mesh-principles.html
ESML AI Factory:
- Enterprise "cockpit" over ALL your projects & models.
- The state a project is in (Dev, Test, Prod), with a cost dashboard per project/environment.

Date | Category | What | Link |
---|---|---|---|
2024-03 | Automation | Add core team member | 26-add-esml-coreteam-member.ps1 |
2024-03 | Automation | Add project member | 26-add-esml-project-member.ps1 |
2024-03 | Tutorial | Core-team tutorial | 10-AIFactory-infra-subscription-resourceproviders.md |
2024-03 | Tutorial | End-user tutorial | 01-jumphost-vm-bastion-access.md |
2024-03 | Tutorial | End-user tutorial | 03-use_cases-where_to_start.md |
2024-02 | Tutorial | End-user installation Compute Instance | R01-install-azureml-sdk-v1+v2.m |
2024-02 | Datalake - Onboarding | Auto-ACL on PROJECT folder in lake | - |
2023-03 | Networking | No Public IP: Virtual private cloud - updated networking rules | https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-secure-workspace-vnet?view=azureml-api-1&preserve-view=true&tabs=required%2Cpe%2Ccli |
2023-02 | ESML Pipeline templates | Azure Databricks: Training and Batch pipeline templates. 100% the same support as the AML pipeline templates (inner/outer loop MLOps) | - |
2022-08 | ESML infra (IaC) | Bicep now supports YAML as well | - |
2022-10 | ESML MLOps | ESML MLOps v3 advanced mode, support for Spark steps (Databricks notebooks / DatabricksStep) | - |
When innovating with AI and machine learning, multiple voices expressed the need for an Enterprise Scale AI & Machine Learning Platform with end-to-end turnkey DataOps and MLOps.
Other requirements were an enterprise datalake design, able to share refined data across the organization, and high security and robustness: generally available technology only, and vNet support for pipelines and data with private endpoints. A secure platform, with a factory approach to building models.
Even if best practices exist, it can be time-consuming and complex to set up such an AI Factory solution. When designing an analytical solution, a private solution without public internet access is often desired, since working with production data from day one is common, e.g. already in the R&D phase. Cyber security around this is important.
Challenge 1: Marry multiple (4) best practices
Challenge 2: Dev, Test, Prod Azure environments/Azure subscriptions
Challenge 3: Turnkey Datalake, DataOps, INNER & OUTER LOOP MLOps
Also, the full solution should be provisioned 100% via infrastructure-as-code, so that it can be recreated and scaled across multiple Azure subscriptions, and it should be project-based, to scale up to 250 projects, each with its own set of services, such as its own Azure Machine Learning workspace and compute clusters.

To meet the requirements and challenges, multiple best practices needed to be married and implemented, such as CAF/WAF, MLOps, Datalake design, AI Factory, and Microsoft Intelligent Data Platform / Modern Data Architecture.
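Giving each of up to 250 projects its own private networking means handing out non-overlapping address ranges per project. The contributors section credits an ESML IP calculator for this; the sketch below is not that tool, only an illustration of the idea with Python's standard ipaddress module, assuming a hypothetical 10.0.0.0/16 address space split into one /24 per project (a /16 holds 256 /24s). The project names "prj001" etc. are hypothetical.

```python
import ipaddress


def allocate_project_subnets(address_space: str, new_prefix: int, project_count: int) -> dict:
    """Carve one non-overlapping subnet per project out of a larger address space.

    Illustrative only: project names and address ranges are hypothetical.
    """
    network = ipaddress.ip_network(address_space)
    candidates = network.subnets(new_prefix=new_prefix)  # yields subnets in address order
    allocation = {}
    for i in range(1, project_count + 1):
        subnet = next(candidates, None)
        if subnet is None:
            raise ValueError(
                f"{address_space} has room for only {i - 1} /{new_prefix} subnets, "
                f"not {project_count}"
            )
        allocation[f"prj{i:03d}"] = subnet
    return allocation


subnets = allocate_project_subnets("10.0.0.0/16", 24, 250)
print(subnets["prj001"], subnets["prj250"])  # 10.0.0.0/24 10.0.249.0/24
```

Allocating in address order keeps the mapping deterministic, so re-running the infrastructure-as-code yields the same ranges for the same project numbers.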
An open-source initiative could address all of this at once: the open-source accelerator Enterprise Scale ML (ESML), to get an AI Factory on Azure.
ESML provides an AI Factory quicker (within 4-40 hours), with 1-250 ESMLProjects. An ESML Project is a set of Azure services glued together securely.
Challenge 1 solved: Marry multiple (4) best practices
Challenge 2 solved: Dev, Test, Prod Azure environments/Azure subscriptions
Challenge 3 solved: Turnkey Datalake, DataOps, INNER & OUTER LOOP MLOps
ESML marries multiple best practices into one solution accelerator, with 100% infrastructure-as-code.

Project team in the ESML AI Factory: The Azure DevOps/Bicep provisioning can optionally integrate with an ITSM system, as a "ticket" in ServiceNow/Remedy/JIRA Service Desk. The following information is needed for the ESML provisioning:
Based on this reference architecture: https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/azure-databricks-modern-analytics-architecture
This repository is a push-only mirror. Ping Joakim Åström for contributions / ideas.
Because of the mirror-only design, pull requests are not possible, except for ESML admins. See the LICENCE file (open source, MIT license).
Speaking of open source, thanks to the contributors:
- Kim Berg and Ben Kooijman, for contributing! (kudos for the ESML IP calculator and the Bicep for esml-project)
- Christofer Högvall, for contributing! (kudos for the PowerShell script that enables resource providers, if not already registered: azure-enterprise-scale-ml\environment_setup\aifactory\bicep\esml-util\26-enable-resource-providers.ps1)
ESML Controlplane SDK?
- Templates for both Azure Data Factory and Azure Machine Learning pipelines: ESML auto-generates Azure ML Pipelines.
- 1-click: a new ESMLProject in Azure DevOps, with services glued together with private endpoints (network & identity).
- INNER and OUTER LOOP MLOps (can talk across Dev, Test, Prod Azure ML workspaces): the ESML controlplane can compare scoring from a model in the DEV workspace with the TEST workspace, and register the model in an external workspace (also with network security: vNets & private endpoints, NSGs, firewall).
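The controlplane's actual comparison logic is not shown in this README. As a rough sketch of the idea, assuming each model's test-set metrics are available as a plain dict, a promotion decision between a DEV model and the current TEST model can combine several metrics with user-chosen weights (the model settings further down also mention WEIGHTS on compare_metrics). The function names, metric names, and numbers are hypothetical.

```python
def weighted_score(metrics: dict, weights: dict, higher_is_better: dict) -> float:
    """Combine several metrics into one number using user-chosen weights.

    Metrics where lower is better (e.g. RMSE) are negated, so a larger
    combined score always means a better model.
    """
    total = 0.0
    for name, weight in weights.items():
        value = metrics[name]
        total += weight * (value if higher_is_better[name] else -value)
    return total


def should_promote(candidate: dict, incumbent: dict, weights: dict, higher_is_better: dict) -> bool:
    """Promote the candidate (e.g. trained in DEV) only if it beats the incumbent (e.g. in TEST)."""
    return weighted_score(candidate, weights, higher_is_better) > weighted_score(
        incumbent, weights, higher_is_better
    )


# Hypothetical test-set scores for a newly trained DEV model and the model currently in TEST.
dev_model = {"r2": 0.82, "rmse": 4.1}
test_model = {"r2": 0.80, "rmse": 4.5}
weights = {"r2": 1.0, "rmse": 0.1}
direction = {"r2": True, "rmse": False}
print(should_promote(dev_model, test_model, weights, direction))  # True
```

A plain weighted sum is sensitive to metric scale (RMSE here is roughly 5x larger than R2), so the weights have to account for that; normalizing metrics first is an alternative design.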
Q1: I want to use Azure AutoML, with MLOps ready to be turned ON, and with a datalake design automatically generated for me, including the BRONZE, SILVER, GOLD concept.
- AutoLake™ for Azure ML Studio.
- 22 DEMO notebooks.
- End-to-end MLOps, with Azure ML Pipelines, using Azure Data Lake Gen2 all the way, from Azure Data Factory into Azure ML Pipelines/Datasets.

Q2: I want to do ML, but only the R&D phase: I don't need MLOps or DEV, TEST, PROD environments. Can I still get the benefits of ESML, such as a quick DEV environment and AutoLake?
- Quick setup: You can set up ESML for 1 environment only (use the same subscriptionID for all 3). Copy the settings & notebook_demo folders (no need to copy the MLOPS folder).
- R&D Mode: Run the ESML SDK with ESMLProject.rnd=True, and dataset versioning will be turned off, but you still get an AutoLake with the bronze, silver, gold concept.

Q3: I want to do ML, but NOT AutoML, just scikit-learn, my own model. Can I still leverage ESML, besides the training step?
Yes: you can use AutoLake and other ESML accelerators, even though ESML takes an AutoML-first approach.
Q: How was this accelerator born, and what is it based on? Is this for me?
A: Working with multiple enterprise customers (aviation, manufacturing, space, energy, and retail industries), we noticed common, non-industry-specific challenges in scaling across projects, which ESML solves: organizational scalability.
ESML:
- extends Azure Machine Learning via accelerators, and is organization-agnostic, thanks to the project/team concept in ESML.
- accelerates data refinement/datalake/machine learning, to build faster.
- provides enterprise-grade solution design & scalability (dev, test, prod environments) across subscriptions. Note: You can use this for any enterprise-grade solution that needs a single- or multi-subscription setup, whether you have an enterprise datalake need, a DEV-only need, or a DEV->TEST->PROD need.
- is based on best practices and customer-proven practices.
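Whether a factory is DEV-only or DEV->TEST->PROD comes down to how environments map to Azure subscriptions; the quick setup in Q2 simply uses the same subscriptionID for all 3. A minimal sketch of that mapping, assuming subscriptions are identified by plain strings; build_topology, describe, and the placeholder IDs are hypothetical, not ESML settings or functions.

```python
from typing import Optional

ENVIRONMENTS = ("dev", "test", "prod")


def build_topology(dev: str, test: Optional[str] = None, prod: Optional[str] = None) -> dict:
    """Map each environment to an Azure subscription ID (hypothetical helper).

    Leaving out test/prod reuses the DEV subscription for them, i.e. the
    "same subscriptionID for all 3" setup for a DEV-only / R&D factory.
    """
    return {"dev": dev, "test": test or dev, "prod": prod or dev}


def describe(topology: dict) -> str:
    """Summarize whether the factory is single- or multi-subscription."""
    distinct = len({topology[env] for env in ENVIRONMENTS})
    if distinct == 1:
        return "single subscription (DEV only)"
    return f"{distinct} subscriptions: " + " -> ".join(env.upper() for env in ENVIRONMENTS)


# Placeholder IDs, not real subscriptions.
print(describe(build_topology("sub-dev")))
# single subscription (DEV only)
print(describe(build_topology("sub-dev", "sub-test", "sub-prod")))
# 3 subscriptions: DEV -> TEST -> PROD
```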
Q6 ESML AI Factory: Can I just use the Azure ML SDK directly, instead of the ESML SDK?
Azure certificates are listed as good to have in the backpack.
ESML has MLOps embedded, and adds NEW concepts to enrich Azure ML Studio:
- Enterprise CONCEPTS (Project/Model/Dev_Test_Prod), able to scale across Azure subscriptions in DEV, TEST, PROD for a model.
- Accelerators for data refinement, with the CONCEPTS Bronze, Silver, Gold, able to share refined data ACROSS projects & models.
- Automatically generates Azure ML pipelines of 7 types, with the data model IN->Bronze->Silver->Gold (we will refer to this as IN_2_GOLD).
- Accelerators for ML CONCEPTS such as SCORE vs INFERENCE, ESMLPipelineFactory (auto-creates pipelines), and Auto-Split to TRAIN, VALIDATE, TEST (auto-registered).
- Marries MLOps with AutoML: you get a working MLOps template with support for Azure AutoML.
- You don't need to remember folder paths, thanks to the ESML Datalake design and automapping of Azure ML Datasets, if you work with the ESML SDK (Python, PySpark):
   - p.split_to_gold()
   - esmldataset.Bronze.Save(dataframe_state): the Bronze dataset will be created, and a new version is created for you (unless p.rnd=True).
   - 2 lines of code!! (This is possible due to the 4 ingredients in ESML.)
- The auto-generated pipelines can be used as-is, but you probably :) want to add your data wrangling per IN_TO_SILVER step, in the 1-M auto-generated ds_name_by_config.py scripts.
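The reason you don't need to remember folder paths is that the lake follows a convention, so code can derive each path instead of hard-coding it. A minimal sketch of that idea, assuming a hypothetical projects/<project>/<model>/<stage>/<dataset> layout; ESML's real folder structure and naming are not reproduced here and may differ. The stage order follows the IN_2_GOLD data model above.

```python
from pathlib import PurePosixPath

# Refinement stages in order, following the IN_2_GOLD data model.
STAGES = ("in", "bronze", "silver", "gold")


def lake_path(project: str, model: str, stage: str, dataset: str) -> str:
    """Build a lake path from project/model/stage/dataset (hypothetical layout)."""
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {STAGES}, got {stage!r}")
    return str(PurePosixPath("projects", project, model, stage, dataset))


def next_stage(stage: str):
    """Return the stage after `stage`, or None once the data is GOLD."""
    i = STAGES.index(stage)
    return STAGES[i + 1] if i + 1 < len(STAGES) else None


print(lake_path("project001", "model01", "bronze", "ds01"))
# projects/project001/model01/bronze/ds01
print(next_stage("silver"))  # gold
```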
Azure ML is great, and ESML improves on it: pipeline creation takes 90% fewer lines of code compared to plain Azure ML (https://azure.microsoft.com/en-us/services/machine-learning/#features).
I love being asked to push the boundaries, and requests were dropping in from multiple places:
- Batch scoring pipeline.
- DeltaLake on Azure Data Lake Gen2, and Azure ML pipelines with an Azure Data Lake Gen2 datastore.
- Automapping of data to Azure ML Datasets: only possible due to the ESML datalake design.
- MLOps: for example, compare a model in the DEV subscription with the TEST subscription.
- TEST_SET scoring to Azure ML Studio, as TAGS: automatically calculate TEST-SET scoring with 1 line of code (works for classification or regression), and have it TAGGED on the Azure ML Dataset GOLD_TEST and also on the Model.
- Scoring Drift / Concept Drift, to promote a newly trained model (also as a step in the ESML MLOps pipeline).

Settings
The enterprise settings (dev, test, prod) are usually set & decided once, by an enterprise architect, and all ESML Projects inherit them, but can also override them if a use case requires it.
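The inherit-and-override rule can be sketched as a recursive dictionary merge. This is a minimal illustration, assuming settings are nested dicts; ESML's actual settings files and keys are not shown here, and every key and value below is hypothetical.

```python
import copy


def merge_settings(enterprise: dict, project_overrides: dict) -> dict:
    """Return project settings: enterprise defaults, overridden where the project says so.

    Nested dicts are merged recursively; any other override value replaces the default.
    The enterprise dict itself is never modified.
    """
    merged = copy.deepcopy(enterprise)
    for key, value in project_overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical keys, for illustration only.
enterprise_settings = {
    "location": "westeurope",
    "compute": {"max_nodes": 4, "vm_size": "Standard_DS3_v2"},
}
project_settings = merge_settings(enterprise_settings, {"compute": {"max_nodes": 8}})
print(project_settings["compute"])  # {'max_nodes': 8, 'vm_size': 'Standard_DS3_v2'}
```

A recursive merge lets a project override one nested value without restating the rest of the enterprise block.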
Besides the enterprise settings, each project sets its project-specific settings.

Model settings, with your weights, decide whether the last registered model = best:
- compare_metrics that YOU control, and can also put WEIGHTS on.
- WEIGHTS when comparing scoring for model A and B, to see if we want to promote model A.

Private AKS cluster attached to Azure ML (BICEP) ... to the ESML Project's keyvault.
The AKS default in the enterprise settings is a Dev_Test cluster (1-node AKS cluster); for a TEST or PROD environment, an autoscale cluster is used, as decided by the ESML core team.
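The per-environment AKS choice can be expressed as a small lookup. A sketch only: the dictionary keys and the autoscale bounds (3 to 10 nodes) are hypothetical placeholders, not ESML's actual settings; only the split between a 1-node Dev_Test cluster and an autoscale cluster for TEST/PROD comes from the text.

```python
def aks_profile(environment: str) -> dict:
    """Return a hypothetical AKS cluster profile for an ESML environment.

    DEV: a single-node Dev_Test cluster. TEST/PROD: an autoscaling cluster.
    Keys and node bounds are placeholders chosen for illustration.
    """
    env = environment.strip().lower()
    if env == "dev":
        return {"purpose": "dev_test", "autoscale": False, "node_count": 1}
    if env in ("test", "prod"):
        return {"purpose": "production", "autoscale": True, "min_nodes": 3, "max_nodes": 10}
    raise ValueError(f"unknown environment: {environment!r}")


for env in ("dev", "test", "prod"):
    print(env, aks_profile(env))
```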
As with the other enterprise settings, these are usually set & decided once, by an enterprise architect in the ESML core team, and all ESML Projects inherit them, but can also override them if a use case requires it.