jostrm / azure-enterprise-scale-ml

Enterprise Scale AIFactory (esml) - on Azure
MIT License

Delta Lake support? #1

Open baatch opened 3 years ago

baatch commented 3 years ago

Does/will this support Delta Lake?

jostrm commented 3 years ago

Hi, thanks for asking @baatch. [UPDATE: 2021-09-04] A: ESML follows the Azure Storage roadmap, the Azure ML roadmap, and the Azure Data Factory roadmap. Currently we don't support Delta Lake for all 3 services, BUT ESML does support DELTA in its MASTER lake folder structure, since Azure Data Factory now has a connector (public preview). See image:

image

[UPDATE end] Q: Why not also in the PROJECT structure, you might ask? Why .PARQUET in the image for the Azure ML pipeline? A: Since the Azure ML Datastore is the abstraction layer for ESML, when/IF Azure ML supports Delta Lake, ESML will also do that, and if Azure Data Factory supports Delta Lake, ESML will support that for MASTER data.

I'm saying IF, since Delta Lake might or might not be the future. Hence: -ESML has no lock-in to Delta Lake. -The only lock-in is the .parquet format for TABULAR data (no .ORC support) and using Azure services.

Notes & reflections: De facto. Q: Is Delta Lake the de facto standard? How to think about de facto? ...Before Kubernetes won the "docker orchestration war", there were a lot of investments in Docker Swarm, Mesosphere... Before .parquet became popular, there were a lot of formats out there. (No, ESML doesn't support .orc or any other creatures for ESMLDatasets tabular data) ...so only time will tell I guess : )

Q: But we need DELTA LAKE, right, ALL the TIME? Otherwise we have to read ALL of the file, and have no way to partition or index the data? What about performance?

That said, Delta Lake is great, and would improve performance. But ESML follows the principles "Keep it simple" and "YAGNI" (You ain't gonna need it), and for "PROJECT data" (train, score 1 ML model) usually a small amount of data is selected, compared to MASTER data or a DW. Another reason is that ESML has been around since long before Delta Lake existed, hence it was not incorporated from the start.

...But supporting it would make ESML less compatible across services, and more complex, which is not KISS or YAGNI.

baatch commented 3 years ago

@jostrm thanks for the answer.

I don't agree with the Delta Lake points you are making. Delta Lake and other Data Lake storage solutions such as Hudi and Iceberg solve many challenges (more than what you have listed) that a traditional Data Lake has, and they are used in many companies and architectures.

Delta Lake is also supported in Data Factory, Azure Databricks and Azure Synapse Analytics, and is used in Microsoft architectures. So even if it is not the standard, it is a pretty big de facto standard that many companies have adopted.

Not supporting Delta Lake or other formats makes the solution (AutoLake) hard to consider for serious customers (existing or green field).

But the other parts of the solution look fantastic (Enterprise Security, Private Link, Data mesh support, MLOps, etc.) 😄

jostrm commented 3 years ago

@baatch, I think you missed the part of the ESML readme file about MASTER/PROJECT in AutoLake. I added the info here:

[Note: ESML is born from serious customers :) Both green field enterprise customers, and enterprise customers who struggled with lake design (swamp, or islands) and want a clean slate. Sort of "proven practices", rather than best practices.] ...Wanting to avoid "reinventing the wheel"

Also: We don't support Delta Lake in Azure Machine Learning, I'm afraid. ESML is "Enterprise Scale Machine Learning", and the key point is to have a storage solution that works with Azure ML pipelines & Azure ML Datasets. So 1 out of 3 does not cut it, I'm afraid, if we want an enterprise solution (MLOps, etc). We need the full workflow & orchestration.

Sure, you can create pipelines with Azure Data Factory/Azure Synapse Analytics + Databricks. And you can always have a DatabricksStep in an Azure ML pipeline, but then again...a lot of acceleration is lost (DataDrift, ScoringDrift, ML lineage, FPGA, dashboards for interpretability, root cause analysis).

ESML supports the flexibility of AutoML, PythonScriptStep (FPGA), DataDrift, and ScoringDrift, i.e. it supports things that are not supported with Azure Databricks alone, I'm afraid.

If Iceberg and others "solve many" challenges, there might be another root cause (in lake design) or another purpose for the lake (maybe a DW/MASTER data purpose?...similar to Azure Synapse Analytics) ...IMO you should not depend on DeltaLake to train/score a model...or to refine data for your BU's Power BI report, both of which we've been doing for years before DeltaLake. For MASTER data, yes, it makes sense and is great! But for project data/ML projects, it might be overkill. Just my experience.

MORE ABOUT ESML - keep it simple, accelerate, integrate horizontally across services

ESML is an accelerator. And by definition an accelerator has limitations, in order to be able to accelerate. A "sharper point", more "preconfigured", with the downside of not supporting all generic things in the world...but it accelerates and streamlines.

Example: Streamline - to keep it simple

Challenge: Today there are 7 ways/components of passing data in an Azure ML pipeline: PipelineData VS PipelineParameter + DataPath + DataPathComputeBinding VS Dataset + DatasetConsumptionConfig and OutputFileDatasetConfig
...and there is no demo notebook or documentation today on how to get it to work with Azure Data Lake Gen2 (most notebooks use the legacy PipelineData option, with Blob storage as the datastore)
...and no support from Azure Data Factory for passing dynamic paths to a Dataset, other than DataPath, which is not supported for ADLS Gen2 datastores/datasets, hence a "workaround" is needed.

Solution: We only need 2 of these ways (input/output, to be generic).

baatch commented 3 years ago

@jostrm totally understand the points about the accelerator and keeping it simple, and that this is an Azure ML focused solution and maybe not applicable to our Data Platform centric use cases.

Still, if I were a customer with existing investments in Delta Lake (or other formats), it would be weird not to use the same standard format as the Master Lake for project-specific deployments.

And for green field deployments I would have a hard time considering it for the master lake if it does not have Delta Lake (or Hudi, Iceberg) support. Doing serious Data Engineering work (streaming & batch, data quality, validation, etc.) on parquet only is a nightmare 😅

jostrm commented 3 years ago

Yes, you're right. It is intended for Azure Machine Learning (and Azure networking, private links, etc). I have a hard time recommending that a customer who wants to build Azure Machine Learning models put data in something not supported in Azure ML (enterprises often require GA SLAs on tech as well). But for data engineering with Azure Databricks, yes, DeltaLake works great.

What many do, as I stated earlier, is a "virtual ESML MASTER" (an external DeltaLake, which gives "best of breed"). I recommend having your DeltaLake as MASTER (manage that design yourself, mirror the Dev, Test, Prod environments, private links, etc). And if you want to create machine learning models at enterprise scale (organizational scale), then set up ESML and use the PROJECT structure only, to leverage enterprise-grade ML with the 3 landing zones (Dev, Test, Prod) with turnkey MLOps, Lineage, ScoringDrift, etc.

Since DeltaLake does not come with a "design" automatically, you can of course "peek" at how the ESML AutoLake design looks, to avoid reinventing the wheel for the MASTER DeltaLake, and "mirror" the design.

[We usually talk about this setup as a "virtual pointer" to the MASTER folder]
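One way to picture the "virtual pointer" is as a small lookup from environment to the external DeltaLake root that plays the MASTER role. The following is only a minimal sketch under stated assumptions, not how ESML implements it: the storage account names, the `master` container, and the `resolve_master` helper are all hypothetical.

```python
# Hypothetical mapping: one external DeltaLake root per environment.
# The account and container names are placeholders, not real ESML settings.
MASTER_POINTERS = {
    "dev": "abfss://master@examplelakedev.dfs.core.windows.net/delta",
    "test": "abfss://master@examplelaketest.dfs.core.windows.net/delta",
    "prod": "abfss://master@examplelakeprod.dfs.core.windows.net/delta",
}


def resolve_master(environment: str, table: str) -> str:
    """Return the URI of a MASTER table in the external lake for one environment."""
    try:
        root = MASTER_POINTERS[environment.lower()]
    except KeyError:
        # Fail loudly instead of silently falling back to another environment.
        raise ValueError(f"Unknown environment: {environment!r}") from None
    return f"{root}/{table.strip('/')}"
```

With a lookup like this, a project in the Dev landing zone would read MASTER data from the Dev copy of the external lake only, which keeps the three environments mirrored rather than mixed.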

The only constant is change. The same goes for data lake governance. Hence we need multiple environments, and versioning on the lake itself...similar to traditional client/server applications, with "database table version 1.3" to fit the "backend v1.3 code".

It's all about 3 of the 4 ESML ingredients in harmony, I guess (Bicep provisioning + AutoLake + ESML SDK): having them all compatible, for an enterprise solution supporting MLOps & CI/CD.
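To make the "environments plus versioning on the lake" idea concrete, here is a minimal sketch. It assumes a folder convention of `<environment>/<dataset>/v<version>`, which is purely illustrative and not the actual AutoLake layout; `VersionedDataset` and `matches_backend` are hypothetical names.

```python
from dataclasses import dataclass

ENVIRONMENTS = ("dev", "test", "prod")


@dataclass(frozen=True)
class VersionedDataset:
    environment: str  # one of ENVIRONMENTS
    name: str         # dataset name, e.g. "sales"
    version: str      # lake-side schema version, e.g. "1.3"

    def folder(self) -> str:
        """Build the (assumed) lake folder for this dataset version."""
        if self.environment not in ENVIRONMENTS:
            raise ValueError(f"Unknown environment: {self.environment!r}")
        return f"{self.environment}/{self.name}/v{self.version}"


def matches_backend(dataset: VersionedDataset, backend_version: str) -> bool:
    """True when the code version expects exactly this data version."""
    return dataset.version == backend_version
```

The analogy to "database table v1.3 fits backend v1.3" then becomes a simple check that code and data versions agree before a pipeline reads from the folder.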

jostrm commented 3 years ago

Image as example for "Example: Streamline - to keep it simple"

Would of course love it if/when Azure ML Studio datastores support DeltaLake. Then ESML can leverage that also. But as of now, it is not supported.

image

jostrm commented 3 years ago

Hi @baatch, UPDATE: we NOW support the DeltaLake format : ) Thanks for your feedback. It was not that much work to "test through ESML" and update the DOCS for that. Thanks to you we gave it a shot. It worked out great.

See updated architecture slide:

[UPDATE: 2021-09-04] A: ESML follows Azure storage roadmap, Azure ML roadmap, and Azure data factory roadmap. Currently we don't support delta lake for all 3 services, but ESML does support DELTA in its MASTER lake folder structure, since Azure Datafactory has this in public preview. See image:

image [UPDATE end]