romeokienzler / TheOpenSourceAIHandbook

Create Data Lake content #1

Open romeokienzler opened 3 years ago

romeokienzler commented 3 years ago

Creating an open source data lake with SQL access is key to the reference architecture of this book

Please adjust the TOC for the data lake part so it fits your needs. Please also propose a reference architecture for the data lake.

matz3ibm commented 3 years ago

Interesting factors, in addition to this:

romeokienzler commented 3 years ago

The purpose of this book is to provide an end-to-end open source architecture for small and large enterprises. The architecture is modular, which means that individual components can be replaced by commercial offerings where needed, e.g. using a commercial ETL tool instead of Kafka or SparkSQL.

My idea is to propose S3 as the storage technology/interface, since all major cloud providers offer S3-compatible storage, and with Ceph we have an open source implementation for on-prem or self-managed S3 storage.
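To make the interface aspect concrete, here is a minimal sketch from the application side, assuming an S3-compatible endpoint such as Ceph RGW or MinIO; the endpoint URL, bucket names and credentials are placeholders:

```python
import boto3

# Works against AWS S3, Ceph RGW or MinIO alike, since they all expose
# the S3 API; only endpoint_url and the credentials differ.
s3 = boto3.client(
    "s3",
    endpoint_url="https://ceph-rgw.example.com",  # assumption: your on-prem endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Upload a raw file into the landing zone of the data lake
s3.upload_file("events.parquet", "datalake-raw", "landing/events.parquet")

# List what is in the landing zone
for obj in s3.list_objects_v2(Bucket="datalake-raw", Prefix="landing/").get("Contents", []):
    print(obj["Key"], obj["Size"])
```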

Hopefully the tools for data ingestion and for building a SQL layer on top of S3 are also capable of working with alternatives, e.g. HDFS or NFS, but I'd rather not make that a hard requirement.

romeokienzler commented 3 years ago

@matz3ibm do you have experience with Apache Drill? It seems to be a perfect fit as a SQL interface to an S3 object store, it even provides a JDBC connector and seems to work nicely with Apache Superset.
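To make the idea concrete, here is a rough sketch of how a query against S3 data could look via Drill's REST API, assuming a Drill instance on localhost:8047 with an S3 storage plugin named s3 already configured (plugin name, path and port are assumptions):

```python
import requests

# Assumption: Drill runs locally on its default HTTP port with an
# S3 storage plugin called "s3" pointing at the data lake bucket.
DRILL_URL = "http://localhost:8047/query.json"

query = """
    SELECT *
    FROM s3.`landing/events.parquet`
    LIMIT 10
"""

resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": query})
resp.raise_for_status()

result = resp.json()
print(result["columns"])
for row in result["rows"]:
    print(row)
```

On the Superset side, the sqlalchemy-drill dialect would then provide the SQLAlchemy URI needed to register Drill as a database.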

matz3ibm commented 3 years ago

Hi Romeo,

no, never used Drill.

matz3 is my username in the matrix.

Can't find Samaya (smadhava)...

Regards, Matthias

romeokienzler commented 3 years ago

I just joined the Red Hat Open Data Hub community call yesterday. They are using the Apache SparkSQL Thrift Server rather than Apache Drill, which sounds more appealing to me. Will have a go at it...
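For reference, a minimal sketch of how a client could talk to the Spark Thrift Server once it is up; since the server speaks the HiveServer2 protocol, PyHive works as a client. The host, port, credentials and table definition below are assumptions, and the s3a connector is assumed to be configured on the Spark side:

```python
from pyhive import hive

# Assumption: the Spark Thrift Server was started (e.g. via
# $SPARK_HOME/sbin/start-thriftserver.sh) and listens on the
# default HiveServer2 port 10000.
conn = hive.connect(host="localhost", port=10000, username="admin")
cursor = conn.cursor()

# Register the S3-backed data as a table and query it with plain SQL.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS events
    USING parquet
    LOCATION 's3a://datalake-raw/landing/'
""")
cursor.execute("SELECT count(*) FROM events")
print(cursor.fetchall())
```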

matz3ibm commented 3 years ago

is this thread still active?

A comment regarding the sustainability factors of a data lake (data lake or reservoir?). The question is, for example, how modular my lake infrastructure was/is built:

Can I take out/export all data quality rules and put them on another system? Not if the grammar used is not standardized (e.g. Java regex vs. Perl regex), or if the scripting for ingestion is not cleanly separated from data filtering because the ingestion stages are not well defined.

Can I extend/move my reservoir to the cloud and run it as a hybrid reservoir?

A "sustainable" point of a handbook could be to meet the customer exactly at the knowledge stage they are at and show with which methods to continue from that point, e.g. how to build a data reservoir as an extension of a working, but legacy-technology-based, data warehouse architecture. I am actually facing this problem.

romeokienzler commented 3 years ago

Dear @matz3ibm - yes - thanks for checking in - I was busy getting the Apache SparkSQL Thrift server running, or to be more precise, getting access to it via Apache Superset, as the jdbc:hive driver doesn't get installed for some reason, anyway...

I've sketched the idea of a hybrid data lake here https://youtu.be/prB15wjmQxc

So regarding storage I'm proposing to use S3 both in the cloud and on-prem. For on-prem you can use Ceph or MinIO, for example.

But after reading your comment, we definitely need to meet clients where they currently are, and the majority IMHO is on legacy DB, DWH, ETL and BI technology on-prem. So I guess a nice transition scenario is to make use of the legacy systems by accessing them via ODBC/JDBC and also to bring the data lake into the BI tool using a JDBC layer on top of S3.
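A rough sketch of that transition scenario from the BI tool's point of view, assuming SQLAlchemy-style connections; the hostnames, credentials and the choice of PostgreSQL as the legacy DWH are placeholders:

```python
from sqlalchemy import create_engine, text

# Legacy side: the existing data warehouse, reached via its DBAPI/ODBC/JDBC
# connector. PostgreSQL is just a placeholder for whatever is in place.
legacy_dwh = create_engine("postgresql://report_user:secret@dwh.example.com:5432/sales")

# Data lake side: the Spark Thrift Server exposing S3 data over the
# HiveServer2 protocol (requires the PyHive dialect to be installed).
data_lake = create_engine("hive://thrift-server.example.com:10000/default")

with legacy_dwh.connect() as conn:
    revenue = conn.execute(text("SELECT sum(amount) FROM orders")).scalar()

with data_lake.connect() as conn:
    events = conn.execute(text("SELECT count(*) FROM events")).scalar()

print(revenue, events)
```

In Superset, the same two URIs could simply be registered as two databases, so dashboards can combine legacy DWH tables with data lake tables.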

Regarding your comment on ingestion/filtering, I'm proposing to use a workflow engine like Kubeflow, Argo or Airflow (currently my favorite is Kubeflow) and to make ingestion and filtering individual pipeline stages.
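To illustrate what "ingestion and filtering as individual pipeline stages" could look like, here is a minimal sketch using the Kubeflow Pipelines v1 SDK; the container images, script names and bucket paths are made up for the example:

```python
import kfp
import kfp.dsl as dsl


@dsl.pipeline(name="data-lake-ingest", description="Ingest raw data into S3, then filter it")
def data_lake_pipeline():
    # Stage 1: ingestion only - pull from the source system and land
    # the raw data in the S3 bucket, nothing else.
    ingest = dsl.ContainerOp(
        name="ingest",
        image="example.org/datalake/ingest:latest",   # placeholder image
        command=["python", "ingest.py"],
        arguments=["--target", "s3://datalake-raw/landing/"],
    )

    # Stage 2: filtering/data quality, cleanly separated from ingestion.
    filter_stage = dsl.ContainerOp(
        name="filter",
        image="example.org/datalake/filter:latest",   # placeholder image
        command=["python", "filter.py"],
        arguments=["--source", "s3://datalake-raw/landing/",
                   "--target", "s3://datalake-clean/"],
    )
    filter_stage.after(ingest)


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(data_lake_pipeline, "data_lake_pipeline.yaml")
```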

We're proposing to use Elyra-AI to design the Kubeflow pipeline, which currently can be composed of Jupyter notebooks and Python scripts.

A Kubeflow pipeline can consist of ANYTHING that runs on Kubernetes, including Spark.
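As a sketch of that point, a pipeline stage could simply wrap spark-submit against the in-cluster Kubernetes API and let Spark spin up its own driver and executor pods; the image name, master URL and job file below are placeholders:

```python
import kfp
import kfp.dsl as dsl


@dsl.pipeline(name="spark-on-kfp", description="Run a Spark job as a pipeline stage")
def spark_pipeline():
    # The stage is just a container that calls spark-submit against the
    # cluster-internal Kubernetes API server; Spark then launches its own
    # driver and executor pods next to the pipeline.
    dsl.ContainerOp(
        name="spark-etl",
        image="example.org/datalake/spark:3.0.0",     # placeholder image containing spark-submit and the job
        command=["/opt/spark/bin/spark-submit"],
        arguments=[
            "--master", "k8s://https://kubernetes.default.svc:443",
            "--deploy-mode", "cluster",
            "--name", "datalake-etl",
            "--conf", "spark.kubernetes.container.image=example.org/datalake/spark:3.0.0",
            "local:///opt/spark/jobs/etl_job.py",     # placeholder PySpark job inside the image
        ],
    )


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(spark_pipeline, "spark_pipeline.yaml")
```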

This way the pipelines are a) portable, b) versioned, c) repeatable, and d) auditable.

Shall we jump on a call? I'm flexible...anytime....