romeokienzler / TheOpenSourceAIHandbook

Create Data Lake content #1

Open romeokienzler opened 3 years ago

romeokienzler commented 3 years ago

Creating an open source data lake with SQL access is key to the reference architecture of this book

Please adjust the TOC for the data lake part so it fits your needs. Please also propose a reference architecture for the data lake.

matz3ibm commented 3 years ago

Interesting factors, in addition to this:

romeokienzler commented 3 years ago

The purpose of this book is to provide an end-to-end open source architecture for small and large enterprises. The architecture is modular, which means that individual components can be replaced by commercial offerings where needed, e.g. using a commercial ETL tool instead of Kafka or SparkSQL.

My idea is to propose S3 as the storage technology/interface, since all major cloud providers offer S3-compatible storage, and with Ceph we have an open source implementation for on-prem or self-managed S3 storage.
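To make the interface aspect concrete, here is a minimal sketch from the application side, assuming an S3-compatible endpoint such as Ceph RGW or MinIO; the endpoint URL, bucket names and credentials are placeholders:

```python
import boto3

# Works against AWS S3, Ceph RGW or MinIO alike, since they all expose
# the S3 API; only endpoint_url and the credentials differ.
s3 = boto3.client(
    "s3",
    endpoint_url="https://ceph-rgw.example.com",  # assumption: your on-prem endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Upload a raw file into the landing zone of the data lake
s3.upload_file("events.parquet", "datalake-raw", "landing/events.parquet")

# List what is in the landing zone
for obj in s3.list_objects_v2(Bucket="datalake-raw", Prefix="landing/").get("Contents", []):
    print(obj["Key"], obj["Size"])
```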

Hopefully the tools for data ingestion and for building a SQL layer on top of S3 are also capable of working with alternatives, e.g. HDFS or NFS, but I'd rather not make that a hard requirement.

romeokienzler commented 3 years ago

@matz3ibm do you have experience with Apache Drill? It seems to be a perfect fit as a SQL interface to an S3 object store, it even provides a JDBC connector and seems to work nicely with Apache Superset.
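To make the idea concrete, here is a rough sketch of how a query against S3 data could look via Drill's REST API, assuming a Drill instance on localhost:8047 with an S3 storage plugin named s3 already configured (plugin name, path and port are assumptions):

```python
import requests

# Assumption: Drill runs locally on its default HTTP port with an
# S3 storage plugin called "s3" pointing at the data lake bucket.
DRILL_URL = "http://localhost:8047/query.json"

query = """
    SELECT *
    FROM s3.`landing/events.parquet`
    LIMIT 10
"""

resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": query})
resp.raise_for_status()

result = resp.json()
print(result["columns"])
for row in result["rows"]:
    print(row)
```

On the Superset side, the sqlalchemy-drill dialect would then provide the SQLAlchemy URI needed to register Drill as a database.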

matz3ibm commented 3 years ago

Hi Romeo,

no, never used Drill.

matz3 is my username in the matrix.

Can't find Samaya (smadhava)...

Regards, Matthias

romeokienzler commented 3 years ago

I just joined the Red Hat Open Data Hub community call yesterday. They are using the Apache SparkSQL Thrift Server rather than Apache Drill, which sounds more appealing to me. Will have a go at it...
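For reference, a minimal sketch of how a client could talk to the Spark Thrift Server once it is up; since the server speaks the HiveServer2 protocol, PyHive works as a client. The host, port, credentials and table definition below are assumptions, and the s3a connector is assumed to be configured on the Spark side:

```python
from pyhive import hive

# Assumption: the Spark Thrift Server was started (e.g. via
# $SPARK_HOME/sbin/start-thriftserver.sh) and listens on the
# default HiveServer2 port 10000.
conn = hive.connect(host="localhost", port=10000, username="admin")
cursor = conn.cursor()

# Register the S3-backed data as a table and query it with plain SQL.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS events
    USING parquet
    LOCATION 's3a://datalake-raw/landing/'
""")
cursor.execute("SELECT count(*) FROM events")
print(cursor.fetchall())
```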

matz3ibm commented 3 years ago

is this thread still active?

A comment regarding the sustainability factors of a data lake (data lake or reservoir?). The question is, for example, how modular my lake infrastructure was/is built:

Can I take out/export all data quality rules and put them on another system? Not if the grammar used is not standardized (e.g. Java regex vs. Perl regex), or if the scripting for ingestion is not cleanly separated from data filtering because the ingestion stages are not well defined.

Can I extend/move my reservoir to the cloud and run it as a hybrid reservoir?

A "sustainable" point of a handbook could be to meet the customer exactly at the knowledge stage they are at and show with which methods to continue from that point, e.g. how to build a data reservoir as an extension of a working, but legacy-technology-based, data warehouse architecture. I am actually facing this problem.

romeokienzler commented 3 years ago

Dear @matz3ibm - yes - thanks for checking in - I was busy getting the Apache SparkSQL Thrift server running, or to be more precise, getting access to it via Apache Superset, as the jdbc:hive driver doesn't get installed for some reason, anyway...

I've sketched the idea of a hybrid data lake here https://youtu.be/prB15wjmQxc

So regarding storage I'm proposing to use S3 both in the cloud and on-prem. For on-prem you can use Ceph or MinIO, for example.

But after reading your comment, we definitely need to meet clients where they currently are, and the majority IMHO is on legacy DB, DWH, ETL and BI technology on-prem. So I guess a nice transition scenario is to make use of the legacy systems by accessing them via ODBC/JDBC and also to bring the data lake into the BI tool using a JDBC layer on top of S3.
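A rough sketch of that transition scenario from the BI tool's point of view, assuming SQLAlchemy-style connections; the hostnames, credentials and the choice of PostgreSQL as the legacy DWH are placeholders:

```python
from sqlalchemy import create_engine, text

# Legacy side: the existing data warehouse, reached via its DBAPI/ODBC/JDBC
# connector. PostgreSQL is just a placeholder for whatever is in place.
legacy_dwh = create_engine("postgresql://report_user:secret@dwh.example.com:5432/sales")

# Data lake side: the Spark Thrift Server exposing S3 data over the
# HiveServer2 protocol (requires the PyHive dialect to be installed).
data_lake = create_engine("hive://thrift-server.example.com:10000/default")

with legacy_dwh.connect() as conn:
    revenue = conn.execute(text("SELECT sum(amount) FROM orders")).scalar()

with data_lake.connect() as conn:
    events = conn.execute(text("SELECT count(*) FROM events")).scalar()

print(revenue, events)
```

In Superset, the same two URIs could simply be registered as two databases, so dashboards can combine legacy DWH tables with data lake tables.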

Regarding your comment on ingestion/filtering, I'm proposing to use a workflow engine like Kubeflow, Argo or Airflow (currently my favorite is Kubeflow) and to make ingestion and filtering individual pipeline stages.
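To illustrate what "ingestion and filtering as individual pipeline stages" could look like, here is a minimal sketch using the Kubeflow Pipelines v1 SDK; the container images, script names and bucket paths are made up for the example:

```python
import kfp
import kfp.dsl as dsl


@dsl.pipeline(name="data-lake-ingest", description="Ingest raw data into S3, then filter it")
def data_lake_pipeline():
    # Stage 1: ingestion only - pull from the source system and land
    # the raw data in the S3 bucket, nothing else.
    ingest = dsl.ContainerOp(
        name="ingest",
        image="example.org/datalake/ingest:latest",   # placeholder image
        command=["python", "ingest.py"],
        arguments=["--target", "s3://datalake-raw/landing/"],
    )

    # Stage 2: filtering/data quality, cleanly separated from ingestion.
    filter_stage = dsl.ContainerOp(
        name="filter",
        image="example.org/datalake/filter:latest",   # placeholder image
        command=["python", "filter.py"],
        arguments=["--source", "s3://datalake-raw/landing/",
                   "--target", "s3://datalake-clean/"],
    )
    filter_stage.after(ingest)


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(data_lake_pipeline, "data_lake_pipeline.yaml")
```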

We're proposing to use Elyra-AI to design the Kubeflow pipeline, which currently can be composed of Jupyter notebooks and Python scripts.

A Kubeflow pipeline can consist of ANYTHING that runs on Kubernetes, including Spark.
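As a sketch of that point, a pipeline stage could simply wrap spark-submit against the in-cluster Kubernetes API and let Spark spin up its own driver and executor pods; the image name, master URL and job file below are placeholders:

```python
import kfp
import kfp.dsl as dsl


@dsl.pipeline(name="spark-on-kfp", description="Run a Spark job as a pipeline stage")
def spark_pipeline():
    # The stage is just a container that calls spark-submit against the
    # cluster-internal Kubernetes API server; Spark then launches its own
    # driver and executor pods next to the pipeline.
    dsl.ContainerOp(
        name="spark-etl",
        image="example.org/datalake/spark:3.0.0",     # placeholder image containing spark-submit and the job
        command=["/opt/spark/bin/spark-submit"],
        arguments=[
            "--master", "k8s://https://kubernetes.default.svc:443",
            "--deploy-mode", "cluster",
            "--name", "datalake-etl",
            "--conf", "spark.kubernetes.container.image=example.org/datalake/spark:3.0.0",
            "local:///opt/spark/jobs/etl_job.py",     # placeholder PySpark job inside the image
        ],
    )


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(spark_pipeline, "spark_pipeline.yaml")
```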

This way the pipelines are a) portable, b) versioned, c) repeatable, and d) auditable.

Shall we jump on a call? I'm flexible...anytime....