romeokienzler opened this issue 3 years ago
Besides this, the following factors are interesting:
The purpose of this book is to provide an end-to-end open source architecture for small and large enterprises. This architecture is modular, meaning the components can be replaced by commercial offerings where needed, e.g. using a commercial ETL tool instead of Kafka or SparkSQL.
My idea is to propose S3 as the storage technology/interface, as all major cloud providers offer S3-compatible storage, and with Ceph we have an open source option for on-prem or self-managed S3 storage.
Hopefully the tools for data ingestion and for building a SQL layer on top of S3 are also capable of working with alternatives, e.g. HDFS, NFS, or whatever, but I'd rather not make this a hard requirement.
@matz3ibm do you have experience with Apache Drill? It seems to be a perfect fit as a SQL interface to an S3 object store, even providing a JDBC connector, and it seems to work nicely with Apache Superset.
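For anyone evaluating this: Drill reaches S3 through a storage plugin definition. Below is a rough sketch of such a plugin config; the bucket name, keys, and endpoint are placeholders, and the `fs.s3a.endpoint` setting is what would point Drill at a Ceph/MinIO endpoint instead of AWS.

```json
{
  "type": "file",
  "connection": "s3a://example-bucket",
  "config": {
    "fs.s3a.access.key": "ACCESS_KEY_PLACEHOLDER",
    "fs.s3a.secret.key": "SECRET_KEY_PLACEHOLDER",
    "fs.s3a.endpoint": "s3.example.com"
  },
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false
    }
  },
  "formats": {
    "parquet": {
      "type": "parquet"
    }
  }
}
```

Once a plugin like this is enabled (here hypothetically named `s3`), files become queryable as tables, e.g. `SELECT * FROM s3.root.` followed by a backticked object path, over JDBC or from Superset.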
Hi Romeo,
no, never used Drill.
matz3 is my username in the matrix.
Can't find Samaya (smadhava)...
regards Matthias

On Wednesday, 22 July 2020 at 05:14:23 CEST, Romeo Kienzler notifications@github.com wrote:
@matz3ibm do you have experience with Apache Drill? It seem to perfectly fit a SQL interface to S3 Object Store even providing a JDBC connector and seems to work nicely with Apache Superset
Just joined the Red Hat Open Data Hub community call yesterday. They are using the Apache SparkSQL Thrift Server instead of Apache Drill; that sounds more appealing to me. Will give it a go...
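As a note for anyone wiring Superset to the Thrift server: Superset connects through a SQLAlchemy URI (via PyHive, the `hive://` scheme). A minimal sketch of building that URI, with a placeholder hostname; the default Thrift port 10000 and database name are assumptions to adjust for your deployment:

```python
# Hypothetical helper: builds the SQLAlchemy URI that Superset expects
# for a HiveServer2/Spark Thrift endpoint. Host, port, and database
# are placeholders for illustration.
def thrift_sqlalchemy_uri(host: str, port: int = 10000, database: str = "default") -> str:
    """Return a PyHive-style SQLAlchemy URI for a Thrift server."""
    return f"hive://{host}:{port}/{database}"

uri = thrift_sqlalchemy_uri("spark-thrift.example.com")
print(uri)  # hive://spark-thrift.example.com:10000/default
```

In Superset this string would go into the database connection form as the SQLAlchemy URI.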
is this thread still active?
Comment regarding sustainability factors of a data lake: data lake or reservoir? The question is, e.g., how modular my lake infrastructure was/is built. Can I take out/export all data quality rules and put them on another system? Not if the grammar used is not standardized (e.g. Java regex vs. Perl regex), or if scripting for ingestion is not well separated from data filtering because the ingestion stages are not well defined. Another point: can I extend/move my reservoir to the cloud and run it as a hybrid reservoir? A "sustainable" point of a handbook could be to catch customers exactly at the knowledge stage they are at and show them with which methods to continue from that point, e.g. building a data reservoir as an extension to a working, but legacy-technology-based, data warehouse architecture. I'm actually facing this problem.
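The exportability point can be made concrete with a small sketch: if data-quality rules are kept as declarative data rather than baked into one engine's scripting language, they can be serialized and moved to another system. The rule names and fields below are made up for illustration.

```python
import json
import re

# Rules as plain data, not engine-specific scripts (illustrative only).
RULES = [
    {"field": "email", "type": "regex", "pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
    {"field": "age", "type": "range", "min": 0, "max": 130},
]

def check(record: dict, rules: list) -> list:
    """Return the names of fields that violate a rule."""
    violations = []
    for rule in rules:
        value = record.get(rule["field"])
        if rule["type"] == "regex":
            if not re.match(rule["pattern"], str(value)):
                violations.append(rule["field"])
        elif rule["type"] == "range":
            if not (rule["min"] <= value <= rule["max"]):
                violations.append(rule["field"])
    return violations

# Because the rules are plain data, exporting them to another system
# is a serialization step, not a rewrite.
exported = json.dumps(RULES)

print(check({"email": "a@b.co", "age": 200}, RULES))  # ['age']
```

The regex-dialect caveat from the comment still applies, of course: the pattern above is Python `re` syntax, and a target system would need a compatible dialect.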
Dear @matz3ibm - yes - thanks for checking in - I was busy getting the Apache SparkSQL Thrift service running - or, to be more precise, getting access to it via Apache Superset, as the jdbc:hive driver doesn't get installed for some reason, anyway..
I've sketched the idea of a hybrid data lake here https://youtu.be/prB15wjmQxc
So regarding storage, I'm proposing to use S3 on the cloud and on-prem. For on-prem you can use Ceph or MinIO, for example.
But after reading your comment, we definitely need to address clients where they currently are, and the majority, IMHO, is on legacy DB, DWH, ETL, and BI technology on-prem. So I guess a nice transition scenario is to make use of the legacy systems by accessing them via ODBC/JDBC and also to bring the data lake into the BI tool using a JDBC layer on top of S3.
Regarding your comment on ingestion/filtering, I'm proposing to use a workflow engine like Kubeflow, Argo, or Airflow (currently my favorite is Kubeflow) and to make ingestion and filtering part of individual pipeline stages.
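The separation can be sketched in a few lines: ingestion and filtering as distinct, composable stage functions, in the spirit of Kubeflow/Argo/Airflow steps. The stage names and record shape are illustrative only.

```python
# Sketch: ingestion and filtering as separate, replaceable pipeline
# stages rather than one entangled script (names are illustrative).
def ingest(source: list) -> list:
    """Stage 1: pull raw records from a source (here, an in-memory list)."""
    return list(source)

def filter_stage(records: list) -> list:
    """Stage 2: drop incomplete records; kept separate from ingestion."""
    return [r for r in records if r.get("value") is not None]

def run_pipeline(source: list, stages) -> list:
    """Run the stages in order; each stage can be swapped independently."""
    data = source
    for stage in stages:
        data = stage(data)
    return data

raw = [{"value": 1}, {"value": None}, {"value": 3}]
print(run_pipeline(raw, [ingest, filter_stage]))  # [{'value': 1}, {'value': 3}]
```

In a real workflow engine each function would be its own pipeline step, so the boundary between ingestion and filtering is defined by the engine rather than by discipline inside one script.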
We're proposing to use Elyra-AI to design the Kubeflow pipeline, which currently can be composed of Jupyter notebooks and Python scripts.
A Kubeflow pipeline can consist of ANYTHING that runs on Kubernetes, including Spark.
This way the pipelines are a) portable, b) versioned, c) repeatable, and d) auditable.
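As an illustration of the "anything that runs on Kubernetes" point, a two-stage ingest/filter pipeline could be sketched as an Argo Workflow roughly like this; all image names, step names, and commands are placeholders, not part of the actual proposal:

```yaml
# Hypothetical two-stage Argo Workflow; images and commands are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: datalake-ingest-
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      steps:
        - - name: ingest
            template: ingest
        - - name: filter
            template: filter
    - name: ingest
      container:
        image: registry.example.com/ingest:latest   # placeholder image
        command: [python, ingest.py]
    - name: filter
      container:
        image: registry.example.com/filter:latest   # placeholder image
        command: [python, filter.py]
```

Each step is just a container, so a stage could equally be a Spark job, a notebook executed via Elyra, or anything else packaged as an image.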
Shall we jump on a call? I'm flexible...anytime....
Creating an open source data lake with SQL access is key to the reference architecture of this book.
Please adjust the TOC for the data lake part so it fits your needs. Please propose a reference architecture for the data lake.