The following is from https://www.knowledgehut.com/blog/big-data/kafka-vs-spark:
Apache Spark and Apache Kafka can both be used to stream data (instead of loading it all at once). Spark is a standalone solution with the MLlib library for ML tasks on top of the Spark architecture. At first sight this seems better suited for ML than Kafka, since Kafka is more about real-time processing, while Spark allows some latency. And we do not need the fastest processing time possible; we can accept some latency when calculating ML models.
Kafka Streams use cases:
Following are a couple of the many industry use cases where Kafka Streams is being used:
- The New York Times: The New York Times uses Apache Kafka and Kafka Streams to store and distribute, in real time, published content to the various applications and systems that make it available to the readers.
- Pinterest: Pinterest uses Apache Kafka and Kafka Streams at large scale to power the real-time, predictive budgeting system of their advertising infrastructure. With Kafka Streams, spend predictions are more accurate than ever.
- Zalando: As the leading online fashion retailer in Europe, Zalando uses Kafka as an ESB (Enterprise Service Bus), which helps them in transitioning from a monolithic to a microservices architecture. Using Kafka for processing event streams enables their technical team to do near-real-time business intelligence.
- Trivago: Trivago is a global hotel search platform, focused on reshaping the way travellers search for and compare hotels while enabling hotel advertisers to grow their businesses by providing access to a broad audience of travellers via their websites and apps. As of 2017, they offer access to approximately 1.8 million hotels and other accommodations in over 190 countries. They use Kafka, Kafka Connect, and Kafka Streams to enable their developers to access data freely in the company. Kafka Streams powers parts of their analytics pipeline and delivers endless options to explore and operate on the data sources they have at hand.
Broadly, Kafka is suitable for microservices integration use cases and has wider flexibility.
Spark Streaming use cases:
Following are a couple of the many industry use cases where Spark Streaming is being used:
- Booking.com: We are using Spark Streaming for building online machine learning (ML) features that are used in Booking.com for real-time prediction of the behaviour and preferences of our users and of demand for hotels, and to improve processes in customer support.
- Yelp: Yelp's ad platform handles millions of ad requests every day. To generate ad metrics and analytics in real time, they built the ad event tracking and analyzing pipeline on top of Spark Streaming. It allows Yelp to manage a large number of active ad campaigns and greatly reduce over-delivery. It also enables them to share ad metrics with advertisers in a timelier fashion.

Spark Streaming's ever-growing user base consists of household names like Uber, Netflix, and Pinterest.
Interestingly, the final remark is to combine Kafka and Spark. This could be interesting when we want real-time zoom speed for the visualization (a one-second delay would already feel laggy) but can accept some latency for the ML model calculations:
Conclusion
Kafka Streams is still best used in a ‘Kafka -> Kafka’ context, while Spark Streaming could be used for a ‘Kafka -> Database’ or ‘Kafka -> Data science model’ type of context.
However, when these two technologies are connected, they bring complete data collection and processing capabilities together; combined, they are widely used in commercial use cases and occupy significant market share.
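As a rough illustration of the 'Kafka -> Data science model' direction, here is a minimal Spark Structured Streaming sketch that reads from Kafka. The broker address and topic name are hypothetical, and it assumes the spark-sql-kafka connector package is on the classpath:

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming.
# Assumes the spark-sql-kafka-0-10 connector package is available;
# broker address and topic name below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-spark").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
          .option("subscribe", "events")                        # hypothetical topic
          .load())

# Kafka delivers key/value as binary; cast to strings before further processing
events = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Echo to the console; a real job would feed a database or an ML model instead
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```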
Simple start with MLlib in Spark: calculate correlation between rows and columns, apply dimensionality reduction (a sketch follows after the links below).
https://stackoverflow.com/questions/30862662/which-spark-mlib-algorithm-to-use --> https://spark.apache.org/docs/1.4.0/mllib-statistics.html
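A minimal sketch of the correlation part in PySpark, using the newer pyspark.ml.stat.Correlation API (the linked 1.4 docs use the older RDD-based Statistics.corr); the feature vectors are made up:

```python
# Minimal sketch: Pearson correlation matrix over feature vectors with Spark ML.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName("corr-demo").getOrCreate()

# Toy data: each row holds one feature vector
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.0, 3.0]),),
     (Vectors.dense([4.0, 5.0, 0.0]),),
     (Vectors.dense([6.0, 7.0, 8.0]),)],
    ["features"])

# Returns a DataFrame whose first row holds the correlation matrix
pearson = Correlation.corr(df, "features").head()[0]
print(pearson)
```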
Another simple start: k-means (sketched below): https://stackoverflow.com/questions/34061958/apache-spark-mlib https://stackoverflow.com/questions/31996792/using-apache-spark-mlib-libraries-without-spark-console
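A minimal k-means sketch with pyspark.ml.clustering; the toy data and k=2 are made up:

```python
# Minimal sketch: cluster toy 2-D points into k=2 groups with Spark ML's KMeans.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),),
     (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),),
     (Vectors.dense([8.0, 9.0]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(df)
print(model.clusterCenters())   # the two centroids
model.transform(df).show()      # each point with its cluster assignment
```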
Other interesting modules in MLlib:
k-means on Stack Overflow:
Spark ML algorithms are currently wrappers for MLlib algorithms, and the MLlib programming guide has details on specific algorithms.
spark.ml: high-level APIs for ML pipelines
Spark 1.2 introduced a new package called spark.ml, which aims to provide a uniform set of high-level APIs that help users create and tune practical machine learning pipelines. It is currently an alpha component, and we would like to hear back from the community about how it fits real-world use cases and how it could be improved.
Note that we will keep supporting and adding features to spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and expect more features coming. Developers should contribute new algorithms to spark.mllib and can optionally contribute to spark.ml.
See the spark.ml programming guide for more information about this package and for code examples of Transformers and Pipelines: https://spark.apache.org/docs/1.4.0/ml-guide.html
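A minimal spark.ml Pipeline sketch in the spirit of the linked guide (Tokenizer -> HashingTF -> LogisticRegression); the training data is made up:

```python
# Minimal sketch: a text-classification pipeline with spark.ml.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

train = spark.createDataFrame(
    [(0, "a b c d spark", 1.0),
     (1, "b d", 0.0),
     (2, "spark f g h", 1.0),
     (3, "hadoop mapreduce", 0.0)],
    ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)

# The Pipeline chains the Transformers and the Estimator into a single Estimator
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(train)
model.transform(train).select("id", "text", "prediction").show()
```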
BigQuery
BigQuery is a fully-managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a Platform as a Service (PaaS) that supports querying using ANSI SQL. It also has built-in machine learning capabilities.
BigQuery provides external access to Google's Dremel technology, a scalable, interactive ad hoc query system for analysis of nested data. BigQuery requires all requests to be authenticated, supporting a number of Google-proprietary mechanisms as well as OAuth.
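For reference, a minimal sketch of querying BigQuery from Python with the google-cloud-bigquery client library against a public dataset; it assumes credentials are already configured (e.g. via GOOGLE_APPLICATION_CREDENTIALS), since BigQuery authenticates every request:

```python
# Minimal sketch: run an ANSI SQL query on a BigQuery public dataset.
# Assumes authentication is configured, e.g. via GOOGLE_APPLICATION_CREDENTIALS.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(sql).result():
    print(row.name, row.total)
```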
We follow this tutorial: https://ondata.blog/articles/getting-started-apache-spark-pyspark-and-jupyter-in-a-docker-container/. It offers a short guide up to the first query. It relies on a container that simply has Spark and PySpark in a Jupyter notebook, which you can get with
docker pull jupyter/pyspark-notebook
see the Docker page: https://hub.docker.com/r/jupyter/pyspark-notebook/

https://github.com/Learn-Apache-Spark/SparkML is another Docker solution to install Spark and run a tutorial (though it crashed my internet). I could install it on my WSL2 on a separate disk using standalone Docker on Ubuntu (VirtualBox or Docker Desktop would work as well).
Problem with that tutorial: https://github.com/Learn-Apache-Spark/SparkML#license
The license covers only the given tutorial; any other use should be reported. That is why the aim is to get a more recent container with Spark + ML in PySpark + ML on the Spark DB itself, to make the model work without the tutorial container.
The main aim is to have a big data approach that automatically optimizes the queries (using Catalyst in Spark); plain SQL queries can already be done well with PostgreSQL.
Spark SQL queries vs. DataFrame functions (a short comparison sketch follows): https://stackoverflow.com/questions/35222539/spark-sql-queries-vs-dataframe-functions#
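A minimal sketch of the two equivalent styles; both go through the same Catalyst optimizer, so the performance should be the same:

```python
# Minimal sketch: the same aggregation as a Spark SQL query and as DataFrame calls.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-vs-df").getOrCreate()

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 5)], ["key", "value"])
df.createOrReplaceTempView("t")

# Style 1: SQL query string
spark.sql("SELECT key, SUM(value) AS total FROM t GROUP BY key").show()

# Style 2: DataFrame functions; Catalyst compiles both to the same plan
df.groupBy("key").agg(F.sum("value").alias("total")).show()
```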
Alternative:
There is also Apache Presto, which can be faster than Spark for interactive queries since it is a distributed SQL query engine focused on database-style workloads rather than general-purpose processing.
And:
It is unclear whether we should use a big data approach in the end, since big data is about very large sizes (petabytes), which we do not have. Still, the idea would be to show in an image "just what has been found quickly enough" by the big data search algorithm. That would mean: decrease the data quality but increase the speed. Since a big data stack is built to search through trees / hashmaps across parallel workers quickly, simply using a big data architecture with Spark's Catalyst optimizer, which is there to optimize any query, might already be a speed-up. If that does not work, we do not need big data, and we can go back to PostgreSQL 13.