siradam / DataMining_Project


Optimisation: use Big Data architecture to replace relational DB #17

Open · lorenzznerol opened this issue 3 years ago

lorenzznerol commented 3 years ago

We follow this tutorial: https://ondata.blog/articles/getting-started-apache-spark-pyspark-and-jupyter-in-a-docker-container/. It offers a short guide up to the first query. It relies on a container that simply provides Spark and PySpark in a Jupyter notebook, which you can get with docker pull jupyter/pyspark-notebook; see the Docker Hub page: https://hub.docker.com/r/jupyter/pyspark-notebook/
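As a rough idea of what that "first query" looks like inside the jupyter/pyspark-notebook container, here is a minimal sketch; the CSV path and column names are placeholders, not part of the tutorial.

```python
# Minimal sketch of a first PySpark query inside the jupyter/pyspark-notebook
# container; the CSV path and table name are only placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("first-query").getOrCreate()

# Load a small CSV into a DataFrame (path is hypothetical).
df = spark.read.csv("data/trajectories.csv", header=True, inferSchema=True)

df.printSchema()
df.createOrReplaceTempView("trajectories")
spark.sql("SELECT COUNT(*) AS n FROM trajectories").show()

spark.stop()
```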


https://github.com/Learn-Apache-Spark/SparkML

is another Docker solution to install Spark and run a tutorial (though it crashed my internet connection). I could install it on WSL2 on a separate disk using standalone Docker on Ubuntu (VirtualBox or Docker Desktop would work as well).

Problem of that tutorial: https://github.com/Learn-Apache-Spark/SparkML#license

It is licensed only for that particular tutorial; any other use should be reported. That is why the aim is to get a more recent container with Spark plus ML in PySpark (ML running on the Spark data itself), so that the model works without the tutorial container.


The main aim is to have a big data approach that automatically optimizes queries (using Catalyst in Spark); plain SQL queries can already be handled well by PostgreSQL.

Spark SQL queries vs DataFrame functions: https://stackoverflow.com/questions/35222539/spark-sql-queries-vs-dataframe-functions#

Answer 1:

There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference. Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety. Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modifications with every supported language. With HiveContext, they can also be used to expose some functionality that can be inaccessible in other ways (for example UDFs without Spark wrappers).

Answer 2:

The only thing that matters is what kind of underlying algorithm is used for grouping. HashAggregation is more efficient than SortAggregation. SortAggregation sorts the rows and then gathers together the matching rows: O(n log n). HashAggregation creates a HashMap with the grouping columns as keys and the remaining columns as values: O(n). Spark SQL uses HashAggregation where possible (if the data for the values is mutable).

Answer 3:

Spark's Catalyst optimizer should optimize both calls to the same execution plan, so performance should be the same; how you call it is just a matter of style. In practice there is a difference, according to a report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), where SQL outperforms DataFrames for a case where you need GROUPed records with their total COUNTs, SORTed DESCENDING by record name.
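To make the comparison concrete, here is a sketch of the same aggregation written once as Spark SQL and once with DataFrame functions, with explain() to inspect the Catalyst plans; the Parquet path and column name (mmsi) are assumptions for illustration only.

```python
# Sketch comparing a Spark SQL query with the equivalent DataFrame call;
# the data path and the grouping column "mmsi" are only placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()
df = spark.read.parquet("data/trajectories.parquet")  # hypothetical path
df.createOrReplaceTempView("trajectories")

# GROUPed records with their total COUNTs, SORTed DESCENDING (the Hortonworks case):
sql_result = spark.sql("""
    SELECT mmsi, COUNT(*) AS cnt
    FROM trajectories
    GROUP BY mmsi
    ORDER BY cnt DESC
""")

df_result = (df.groupBy("mmsi")
               .agg(F.count("*").alias("cnt"))
               .orderBy(F.desc("cnt")))

# Both go through Catalyst; explain() prints the physical plans for comparison.
sql_result.explain()
df_result.explain()
```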


Alternative:

There is also Apache Presto, which can be faster than Spark since it is focused purely on SQL queries, much like a database engine.

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations.

And:

Presto was originally designed and developed at Facebook for their data analysts to run interactive queries on its large data warehouse in Apache Hadoop. Before Presto, the data analysts at Facebook relied on Apache Hive for running SQL analytics on their multi-petabyte data warehouse.
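To show how Presto would look from our side, here is a minimal sketch of running a query from Python; it assumes the presto-python-client package and a Presto coordinator on localhost, and the catalog, schema and table names are only placeholders.

```python
# Minimal sketch of querying Presto from Python, assuming the
# presto-python-client package (pip install presto-python-client);
# host, catalog, schema and table names are only placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
cur.execute(
    "SELECT mmsi, COUNT(*) AS cnt FROM trajectories "
    "GROUP BY mmsi ORDER BY cnt DESC LIMIT 10"
)
for row in cur.fetchall():
    print(row)
```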

It is unclear whether we should use a big data approach in the end, since big data is about very large sizes (petabytes), which we do not have. Still, the idea would be to show in an image "just what has been found quickly enough" by the big data search algorithm; that would mean decreasing data quality but increasing speed. Since a big data stack is built to search through trees / hashmaps in parallel quickly, simply using a big data architecture with Spark's Catalyst optimizer, which is there to optimize any query, might already be a speed-up. If that does not work, we do not need big data and can go back to PostgreSQL 13.

lorenzznerol commented 3 years ago

The following is from https://www.knowledgehut.com/blog/big-data/kafka-vs-spark:


Apache Spark and Apache Kafka can both be used to stream data (instead of loading everything at once). Spark is a standalone solution with the MLlib library for ML tasks on top of the Spark architecture. At first sight this seems better suited for ML than Kafka, since Kafka is more about real-time processing while Spark allows some latency, and we do not need the fastest possible processing time; we can accept some latency when calculating ML models.


Kafka Streams use cases:

Following are a couple of the many industry use cases where Kafka Streams is being used:

The New York Times: The New York Times uses Apache Kafka and Kafka Streams to store and distribute, in real-time, published content to the various applications and systems that make it available to the readers.
Pinterest: Pinterest uses Apache Kafka and the Kafka Streams at large scale to power the real-time, predictive budgeting system of their advertising infrastructure. With Kafka Streams, spend predictions are more accurate than ever.
Zalando: As the leading online fashion retailer in Europe, Zalando uses Kafka as an ESB (Enterprise Service Bus), which helps us in transitioning from a monolithic to a micro services architecture. Using Kafka for processing event streams enables our technical team to do near-real time business intelligence.
Trivago: Trivago is a global hotel search platform. We are focused on reshaping the way travellers search for and compare hotels while enabling hotel advertisers to grow their businesses by providing access to a broad audience of travellers via our websites and apps. As of 2017, we offer access to approximately 1.8 million hotels and other accommodations in over 190 countries. We use Kafka, Kafka Connect, and Kafka Streams to enable our developers to access data freely in the company. Kafka Streams powers parts of our analytics pipeline and delivers endless options to explore and operate on the data sources we have at hand.

Broadly, Kafka is suitable for microservices integration use cases and has wider flexibility.

Spark Streaming use cases:

Following are a couple of the many industry use cases where Spark Streaming is being used:

Booking.com: We are using Spark Streaming for building online Machine Learning (ML) features that are used in Booking.com for real-time prediction of behaviour and preferences of our users, demand for hotels and improve processes in customer support. 
Yelp: Yelp’s ad platform handles millions of ad requests every day. To generate ad metrics and analytics in real-time, they built the ad event tracking and analyzing pipeline on top of Spark Streaming. It allows Yelp to manage a large number of active ad campaigns and greatly reduce over-delivery. It also enables them to share ad metrics with advertisers in a timelier fashion.
Spark Streaming’s ever-growing user base consists of household names like Uber, Netflix, and Pinterest.

Interestingly, the article's final remark is to combine Kafka and Spark. This could be relevant when we want real-time zoom speed for the visualization (a second would already feel laggy) and can accept some latency for the ML model calculations:

Conclusion

Kafka Streams is still best used in a ‘Kafka -> Kafka’ context, while Spark Streaming could be used for a ‘Kafka -> Database’ or ‘Kafka -> Data science model’ type of context.

Although, when these 2 technologies are connected, they bring complete data collection and processing capabilities together and are widely used in commercialized use cases and occupy significant market share.
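As a sketch of what such a "Kafka -> Spark" combination could look like for us, the following reads positions from a Kafka topic into Spark Structured Streaming; the broker address, topic name and schema are assumptions, and the spark-sql-kafka connector package would have to be on the classpath.

```python
# Sketch of the 'Kafka -> Spark' combination with Structured Streaming;
# broker address, topic name and schema are only placeholders, and the
# spark-sql-kafka connector package must be available to Spark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-spark").getOrCreate()

schema = (StructType()
          .add("mmsi", StringType())
          .add("lat", DoubleType())
          .add("lon", DoubleType())
          .add("ts", TimestampType()))

raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "positions")
            .load())

positions = (raw.selectExpr("CAST(value AS STRING) AS json")
                .select(F.from_json("json", schema).alias("p"))
                .select("p.*"))

# Console sink as a stand-in for a database or an ML scoring step downstream.
query = (positions.writeStream
                  .format("console")
                  .outputMode("append")
                  .start())
query.awaitTermination()
```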

lorenzznerol commented 3 years ago

spark.mllib (MLlib)

Simple start with MLlib in Spark: calculate correlations between rows and columns, apply dimensionality reduction.

https://stackoverflow.com/questions/30862662/which-spark-mlib-algorithm-to-use --> https://spark.apache.org/docs/1.4.0/mllib-statistics.html
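A minimal correlation sketch, using the newer DataFrame-based pyspark.ml API rather than the RDD-based Statistics API from the linked 1.4.0 docs; the feature vectors below are made-up toy data.

```python
# Sketch of a correlation matrix with the DataFrame-based pyspark.ml API;
# the feature vectors are made-up toy data.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName("mllib-correlation").getOrCreate()

data = [(Vectors.dense([1.0, 0.0, 3.0]),),
        (Vectors.dense([2.0, 5.0, 1.0]),),
        (Vectors.dense([4.0, 2.0, 8.0]),)]
df = spark.createDataFrame(data, ["features"])

# Pearson correlation matrix between the feature columns.
pearson = Correlation.corr(df, "features").head()[0]
print(pearson)
```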

Another simple start: k-means. https://stackoverflow.com/questions/34061958/apache-spark-mlib https://stackoverflow.com/questions/31996792/using-apache-spark-mlib-libraries-without-spark-console
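And a k-means sketch with pyspark.ml.clustering.KMeans; the toy vectors and k=2 are only placeholders.

```python
# Sketch of k-means with pyspark.ml.clustering.KMeans; toy data, k=2.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-kmeans").getOrCreate()

data = [(Vectors.dense([0.0, 0.0]),),
        (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),),
        (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)

print(model.clusterCenters())
model.transform(df).show()  # adds a 'prediction' column with the cluster id
```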

Other interesting modules in MLlib:

kmeans on Stack Overflow:

New: spark.ml (but spark.mllib is still the main library)

Spark ML algorithms are currently wrappers for MLlib algorithms, and the MLlib programming guide has details on specific algorithms.

spark.ml: high-level APIs for ML pipelines

Spark 1.2 introduced a new package called spark.ml, which aims to provide a uniform set of high-level APIs that help users create and tune practical machine learning pipelines. It is currently an alpha component, and we would like to hear back from the community about how it fits real-world use cases and how it could be improved.

Note that we will keep supporting and adding features to spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and expect more features coming. Developers should contribute new algorithms to spark.mllib and can optionally contribute to spark.ml.

See the spark.ml programming guide for more information about this package and for code examples of Transformers and Pipelines: https://spark.apache.org/docs/1.4.0/ml-guide.html
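For orientation, here is a minimal pipeline sketch along the lines of the text-classification example in that guide (Transformer -> Transformer -> Estimator); the toy training data is made up.

```python
# Sketch of a spark.ml Pipeline: Tokenizer -> HashingTF -> LogisticRegression;
# the toy training data is made up.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-pipeline").getOrCreate()

training = spark.createDataFrame([
    (0, "spark is fast", 1.0),
    (1, "postgres stores rows", 0.0),
    (2, "spark sql and mllib", 1.0),
    (3, "relational database tables", 0.0),
], ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(training)

test = spark.createDataFrame([(4, "spark mllib pipeline")], ["id", "text"])
model.transform(test).select("id", "text", "prediction").show()
```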

lorenzznerol commented 3 years ago

Spatial data in Spark or GraphDB:

Spatial data management in Apache Spark: the GeoSpark perspective and beyond

https://db-engines.com/en/system/GraphDB%3BH2%3BSpark+SQL

https://stackoverflow.com/questions/68694946/geosparql-support-in-marklogic

lorenzznerol commented 3 years ago

BigQuery

BigQuery is a fully-managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a Platform as a Service (PaaS) that supports querying using ANSI SQL. It also has built-in machine learning capabilities.

BigQuery provides external access to Google's Dremel technology, a scalable, interactive ad hoc query system for analysis of nested data. BigQuery requires all requests to be authenticated, supporting a number of Google-proprietary mechanisms as well as OAuth.

From: https://en.wikipedia.org/wiki/BigQuery
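For a feel of the querying side, here is a minimal sketch with the google-cloud-bigquery Python client; it assumes a configured GCP project and credentials, and the public dataset below is only a stand-in for our own tables.

```python
# Sketch of running a standard-SQL query with the google-cloud-bigquery
# client; assumes GCP credentials in the environment, and the public
# dataset is only a stand-in for our own tables.
from google.cloud import bigquery

client = bigquery.Client()  # picks up project and credentials from the environment

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.name, row.total)
```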