soufianeodf / youtube-divolte-kafka-druid-superset

A proof of concept for collecting real-time clickstream data using JavaScript, Divolte Collector, Apache Kafka, Kafka Streams, Apache Druid and Apache Superset.
MIT License

Real-time Clickstream analysis


At the end of the YouTube video attached here, we compare our results with Microsoft Clarity and Google Analytics. The comparison is just for fun, since those platforms are complete products that large companies have built over many years.

YouTube video

Demo

(This demo may become unavailable after some time because of cloud infrastructure costs.)

You can visit the website as a client, and then go to the Apache Superset dashboard to see the results in real time.

Apache Superset dashboard credentials:

username: admin
password: admin

Architecture Diagram

[Image: architecture diagram]

Dashboard

[Image: dashboard screenshot]

Technologies Used

YouTube videos I made on clickstream data collection

Requirements

Getting Started

Clone repository

git clone https://github.com/soufianeodf/youtube-divolte-kafka-druid-superset.git

cd youtube-divolte-kafka-druid-superset

Website

Divolte Collector

You can modify the Divolte Collector config files and adapt them to your needs.

Zookeeper, Apache Kafka and Kafka Manager

You can control all config variables of Zookeeper, Apache Kafka and Kafka Manager from docker-compose.yml.

Kafka Streams

You can modify the Kafka Streams variables in the application.properties file.

Make sure that the Avro file is the same as the one used by the Divolte Collector server.

Don't forget to rebuild the Java .jar after you make any change.
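
One way to check that the two copies of the Avro file have not drifted apart is to compare their parsed JSON, which ignores whitespace and key-order differences. This is a minimal sketch, assuming both files are plain JSON Avro schemas; the helper name schemas_match and any paths you pass to it are hypothetical and not part of this repository:

```python
import json
from pathlib import Path


def schemas_match(path_a, path_b):
    """Return True if two Avro schema files contain the same JSON document.

    Hypothetical helper for illustration; not part of the repository.
    """
    # Parsing first means formatting and object key order do not matter,
    # while the order of entries in the "fields" list is still compared.
    schema_a = json.loads(Path(path_a).read_text(encoding="utf-8"))
    schema_b = json.loads(Path(path_b).read_text(encoding="utf-8"))
    return schema_a == schema_b
```

You could call it with the path of the schema used by Divolte Collector and the path of the copy bundled with the Kafka Streams application, and rebuild the .jar only once it returns True.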

Apache Druid

You can modify the Apache Druid config file if you want.

After running Apache Druid, to filter out payloads whose country value is null, we use the following filter:

{
   "type":"not",
   "field":{
      "type":"selector",
      "dimension":"country",
      "value":null
   }
}
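
To make the effect of the filter concrete, here is a small Python sketch that evaluates a not/selector filter against in-memory rows. It only mimics the semantics for illustration and is not how Druid evaluates filters internally; the function name and the sample rows are made up, and treating a missing key like null is an assumption:

```python
import json

# The same filter as above, parsed from JSON (null becomes None in Python).
COUNTRY_NOT_NULL = json.loads("""
{
   "type": "not",
   "field": {"type": "selector", "dimension": "country", "value": null}
}
""")


def matches(filter_spec, row):
    """Evaluate a tiny subset of Druid-style filters against a dict row."""
    kind = filter_spec["type"]
    if kind == "selector":
        # A missing dimension is treated as null (assumption).
        return row.get(filter_spec["dimension"]) == filter_spec["value"]
    if kind == "not":
        return not matches(filter_spec["field"], row)
    raise ValueError(f"unsupported filter type: {kind}")


# Made-up sample rows, for illustration only.
rows = [
    {"country": "Morocco", "browser": "Firefox"},
    {"country": None, "browser": "Chrome"},
]
kept = [row for row in rows if matches(COUNTRY_NOT_NULL, row)]
print(kept)
```

Only the first row survives, because the selector matches the null country in the second row and the surrounding not inverts that match.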

Apache Superset

superset.sh is the script that sets the username and password of the Apache Superset dashboard, among other things. Make sure you execute it after Apache Superset is up and running.

Apache Superset uses Mapbox under the hood to render maps, so you need to set the Mapbox key in the config file:

MAPBOX_API_KEY = "your_mapbox_token"
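
If you would rather not hard-code the token, one option is to read it from the environment. This is a sketch, assuming the config file is a Python module (as the assignment above suggests) and that an environment variable named MAPBOX_API_KEY is set wherever Superset runs; the environment variable name is an assumption, not something this repository defines:

```python
import os

# Read the token from the environment; fall back to an empty string
# when the variable is not set (assumed variable name).
MAPBOX_API_KEY = os.environ.get("MAPBOX_API_KEY", "")
```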

After running Apache Superset, use a connection URI of the following form to connect to Apache Druid:

druid://<User>:<password>@<Host>:<Port-default-8888>/druid/v2/sql
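
If the username or password contains characters such as @ or :, they need to be percent-encoded before they go into the URI. The helper below is a hypothetical sketch that fills in the template above; the function name and the sample credentials are made up for illustration:

```python
from urllib.parse import quote


def druid_sqlalchemy_uri(user, password, host, port=8888):
    """Build a druid:// URI in the shape shown above (hypothetical helper)."""
    # Percent-encode the credentials so reserved characters stay valid.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"druid://{credentials}@{host}:{port}/druid/v2/sql"


# Made-up credentials and host, for illustration only.
print(druid_sqlalchemy_uri("admin", "p@ss", "localhost"))
# prints druid://admin:p%40ss@localhost:8888/druid/v2/sql
```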

Docker

You need to build your images and push them to your Docker Hub repository, because Docker Swarm assumes that the images are already built and exist in a Docker registry.

Adapt docker-compose.yml to your needs, and then build and push the images to your Docker Hub repository as below:

docker-compose build
docker-compose push

Deploy on DigitalOcean with Ansible

The Ansible project is heavily inspired by pg3io/ansible-do-swarm; shout-out to its author.

The Ansible playbook performs the following tasks:

Playbook Variables

All variables of the playbook can be found in vars.yml.


Run

cd ansible/

ansible-playbook do-swarm.yml -e do_token="<DO TOKEN>"

Troubleshooting

Apache Superset

Issue: Unexpected Exception: name 'basestring' is not defined when invoking ansible2

Solution: run pip uninstall dopy, then pip3 install git+https://github.com/eodgooch/dopy@0.4.0#egg=dopy

Issue: The CSRF session token is missing

Solution: set the property WTF_CSRF_ENABLED = False in the config file.

Website visits simulation with Selenium

In the video, I used a Selenium tool to simulate visits to the website from different browsers, operating systems and countries, as described in the image below, to check whether the clickstream solution we built captures those hits accurately:

disclosure

The Selenium tool that simulates website user visits is private at the moment because it is still in development. It will be made public as soon as it is completed.

License

Licensed under the MIT License.