A proof of concept for collecting real-time clickstream data using JavaScript, Divolte Collector, Apache Kafka, Kafka Streams, Apache Druid and Apache Superset.
At the end of the YouTube video attached here, we compare our results with Microsoft Clarity and Google Analytics. The comparison is just for fun, as those platforms are complete products that big companies have been building for years.
You can visit the website as a client, and then go to the Apache Superset dashboard to see the real-time results.
Apache Superset dashboard credentials:
username: admin
password: admin
git clone https://github.com/soufianeodf/youtube-divolte-kafka-druid-superset.git
cd youtube-divolte-kafka-druid-superset
Replace the divolte-ip-address value with the IP address or DNS name of your Divolte server in index.html.
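For reference, the Divolte tag in index.html is a script include pointing at the collector; a minimal sketch, assuming Divolte's default HTTP port 8290 (the host is the placeholder to replace):

<!-- loads the Divolte tracking script from your collector -->
<script src="//divolte-ip-address:8290/divolte.js" defer async></script>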
You can modify the Divolte Collector config files and adapt them to your needs.
You can control all the config variables of Zookeeper, Apache Kafka and Kafka Manager from docker-compose.yml.
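As an illustration of the kind of overrides involved, a sketch assuming wurstmeister-style Kafka images (the service names and variables here are illustrative, not necessarily the repo's exact ones):

kafka:
  image: wurstmeister/kafka
  environment:
    KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181             # where the broker finds Zookeeper
    KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092  # address advertised to clients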
You can modify the Kafka Streams variables in the application.properties file.
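The relevant keys are the standard Kafka Streams configs; a minimal sketch (the values are assumptions, not the repo's exact ones):

# unique id for the Kafka Streams application
application.id=divolte-clickstream-processor
# Kafka broker(s) to connect to
bootstrap.servers=kafka:9092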
Make sure that the Avro schema file is the same as the one you have on the Divolte Collector server.
Don't forget to regenerate the Java .jar after you make any change.
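Assuming the project is built with Maven, the rebuild would be something like:

mvn clean package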
You can modify the Apache Druid config file if you want.
After running Apache Druid, to filter out payloads that have null as the country value, we use the following Druid filter:
{
  "type": "not",
  "field": {
    "type": "selector",
    "dimension": "country",
    "value": null
  }
}
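In a Druid ingestion spec, a filter like this typically sits under the transformSpec; a sketch of the placement with surrounding fields elided, not necessarily how this repo's spec is organized:

{
  "spec": {
    "dataSchema": {
      "transformSpec": {
        "filter": { "type": "not", "field": { "type": "selector", "dimension": "country", "value": null } }
      }
    }
  }
}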
superset.sh is the file responsible for setting the username and password of the Apache Superset dashboard, among other things; make sure you execute it after Apache Superset is up and running.
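For context, a bootstrap script of this kind typically runs the standard Superset CLI commands; a sketch of what to expect, not necessarily the exact contents of superset.sh:

# create the admin account used to log in to the dashboard
superset fab create-admin --username admin --firstname Admin --lastname User --email admin@example.com --password admin
# migrate the metadata database and set up default roles
superset db upgrade
superset init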
In order for Apache Superset to display maps, it uses Mapbox under the hood, so you need to set the Mapbox key in the config file:
MAPBOX_API_KEY = "your_mapbox_token"
After running Apache Superset, use the following SQLAlchemy URI to connect to Apache Druid:
druid://<User>:<password>@<Host>:<Port-default-8888>/druid/v2/sql
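For instance, with no authentication and the Druid router on its default port, the URI reduces to something like this (the host is an assumption):

druid://localhost:8888/druid/v2/sql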
You need to build your images and push them to your Docker Hub repository, because Docker Swarm assumes that the images are already built and exist in a Docker registry.
Adapt docker-compose.yml to your needs, and then build and push the images to your Docker Hub repository as below:
docker-compose build
docker-compose push
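For the push to work, each service in docker-compose.yml needs both a build context and an image name that points at your registry; a sketch with an illustrative username and service:

services:
  divolte:
    build: ./divolte                               # local build context
    image: your-dockerhub-username/divolte:latest  # target pushed by docker-compose push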
The Ansible project is highly inspired by pg3io/ansible-do-swarm, shout-out to them.
The Ansible playbook provisions the servers on DigitalOcean and sets up the Docker Swarm cluster.
All the variables of the playbook can be found in vars.yml.
cd ansible/
ansible-playbook do-swarm.yml -e do_token="<DO TOKEN>"
Issue: Unexpected Exception: name 'basestring' is not defined when invoking Ansible 2. This happens because the released dopy package uses the Python 2-only basestring under Python 3.
Solution:
pip uninstall dopy
pip3 install git+https://github.com/eodgooch/dopy@0.4.0#egg=dopy
Issue: The CSRF session token is missing.
Solution: set the property WTF_CSRF_ENABLED = False in the Superset config file.
In the video, I used a Selenium tool to simulate visits to the website from different browsers, operating systems and countries, as described in the image below, to check whether the clickstream solution we built intercepts those hits accurately:
The Selenium tool that simulates website visits is private at the moment because it's still in development; it will be made public as soon as it's completed.
Licensed under the MIT License.