As part of DiffScraper, one or more bots can be deployed. Ready-to-use bots are provided that extract offers from mobile applications, mobile websites, and desktop websites.
DiffScraper bots need a connection to a central Redis database (to receive jobs from the controller and to synchronize with other bots). In addition, all bots need a connection to a central FTP server (to upload screenshots). Redis and FTP connection details can be set in environment variables as defined in .env.
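The exact contents of .env are project-specific; the sketch below only illustrates the kind of variables involved (APPIUM_HOST is referenced later in this guide, the other variable names are assumptions):

```ini
# Redis job queue -- variable names are assumptions
REDIS_HOST=redis.example.com
REDIS_PORT=6379

# FTP server for screenshot uploads -- variable names are assumptions
FTP_HOST=ftp.example.com
FTP_USER=diffscraper
FTP_PASSWORD=changeme

# Appium server used by mobile application bots (see below)
APPIUM_HOST=192.168.1.50
```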
On every machine where bots will be deployed, the following requirements apply:

- Docker must be available, since all bots run as Docker containers.
- For mobile phone scraping, at least one device must be connected via ADB (adb devices should return at least one device).

DiffScraper Bot can be installed on any environment where Docker containers can be deployed. In this guide, I describe how to deploy bots specialized in desktop web scraping on a Kubernetes cluster. I also describe how to deploy one bot specialized in mobile phone scraping on a dedicated Windows machine.
Deploying a desktop website bot is the easiest option, since no extra dependencies are needed (for example, a headful Chrome browser is already included via the Dockerfile). To deploy desktop website bots on a Kubernetes cluster, run the following command (see webscraper-kubernetes-deployment.yaml):
kubectl --kubeconfig="my-kubeconfig.yaml" apply -f webscraper-kubernetes-deployment.yaml
This command launches a ReplicaSet with two bots, which automatically connect to Redis and start processing scraping jobs.
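The manifest itself is not reproduced in this guide; a rough sketch of what webscraper-kubernetes-deployment.yaml typically contains is shown below (the image name and secret name are assumptions, the actual file may differ):

```yaml
# Sketch of webscraper-kubernetes-deployment.yaml -- actual file may differ
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webscraper-deployment
spec:
  replicas: 2                                # two bots, as described above
  selector:
    matchLabels:
      app: webscraper
  template:
    metadata:
      labels:
        app: webscraper
    spec:
      containers:
        - name: webscraper
          image: diffscraper/webscraper:latest   # hypothetical image name
          envFrom:
            - secretRef:
                name: diffscraper-env            # Redis/FTP settings from .env
```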
To help track scraping errors, the logs of every bot can be centralized in Elasticsearch. A Kubernetes manifest is provided that ships all bot logs to an Elasticsearch server:
kubectl --kubeconfig="my-kubeconfig.yaml" create -f filebeat-kubernetes.yaml
Deploying a mobile application bot on an on-premises machine can be done with the provided docker-compose files. For example, to start a mobile application bot that connects to a real smartphone, use:
docker-compose -f docker-compose.scraper.realdevice.yml up
It is assumed that an Appium server has been started and is listening for connections on the IP address specified by APPIUM_HOST in the .env file.
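The real-device compose file is not reproduced here; the sketch below only illustrates its general shape (the service and image names are assumptions):

```yaml
# Sketch of docker-compose.scraper.realdevice.yml -- actual file may differ
services:
  mobile-scraper:
    image: diffscraper/app-scraper:latest  # hypothetical image name
    env_file: .env                         # supplies APPIUM_HOST and Redis/FTP settings
    restart: unless-stopped
```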
All logs are sent to the Elasticsearch server specified in filebeat.yml.
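The Elasticsearch endpoint in filebeat.yml follows Filebeat's standard output configuration; a minimal sketch (host and credentials are placeholders):

```yaml
# Fragment of filebeat.yml -- host and credentials are placeholders
output.elasticsearch:
  hosts: ["elasticsearch.example.com:9200"]
  username: "filebeat"
  password: "changeme"
```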
In a production system, a bot receives scraping jobs via a central Redis queue from the controller.
To test a (new) bot, it is possible to bypass the job queue and test the bot locally from within its Docker container.
For example, to start scraping offers from the French website of Opodo:
First, open a shell inside a running bot container:
kubectl --kubeconfig="my-kubeconfig.yaml" exec --stdin --tty webscraper-deployment--1 -- /bin/bash
This command will extract all offers from opodo.fr:
ts-node cli.ts scrape OpodoWebScraper inputData.json --lang=fr