Context
All of our scraping configurations live as JSON files in the data/companies folder. Each company has its own scraping config (which we have defined here). However, when we run the program, we need to validate each company's scraping config before scraping begins. Cerberus, a lightweight and easily configurable data validation library, is a good fit: it can validate the syntax/structure of the scraping config JSON so that we are guaranteed to be running a valid scrape (see their usage docs for more info on how it works).
TODO
[x] Look at companies that already have a complete scraping config and get a feel for how it works
[x] Install and add Cerberus as a dependency by adding Cerberus==1.3.5 to the requirements.txt file and running pip3 install -r requirements.txt
[x] Before scraping begins, the set of companies slated to be scraped is already known; create a function / class that validates the scraping configuration of each of those companies. This is as easy as running the Cerberus validator on each JSON file that will be scraped (see this guide on how to read JSON as dict)
[x] If a scraping configuration is invalid for a given company, log the error, skip that company, and continue scraping the remaining companies.
Notes
Some companies have a scrape: null key-value mapping; this means that we have not yet defined a scraping config for that company's job board, but plan to. The validator should allow this.