A relatively simple amazon.com crawler written in Python. It was used to pull over one million products and their images from Amazon in a few hours. [Read more]().
After you get a copy of this codebase pulled down locally (either downloaded as a zip or git cloned), you'll need to install the Python dependencies:

```
pip install -r requirements.txt
```
Then you'll need to go into the settings.py file and update a number of values, including your redis and postgres connection information:
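As a rough illustration, a minimal settings.py might look something like the sketch below. Only max_threads is referenced elsewhere in this README; the other variable names are illustrative placeholders for whatever the file actually defines.

```python
# Hypothetical sketch of settings.py; variable names other than
# max_threads (which the crawler reads) are illustrative placeholders.

# Redis connection (holds the frontier queue of listing URLs)
redis_host = "localhost"
redis_port = 6379

# Postgres connection (stores the scraped product records)
db_host = "localhost"
db_name = "amazon_crawler"
db_user = "crawler"
db_password = "secret"

# Number of crawler threads to spin up
max_threads = 10
```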
Once you've updated all of your connection information, you'll need to run the following at the command line to set up the postgres table that will store the product records:

```
python models.py
```
The fields that are stored for each product are the following:
You start the crawler for the first time by running:

```
python crawler.py start
```
This runs a function that looks at all of the category URLs stored in the start-urls.txt file, and then explodes those out into hundreds of subcategory URLs it finds on the category pages. Each of these subcategory URLs is placed in the redis queue that holds the frontier listing URLs to be crawled.
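A minimal sketch of that seeding step might look like the following, assuming the redis-py, requests, and BeautifulSoup libraries; the queue name and CSS selector are placeholders, not the project's actual values.

```python
import redis
import requests
from bs4 import BeautifulSoup

r = redis.StrictRedis(host="localhost", port=6379)

def seed_queue():
    # Read the hand-curated category URLs.
    with open("start-urls.txt") as f:
        category_urls = [line.strip() for line in f if line.strip()]
    for url in category_urls:
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")
        # Push every subcategory link found on the category page onto
        # the frontier queue; the selector depends on Amazon's markup.
        for link in soup.select("div#leftNav a[href]"):
            r.lpush("listing_url_queue", link["href"])
```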
Then the program spins up the number of threads defined in settings.max_threads, and each one of those threads pops a listing URL from the queue, makes a request to it and then stores the (usually) 10-12 products it finds on the listing page. It also looks for the "next page" URL and puts that in the queue.
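In outline, each worker thread's loop behaves roughly like the sketch below, reusing the redis client r from the sketch above; extract_products, save_product, and find_next_page_url are hypothetical stand-ins for the project's real parsing and storage code.

```python
import threading
import settings  # the project's settings.py, which defines max_threads

def worker():
    while True:
        url = r.rpop("listing_url_queue")  # pop the next frontier URL
        if url is None:
            break  # queue is drained, let the thread exit
        html = requests.get(url.decode()).text
        for product in extract_products(html):  # usually 10-12 per page
            save_product(product)               # insert a row into postgres
        next_page = find_next_page_url(html)    # look for the "next page" link
        if next_page:
            r.lpush("listing_url_queue", next_page)

threads = [threading.Thread(target=worker) for _ in range(settings.max_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```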
If you're restarting the crawler and don't want it to go back to the beginning, you can simply run it with:

```
python crawler.py
```
This will skip the step of populating the URL queue with subcategory links, and assumes that there are already URLs stored in redis from a previous instance of the crawler.
This is convenient for making updates to the crawler or parsing logic that only affect a few pages, without going back to the beginning and redoing all of your previous crawling work.
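One plausible shape for that entry-point logic is sketched below; seed_queue and run_workers are hypothetical names for the seeding and thread-spawning steps described above, and the actual crawler.py may be organized differently.

```python
import sys

if __name__ == "__main__":
    # "python crawler.py start" seeds the frontier queue first;
    # "python crawler.py" resumes from whatever is already in redis.
    if len(sys.argv) > 1 and sys.argv[1] == "start":
        seed_queue()
    run_workers()
```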
If you'd like to redirect the logging output into a logfile for later analysis, run the crawler with:

```
python crawler.py [start] > /var/log/crawler.log
```
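Note that > captures only standard output; if any of the crawler's log lines or tracebacks go to standard error, append 2>&1 to send those to the same file.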
Amazon uses many different styles of markup depending on the category and product type. This crawler focused mostly on the "Music, Movies & Games" category as well as the "Sports & Outdoors" category.
The extractors for finding product listings and their details will likely need to be changed to crawl different categories, or as the site's markup changes over time.
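As a starting point for that kind of change, a listing extractor might be structured like the hedged sketch below; the CSS selectors are placeholders that would need to be matched to the target category's actual markup.

```python
from bs4 import BeautifulSoup

def extract_products(html):
    """Pull each product's title and URL out of a listing page."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select("div.s-result-item"):  # placeholder selector
        title_el = item.select_one("h2 a")
        if title_el is None:
            continue  # skip blocks that don't match the expected markup
        products.append({
            "title": title_el.get_text(strip=True),
            "url": title_el.get("href"),
        })
    return products
```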