This is a boilerplate for new Scrapy projects.
The project is a WIP, so expect major changes and additions (mostly the latter). The master branch should be considered always ready to use, with major changes/features introduced in feature branches.
To create and run a new Scrapy project using this boilerplate, you need to:
```shell
cp .env.example .env
cd src/python/src
poetry install
poetry shell
scrapy
```
Alternatively, inside the Docker containers:

```shell
docker compose up -d database python
docker compose exec python bash
cd /var/app/python/src/
poetry shell
scrapy
```
The project includes Dockerfiles and a docker-compose configuration for running your spiders in containers. A configuration for a default RabbitMQ server is also included. Dockerfiles are located inside the `docker` subdirectory, and `docker-compose.yml` is at the root of the project. Docker-compose takes its configuration values from the environment. The environment can also be provided by creating a `.env` file at the root of the project (see `.env.example` for a sample).
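As an illustration, a minimal `.env` might look like the sketch below. `PROXY` and `PROXY_AUTH` are the variables read by the proxy middleware described further down; treat the values as placeholders and consult `.env.example` for the full set of keys.

```shell
# Hypothetical .env sketch — see .env.example for the real set of keys
PROXY=proxy.example.com:8000
PROXY_AUTH=user:password
```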
A Scrapy downloader middleware for using a proxy server is included in `src/middlewares/HttpProxyMiddleware.py` and is enabled by default. To use it, provide the proxy endpoint via the `PROXY` env variable (or in the `.env` file) in the format `host:port`. Proxy authentication can also be provided in the `PROXY_AUTH` variable, using the format `user:password`. If provided, it is encoded as Basic HTTP auth and put into the `Proxy-Authorization` header.
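As a sketch of that encoding (the function name here is hypothetical, not part of the middleware's API), a `user:password` pair becomes a Basic auth header value like this:

```python
import base64


def proxy_authorization_value(proxy_auth: str) -> str:
    """Encode a "user:password" string the way Basic HTTP auth requires:
    base64 over the raw credentials, prefixed with "Basic "."""
    token = base64.b64encode(proxy_auth.encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# A request through the proxy would then carry a header like:
#   Proxy-Authorization: Basic dXNlcjpwYXNzd29yZA==
print(proxy_authorization_value("user:password"))
```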
A single-endpoint proxy is used by default, assuming a rotating proxy service. If you want to provide your own list of proxies, an external package has to be used, as this use case is not yet covered by the boilerplate.
This boilerplate offers a more intuitive alternative to Scrapy's default project structure: the file/directory structure is flattened and slightly re-arranged.
- All code lives in the `src/python/src` subdirectory (without any subdirectories named after the project, contrary to the default layout).
- Scrapy's default single-file modules (`items.py`, `middlewares.py`, `pipelines.py`) are converted into sub-modules, where each class is placed in its own separate file. Nothing else goes into those files.
- `scrapy.cfg` and `settings.py` are edited to correspond with these changes.
- Sub-modules with code related to the database (`src/python/src/database`) and RabbitMQ (`src/python/src/rmq`) are included.
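To illustrate the one-class-per-file convention, a pipeline under this layout might look like the following; the file and class names are invented for the example:

```python
# src/pipelines/sanitize_title.py — hypothetical file; under the flattened
# layout each pipeline class gets its own module inside the pipelines/
# sub-module, and nothing else goes into the file.
class SanitizeTitlePipeline:
    """Strip surrounding whitespace from an item's title field."""

    def process_item(self, item, spider):
        item["title"] = item.get("title", "").strip()
        return item

# src/pipelines/__init__.py would then re-export the class so that
# settings.py can reference it, e.g.:
# from .sanitize_title import SanitizeTitlePipeline
```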