scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Starting scrapyd docker container with eggs included #234

Open VanDavv opened 7 years ago

VanDavv commented 7 years ago

Hi, I've been experimenting a little with scrapyd on Docker, and did the following:

At first glance, it looked like it was working (see the attached screenshot),

but when I made a POST to schedule.json, it returned an error:

{"node_name": "295e305bea8e", "status": "error", "message": "Scrapy 1.4.0 - no active project\n\nUnknown command: list\n\nUse \"scrapy\" to see available commands\n"}

I could type anything into the project and spider fields and the result was the same. How can I fix this issue?
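For reference, the request I'm sending is just the standard schedule.json call from the scrapyd docs (the project and spider names below are placeholders):

curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider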

amarynets commented 7 years ago

Hi, as far as I remember this message means that the Scrapy project has not been deployed. I'm not working with Docker, but could you just run the scrapyd command inside the container? If that helps, it means you have a problem with file system permissions. Also, sometimes (I don't know why; I've only seen it on my local machine) if the scrapyd service is stopped and restarted, a project can still show as active on 127.0.0.1:6800, but when you try to run it you get this error, and after restarting the server the list of active projects is empty. (Maybe that's just my limited knowledge.)
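A quick way to check whether scrapyd actually sees a deployed project is the listprojects.json endpoint (standard scrapyd API); if your project isn't in that list, schedule.json will fail:

curl http://localhost:6800/listprojects.json
# expected output is something like {"status": "ok", "projects": ["myproject"]}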

VanDavv commented 7 years ago

I think it's rather that scrapyd, when handling an addversion request, does more than just add the egg file to eggs_dir; it also does some other work that activates the project. I've even seen these functions in the code, but I'm not able to reproduce what they do. I also searched the SQLite database that scrapyd uses for any data about the eggs, but unfortunately there wasn't any, and I'm stuck.
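For context, the step I'm trying to skip is the usual addversion.json upload (standard scrapyd API; the project name, version and egg filename below are just examples), which apparently does more than copy the egg into eggs_dir:

curl http://localhost:6800/addversion.json -F project=myproject -F version=1_0 -F egg=@myproject.egg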

amarynets commented 7 years ago

I'd recommend you use scrapyd-client for the deploy, and run the scrapyd server after deploying.
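For example, with a [deploy] section in scrapy.cfg pointing at your scrapyd instance, deploying should be a single command (the project name below is an example):

scrapyd-deploy
# or name the project explicitly:
scrapyd-deploy -p myproject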

VanDavv commented 7 years ago

scrapyd-client is good on a small scale. I want to have a Docker image with the eggs and the daemon already configured, so that I can launch it right away without using scrapyd-client or the scrapyd API.

Digenis commented 7 years ago

@VanDavv, the project name in "available projects" shouldn't include the version, Python version and egg extension. There's definitely something wrong there. We need more info: your configuration files, the commands you type along with their output, and the logs. Possibly also the Docker image, but unfortunately I don't have the time to dig that far right now.

VanDavv commented 7 years ago

@Digenis here is all the information you requested. Also, the project name in available projects is the same as the name of the egg file stored in eggs_dir.

Dockerfile

FROM python:3.6
MAINTAINER lpilatowski@teonite.com

RUN set -xe \
    && apt-get update \
    && apt-get install -y autoconf \
                          build-essential \
                          curl \
                          git \
                          libffi-dev \
                          libssl-dev \
                          libtool \
                          libxml2 \
                          libxml2-dev \
                          libxslt1.1 \
                          libxslt1-dev \
                          python \
                          python-dev \
                          vim-tiny \
    && apt-get install -y libtiff5 \
                          libtiff5-dev \
                          libfreetype6-dev \
                          libjpeg62-turbo \
                          libjpeg62-turbo-dev \
                          liblcms2-2 \
                          liblcms2-dev \
                          libwebp5 \
                          libwebp-dev \
                          zlib1g \
                          zlib1g-dev \
    && curl -sSL https://bootstrap.pypa.io/get-pip.py | python \
    && pip install git+https://github.com/scrapy/scrapy.git \
                   git+https://github.com/scrapy/scrapyd.git \
                   git+https://github.com/scrapy/scrapyd-client.git \
                   git+https://github.com/scrapinghub/scrapy-splash.git \
                   git+https://github.com/scrapinghub/scrapyrt.git \
                   git+https://github.com/python-pillow/Pillow.git \
    && curl -sSL https://github.com/scrapy/scrapy/raw/master/extras/scrapy_bash_completion -o /etc/bash_completion.d/scrapy_bash_completion \
    && curl -sL https://deb.nodesource.com/setup_6.x | bash - \
    && apt-get install -y nodejs \
    && echo 'source /etc/bash_completion.d/scrapy_bash_completion' >> /root/.bashrc \
    && apt-get purge -y --auto-remove autoconf \
                                      build-essential \
                                      libffi-dev \
                                      libssl-dev \
                                      libtool \
                                      libxml2-dev \
                                      libxslt1-dev \
                                      python-dev \
    && apt-get purge -y --auto-remove libtiff5-dev \
                                      libfreetype6-dev \
                                      libjpeg62-turbo-dev \
                                      liblcms2-dev \
                                      libwebp-dev \
                                      zlib1g-dev \
    && rm -rf /var/lib/apt/lists/*

RUN npm install -g phantomjs-prebuilt

COPY ./scrapyd.conf /etc/scrapyd/
VOLUME /etc/scrapyd/ /var/lib/scrapyd/
EXPOSE 6800

ADD requirements.txt .
RUN pip install -r requirements.txt
ADD . .
ADD eggs /src/eggs

CMD ["scrapyd", "--pidfile="]

Config file

[scrapyd]
eggs_dir          = /src/eggs
logs_dir          = /var/lib/scrapyd/logs
items_dir         = /var/lib/scrapyd/items
dbs_dir           = /var/lib/scrapyd/dbs
jobs_to_keep      = 5
max_proc          = 0
max_proc_per_cpu  = 4
finished_to_keep  = 100
poll_interval     = 5
bind_address      = 0.0.0.0
http_port         = 6800
debug             = off
runner            = scrapyd.runner
application       = scrapyd.app.application
launcher          = scrapyd.launcher.Launcher

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

Eggs are built in a separate step (which is tested and works OK), and we can assume that I have a couple of egg files, created with the python setup.py bdist_egg command on some Scrapy projects, stored in the eggs directory. The container is then simply run with docker run scrapy-deamon-eggs. Logs from scrapyd when run:

2017-07-02T20:40:16+0000 [-] Loading /usr/local/lib/python3.6/site-packages/scrapyd/txapp.py...
2017-07-02T20:40:16+0000 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-07-02T20:40:16+0000 [-] Loaded.
2017-07-02T20:40:16+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.1.0 (/usr/local/bin/python 3.6.1) starting up.
2017-07-02T20:40:16+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-07-02T20:40:16+0000 [-] Site starting on 6800
2017-07-02T20:40:16+0000 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site object at 0x7f3f293cde48>
2017-07-02T20:40:16+0000 [Launcher] Scrapyd 1.2.0 started: max_proc=32, runner='scrapyd.runner'
omrihar commented 6 years ago

@VanDavv Do you have a nice solution for this problem by now? I'm also interested in deploying scrapyd using Docker and even though I only have one scraper to deploy, I would much prefer to have everything built locally and sent to AWS in one nice package, rather than having to upload the docker image first and then use scrapyd-client to deploy my scraper.

VanDavv commented 6 years ago

@omrihar I abandoned this project. The furthest I got was to include the eggs in the image and, after scrapyd startup, upload them via scrapyd-client.

Another solution, launching scrapyd, uploading the spiders, then doing a docker commit and pushing the resulting image, also worked, but it wasn't what I wanted.
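Roughly, that second approach looked like this (the image and container names here are made up):

docker run -d --name scrapyd-base -p 6800:6800 my-scrapyd-image   # start a bare scrapyd container
scrapyd-deploy                                                     # upload the eggs from the project directory on the host
docker commit scrapyd-base my-scrapyd-with-eggs                    # snapshot the container with the eggs baked in
docker push my-scrapyd-with-eggs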

Maybe @Digenis could help us handle this case (and maybe remove the insufficient info label?)

radyz commented 5 years ago

I managed to get around this by running a background deploy after my scrapyd instance has started. Not sure it's the best way, but it works for me for now.

Dockerfile

FROM python:3.6

COPY requirements.txt /requirements.txt
RUN pip install -r requirements.txt
COPY docker-entrypoint /usr/local/bin/
RUN chmod 0755 /usr/local/bin/docker-entrypoint
COPY . /scrapyd
WORKDIR /scrapyd

ENTRYPOINT ["/usr/local/bin/docker-entrypoint"]

Entrypoint script

#!/bin/bash
bash -c 'sleep 15; scrapyd-deploy' &
scrapyd

scrapy.cfg

[settings]
default = scraper.settings

[deploy]
url = http://localhost:6800
project = projectname

This assumes you are copying your Scrapy project folder into /scrapyd and have a requirements.txt with all your dependencies (including the scrapyd server).
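Building and running is then the usual (the image name is just an example):

docker build -t my-scrapyd .
docker run -p 6800:6800 my-scrapyd
# after about 15 seconds the background scrapyd-deploy has run and the project is available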

iamprageeth commented 4 years ago

After reading @radyz's comment, I could also run a container with a deployed spider in the following way.

Dockerfile :

FROM vimagick/scrapyd:py3
COPY myspider /myspider/
COPY entrypoint1.sh /myspider
COPY entrypoint2.sh /myspider
COPY wrapper.sh /myspider
RUN chmod +x myspider/entrypoint1.sh
RUN chmod +x myspider/entrypoint2.sh
RUN chmod +x myspider/wrapper.sh
WORKDIR /myspider
CMD  ./wrapper.sh

wrapper.sh :

#!/bin/bash

# turn on bash's job control
set -m

# Start the primary process and put it in the background
./entrypoint1.sh &

# Start the helper process
./entrypoint2.sh

# the my_helper_process might need to know how to wait on the
# primary process to start before it does its work and returns

# now we bring the primary process back into the foreground
# and leave it there
fg %1

entrypoint1.sh :

#!/bin/bash
scrapyd

entrypoint2.sh :

#!/bin/bash
sleep 15; scrapyd-deploy

My Scrapy project resides in the myspider folder.

Refer : https://docs.docker.com/config/containers/multi-service_container/
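If the fixed sleep 15 ever turns out to be too short, entrypoint2.sh could instead poll scrapyd's daemonstatus.json endpoint until the server responds. Just a sketch (it assumes curl is available in the image), not something I've tested:

#!/bin/bash
# wait until scrapyd answers before deploying, instead of sleeping a fixed time
until curl -s http://localhost:6800/daemonstatus.json > /dev/null; do
    sleep 1
done
scrapyd-deploy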

jacob1237 commented 4 years ago

@VanDavv @iamprageeth @radyz

I managed to solve the problem without using the API. Unfortunately, there is no way to completely avoid the egg files when deploying Scrapy projects (the only way around them is to override some scrapyd components), so you'll need a simple deployment script:

build.sh:

#!/bin/sh

set -e

# The alternative way to build eggs is to use setup.py
# if you already have it in the Scrapy project's root
scrapyd-deploy --build-egg=myproject.egg

# your docker container build commands
# ...

Dockerfile:

RUN mkdir -p eggs/myproject
COPY myproject.egg eggs/myproject/1_0.egg

CMD ["scrapyd"]

That's all! So instead of deploying myproject.egg into the eggs folder directly, you have to create the following structure: eggs/myproject/1_0.egg, where myproject is your project name and 1_0 is the version of your project in scrapyd.
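Once the container is up, you can check that scrapyd picked up the pre-built egg (the project name is a placeholder):

curl "http://localhost:6800/listversions.json?project=myproject"
# should return something like {"status": "ok", "versions": ["1_0"]}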

dark-necron commented 3 years ago

Experimenting with the above approach, I ended up with a two-stage build. The first stage is used to build the egg without installing the unnecessary scrapyd-client into the final container. The resulting image, with Alpine as the base, is about 100 MB.

FROM python as builder

RUN pip install scrapyd-client

WORKDIR /build

COPY . .

RUN scrapyd-deploy --build-egg=scraper.egg

FROM python:alpine

RUN apk add --update --no-cache --virtual .build-deps \
      gcc \
      libffi-dev \
      libressl-dev \
      libxml2 \
      libxml2-dev \
      libxslt-dev \
      musl-dev \
    && pip install --no-cache-dir \
      scrapyd \
    && apk del .build-deps \
    && apk add \
      libressl \
      libxslt

VOLUME /etc/scrapyd/ /var/lib/scrapyd/

COPY ./scrapyd.conf /etc/scrapyd/

RUN mkdir -p /src/eggs/scraper
COPY --from=builder /build/scraper.egg /src/eggs/scraper/1_0.egg

EXPOSE 6800

ENTRYPOINT ["scrapyd", "--pidfile="]

Not fully tested yet, but seems operational.
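A quick smoke test after building (the image name is an example; the project name matches the eggs/scraper directory above):

docker build -t scraper-scrapyd .
docker run -d -p 6800:6800 scraper-scrapyd
# give the server a moment to start, then:
curl "http://localhost:6800/listspiders.json?project=scraper"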