scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Use --rundir when parsing relative *_dir options to absolute paths #70

Closed sujaymansingh closed 3 months ago

sujaymansingh commented 9 years ago

Hi

I run scrapyd with the --rundir option. (version 1.0.1)

I have the following issue.

I suspect the issue is to do with when the directory is changed.

It seems that SpiderScheduler loads the eggs/projects when it is initialised: https://github.com/scrapy/scrapyd/blob/1.0.1/scrapyd/scheduler.py#L12

But I think this is done before changing the working directory to --rundir.

To investigate, I added a couple of hacky print statements to SpiderScheduler

class SpiderScheduler(object):

    implements(ISpiderScheduler)

    def __init__(self, config):
        self.config = config
        import os; print "SpiderScheduler::__init__ current dir=" + os.getcwd()
        self.update_projects()

    def schedule(self, project, spider_name, **spider_args):
        q = self.queues[project]
        q.add(spider_name, **spider_args)

    def list_projects(self):
        import os; print "SpiderScheduler::list_projects current dir=" + os.getcwd()
        return self.queues.keys()

    def update_projects(self):
        self.queues = get_spider_queues(self.config)

Grepping the logs for "SpiderScheduler" (after I restart and then make a call in my browser to listprojects.json) gives:

SpiderScheduler::__init__ current dir=/opt/skuscraper
2014-12-11 15:22:40+0000 [HTTPChannel,0,10.10.9.220] SpiderScheduler::list_projects current dir=/var/scrapyd

So /opt/skuscraper is my project directory (with the scrapyd.conf). But I want the working directory to be separate, so that scrapyd doesn't put any extra files in the app directory. That is why I use /var/scrapyd as the run dir.

We can see that when the SpiderScheduler object is init'd, the current dir is /opt/skuscraper, so it can't find any eggs. But after the app starts up, it uses /var/scrapyd.

So any eggs deployed after the app starts up are saved to /var/scrapyd/eggs, but when scrapyd is restarted, it loads its initial list of eggs from /opt/skuscraper (where they won't exist).
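To illustrate the suspected mechanism outside of scrapyd, here is a minimal sketch showing that the same relative path resolves to two different locations depending on the working directory at the time it is resolved. The two temporary directories are stand-ins for the project dir and the run dir, and `resolve` is a hypothetical helper, not a scrapyd function.

```python
import os
import tempfile


def resolve(relative_dir):
    """Resolve a relative directory against the current working directory."""
    return os.path.abspath(relative_dir)


# Stand-ins for /opt/skuscraper (project dir) and /var/scrapyd (run dir).
# realpath() avoids symlink differences between mkdtemp() and getcwd().
project_dir = os.path.realpath(tempfile.mkdtemp())
run_dir = os.path.realpath(tempfile.mkdtemp())

original_cwd = os.getcwd()
try:
    # Before the daemon changes directory: "eggs" points into the project dir.
    os.chdir(project_dir)
    at_init = resolve("eggs")

    # After the change to the run dir: the same "eggs" points somewhere else.
    os.chdir(run_dir)
    after_start = resolve("eggs")
finally:
    os.chdir(original_cwd)

print(at_init)
print(after_start)
```

The two printed paths differ, which matches the behaviour seen in the logs: the initial load looks in one place and later deploys write to another.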

sujaymansingh commented 9 years ago

I guess that it is Twisted that actually changes the working directory to --rundir?

Digenis commented 8 years ago

See https://twistedmatrix.com/trac/ticket/2572. Does updating Twisted solve this? You can work around it by using absolute paths for eggs_dir, logs_dir, dbs_dir (and items_dir if used).

If this is indeed caused by the above bug, we can't wait for them to fix it, because they first have to discuss and decide between the 2 behaviours. In that case a workaround would be worthwhile: either overriding the default Twisted app argument parsing, or making scrapyd use the rundir option when preparing paths.
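As a rough sketch of the second option (using rundir when preparing paths), relative *_dir values could be anchored to the run directory before anything reads them. This is not scrapyd's actual code: `absolutize_dir_options` and the plain dict of options are hypothetical, and leaving an empty value untouched is an assumption.

```python
import os

# The *_dir options mentioned above.
DIR_OPTIONS = ("eggs_dir", "logs_dir", "dbs_dir", "items_dir")


def absolutize_dir_options(options, rundir):
    """Return a copy of options with relative *_dir values anchored to rundir.

    Hypothetical helper: absolute paths and empty values are left unchanged.
    """
    resolved = dict(options)
    for name in DIR_OPTIONS:
        value = resolved.get(name)
        if value and not os.path.isabs(value):
            resolved[name] = os.path.normpath(os.path.join(rundir, value))
    return resolved


options = {"eggs_dir": "eggs", "logs_dir": "/var/log/scrapyd", "items_dir": ""}
resolved = absolutize_dir_options(options, "/var/scrapyd")
print(resolved)
```

With paths prepared this way, the result no longer depends on whether the working directory has already been changed when the scheduler first loads the eggs.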

jpmckinney commented 3 months ago

Since this is the only issue report about --rundir in 10 years, I am simply removing it as an option. 9fa4091

If scrapyd is deployed using systemd, for example, WorkingDirectory= can be used instead.