scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
2.93k stars 571 forks

Allow launcher to track processes across restarts #521

Open jpmckinney opened 2 months ago

jpmckinney commented 2 months ago

The default Launcher stores self.processes as a dict in memory, so when Scrapyd restarts, it loses track of the processes. The processes are stored as ScrapyProcessProtocol instances.

A new Launcher could perhaps store these in a backend that it loads at startup.
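One possible shape for such a backend, sketched under the assumption that each ScrapyProcessProtocol exposes pid, project, spider and job attributes (the STATE_FILE path and both helper names are hypothetical, not Scrapyd API):

```python
import json
from pathlib import Path

# Illustrative sketch only: STATE_FILE and these helpers are not part of
# Scrapyd. Assumes each protocol object has pid, project, spider and job.
STATE_FILE = Path("launcher_state.json")

def save_processes(processes):
    """Persist the minimal state needed to find each job again."""
    state = {
        str(slot): {"pid": p.pid, "project": p.project,
                    "spider": p.spider, "job": p.job}
        for slot, p in processes.items()
    }
    STATE_FILE.write_text(json.dumps(state))

def load_processes():
    """Reload the saved state at startup; empty if nothing was saved."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {}
```

Of course, this only recovers the metadata; turning a saved pid back into a live ScrapyProcessProtocol is the hard part discussed below.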

Scrapyd only needs to restart rarely (e.g. version upgrade, configuration change, or server restart). If #519 is done, then configuration changes won't be a concern here.

jpmckinney commented 1 month ago

Having now read the code, the protocol stores a deferred, such that when the subprocess exits, it runs the _process_finished callback. Re-attaching this callback on restart is straightforward. The harder part is re-establishing whatever calls the protocol's outReceived, errReceived and processEnded methods (originally wired up by reactor.spawnProcess).

We would need to change the model significantly to resolve this issue. So, closing unless there are new ideas on how to resolve it.

Edit: This was also discussed starting from this comment https://github.com/scrapy/scrapyd/issues/12#issuecomment-667606979 and one idea was to somehow use Scrapy's JOBDIR logic - I haven't looked into it.

Anyway, I'll at least re-open in case anyone has new ideas, but for now this seems hard to resolve.

jpmckinney commented 1 month ago

Okay, so I did a little experimenting and wrote this small app that can run test files.

The test files are print1.py

import time

while True:
    print(1)
    time.sleep(1)

and printonce.py

print(1)

The app is:

import sys

from twisted.application.service import Application, Service
from twisted.internet import protocol, reactor
from twisted.python.runtime import platformType

from scrapyd.launcher import ScrapyProcessProtocol

if platformType == "win32":
    from twisted.internet._dumbwin32proc import Process
else:
    from twisted.internet.process import Process

def finish(proto):
    print("done")

class Launcher(Service):
    def startService(self):
        args = [sys.executable, "print1.py"]
        env = {}
        process = ScrapyProcessProtocol("p1", "s1", "j1", env, args)
        process.deferred.addBoth(finish)

        # reactor.spawnProcess(process, sys.executable, args=args, env=env)
        Process(reactor, sys.executable, args, env, path=None, proto=process)

def app():
    app = Application("test")
    launcher = Launcher()
    launcher.setServiceParent(app)
    return app

application = app()

And the app is run with:

twistd -n -y spawn.py

To explain:

Importing twisted.internet.reactor calls twisted.internet.default.install(), which installs one of the platform-appropriate reactors (on POSIX, an epoll-, poll- or select-based reactor), all of which inherit from twisted.internet.posixbase.PosixReactorBase. This class implements spawnProcess, which, for Scrapyd, instantiates twisted.internet.process.Process if twisted.python.runtime.platformType == "posix", and twisted.internet._dumbwin32proc.Process if it is "win32" (that is, if os.name is in ("nt", "ce")).

Anyway, Process.__init__ always calls its _fork method, which in turn calls os.fork() or os.posix_spawnp to create a subprocess – so there doesn't seem to be a way to attach to an already-running process through Twisted. We would have to build our own mechanism, while also reestablishing the logic for processEnded, etc.
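For completeness, one crude fallback (a sketch, not a proposal for Scrapyd's API) is to poll an adopted PID and approximate processEnded, accepting that the exit status of a non-child process cannot be reaped and that outReceived/errReceived output is lost entirely:

```python
import os

def watch_pid(pid, on_ended, schedule, interval=1.0):
    """Poll an adopted PID and call on_ended(pid) once it disappears.

    Sketch only (POSIX-only): `schedule` is anything with
    reactor.callLater's signature. Signal 0 performs an existence check
    without delivering a signal; os.kill raises ProcessLookupError once
    the process is gone.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        on_ended(pid)
    else:
        schedule(interval, watch_pid, pid, on_ended, schedule, interval)
```

Under Twisted this would be driven as watch_pid(pid, finish, reactor.callLater); it only tells us that the process went away, not how it exited.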