Closed: rlskoeser closed this issue 3 months ago
The documentation provides syntax for stopping a crawl: https://spider-rs.github.io/spider-py/website.html#stop-crawl

Could you provide an example of how to bind a signal handler and/or catch `KeyboardInterrupt` to stop a crawl?
If the program is terminated the crawl will stop. It does not spawn processes.
That's not the behavior I'm seeing, which is why I asked. Is that still true when you have a subscription? Or maybe it's something else I'm doing?
Here's the current version of my script: https://github.com/Princeton-CDH/caliper-scrapy/blob/spider-rs-rebuild/src/caliper/crawl.py
I found some examples of interrupt handlers with asyncio and tried them. When I run this script on a small static site, it seems to be interruptible, but it crawls the site SO quickly that it's hard to interrupt it in time. When I run it on a larger, slower site, I can't interrupt it at all. The only way I've found to stop it from the terminal is to suspend it and then kill it.
Use `website.unsubscribe` as well to close the channel down when using subscriptions, since `stop` prevents the crawl from proceeding but not isolated concurrent work being done on a green thread.

There should be no processes - not sure if pyo3 is doing anything in that manner.
Let me know if you see a process spawn. If so, that is incorrect, since everything should be green/native threads. There's no `std::process` being used. As soon as the main program terminates, everything should stop.
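A minimal sketch of what calling those from a signal handler might look like, assuming the `stop`/`unsubscribe`/`subscribe` methods behave as in the linked docs (the URL and callback are placeholders, and the exact signatures may differ by version):

```python
import asyncio
import signal

from spider_rs import Website


async def main():
    # Placeholder URL and subscription callback -- swap in your own.
    website = Website("https://example.com")
    website.subscribe(lambda page: print(page.url))  # assumed subscribe signature

    def shutdown():
        # Close the subscription channel first, then stop the crawl,
        # per the advice above. Adjust if your spider-py version exposes
        # these as coroutines or requires a subscription id.
        website.unsubscribe()
        website.stop()

    # Unix-only: run shutdown() on Ctrl-C instead of raising KeyboardInterrupt.
    asyncio.get_running_loop().add_signal_handler(signal.SIGINT, shutdown)

    await website.crawl()


asyncio.run(main())
```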
Well, I found one thing that may be relevant: as of Python 3.11 (I've been using 3.12), the asyncio runner installs a custom handler for SIGINT: https://docs.python.org/3/library/asyncio-runner.html#handling-keyboard-interruption

There's an example of using it with `Task.cancel`, but I haven't figured out how that works when using `asyncio.run`.
I tried to test on earlier versions of Python, but I'm getting other errors that I don't think are related to this.

But maybe the custom handler isn't relevant, since we're not using `asyncio.Runner`... LMK if you have any ideas about how I might investigate.
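For reference, here is a rough sketch of the `asyncio.Runner` pattern from those docs applied to a crawl; the `Website` usage and the assumption that cancellation actually reaches the in-flight crawl carry the same caveats as above:

```python
import asyncio

from spider_rs import Website


async def main():
    website = Website("https://example.com")  # placeholder URL
    try:
        await website.crawl()
    except asyncio.CancelledError:
        # On the first Ctrl-C, the Runner's SIGINT handler cancels this task.
        # Whether the in-flight crawl unwinds promptly is the open question
        # in this thread; try to shut it down before re-raising.
        website.stop()  # assumed synchronous, as in the stop-crawl docs
        raise


if __name__ == "__main__":
    try:
        # asyncio.run() uses asyncio.Runner under the hood on 3.11+;
        # using the Runner explicitly makes the SIGINT/cancellation flow visible.
        with asyncio.Runner() as runner:
            runner.run(main())
    except KeyboardInterrupt:
        print("crawl interrupted")
```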
I finally found a solution that works for my script:

```python
try:
    asyncio.run(main())
except KeyboardInterrupt:
    raise SystemExit
```
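For context, this matches the 3.11+ runner behavior described in the docs above: the first Ctrl-C cancels the main task and `asyncio.run()` then re-raises `KeyboardInterrupt` to the caller, so any `finally`/`except asyncio.CancelledError` cleanup inside `main()` (e.g. the `website.unsubscribe()` call suggested earlier) should get a chance to run before the `SystemExit`. That's a reading of the asyncio docs, not anything spider-py-specific.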