spider-rs / spider-py

Spider ported to Python
https://spider-rs.github.io/spider-py/
MIT License

Example interrupt handler to stop crawl #7

Closed rlskoeser closed 2 months ago

rlskoeser commented 2 months ago

The documentation provides syntax for stopping a crawl https://spider-rs.github.io/spider-py/website.html#stop-crawl

Could you provide an example of how to bind a signal handler and/or catch KeyboardInterrupt to stop a crawl?

j-mendez commented 2 months ago

If the program is terminated the crawl will stop. It does not spawn processes.

rlskoeser commented 2 months ago

That's not the behavior I'm seeing, which is why I asked. Is that still true when you have a subscription? Or maybe it's something else I'm doing?

Here's the current version of my script: https://github.com/Princeton-CDH/caliper-scrapy/blob/spider-rs-rebuild/src/caliper/crawl.py

I found some examples of interrupt handlers with asyncio and tried them. When I run this script on a small static site, it seems to be interruptible, but it crawls the site SO quickly that it's hard to interrupt it fast enough. When I run it on a larger, slower site, I can't interrupt it at all. The only way I've found to stop it from the terminal is to suspend it and then kill it.

j-mendez commented 2 months ago

Also call website.unsubscribe to close down the subscription channel. stop prevents the crawl from proceeding, but it does not halt isolated concurrent work that is already running on a green thread.

There should be no processes - not sure if pyo3 is doing anything in that manner.

j-mendez commented 2 months ago

Let me know if you see a process spawn. If so that is incorrect since everything should be green/native threads. There's no std::process being used. As soon as the main program terminates everything should stop.

rlskoeser commented 2 months ago

Well, I found one thing that may be relevant: as of Python 3.11 (I've been using 3.12), the asyncio runner installs a custom handler for SIGINT: https://docs.python.org/3/library/asyncio-runner.html#handling-keyboard-interruption

There's an example for using it with Task.cancel but I haven't figured out how that works when using asyncio.run.

I tried to test on earlier versions of python, but I'm getting other errors that I don't think are related to this.

But maybe the custom handler isn't relevant since we're not using asyncio.Runner... LMK if you have any ideas about how I might investigate.
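
As a rough illustration of the handler described in the linked docs: on Python 3.11+, the first Ctrl-C makes asyncio cancel the main task, so the coroutine sees asyncio.CancelledError and gets a chance to clean up. The sketch below simulates that cancellation with a timer instead of a real signal. The `log` list and the `simulate_interrupt_after` parameter are made up for the demo, and the sleep loop is only a placeholder for crawl work, not spider-py code.

```python
import asyncio


async def main(log, simulate_interrupt_after=None):
    if simulate_interrupt_after is not None:
        # Stand-in for Ctrl-C: cancel the main task after a short delay,
        # which is what asyncio's SIGINT handling does on Python 3.11+.
        asyncio.get_running_loop().call_later(
            simulate_interrupt_after, asyncio.current_task().cancel
        )
    try:
        while True:
            # Placeholder for crawl work; each await is a point where
            # cancellation can be delivered.
            await asyncio.sleep(0.01)
    except asyncio.CancelledError:
        log.append("cancelled")
        raise  # re-raise so asyncio.run can finish shutting down
    finally:
        # Cleanup that should run however the coroutine ends.
        log.append("cleanup")
```

This only helps if the coroutine regularly reaches an await. A long blocking call that never yields to the event loop cannot be cancelled this way, which may be one reason a slow crawl appears uninterruptible.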

rlskoeser commented 2 months ago

I finally found a solution that works for my script:

import asyncio

try:
    asyncio.run(main())
except KeyboardInterrupt:
    # Ctrl-C lands here; exit the program
    raise SystemExit
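
The same pattern can be wrapped in a small helper that also runs a cleanup step before exiting. This is a sketch under stated assumptions, not part of spider-py: `run_until_interrupted`, `demo_main`, and the `cleanup` callback are hypothetical names, and `demo_main` raises KeyboardInterrupt itself to stand in for pressing Ctrl-C. In a real script, `cleanup` is presumably where the website.stop and website.unsubscribe calls mentioned earlier in the thread would go.

```python
import asyncio


def run_until_interrupted(main, cleanup=None):
    """Run the coroutine function `main`; on Ctrl-C, call `cleanup` and exit."""
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        if cleanup is not None:
            # Hypothetical hook: stop the crawl / close subscriptions here.
            cleanup()
        # 130 is the usual shell convention for "terminated by SIGINT".
        raise SystemExit(130)


async def demo_main():
    # Stand-in for the real crawl coroutine (an assumption for this demo).
    await asyncio.sleep(0)
    # Simulate the user pressing Ctrl-C while the crawl is running.
    raise KeyboardInterrupt
```

Calling run_until_interrupted(demo_main, cleanup=some_function) runs the cleanup once and then exits via SystemExit. If nothing interrupts the run, the helper simply returns after main finishes.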