Open samnissen opened 7 years ago
We would benefit from :max_pages or :max_time options, especially in development and test environments.
If the exception is raised then would you want the whole crawl to stop at that point?
I think you get the same as max_pages by passing crawl_limit. There is also a crawl_limit_by_page boolean, which I think is false by default. crawl_limit is the maximum number of URLs, and if crawl_limit_by_page is set to true then the crawl_limit only applies to text/html content.
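To make the distinction concrete, here is a minimal sketch (not Cobweb's actual implementation) of the crawl_limit / crawl_limit_by_page behaviour described above — the method name within_crawl_limits? is borrowed from the discussion, and the signature is illustrative:

```ruby
# Sketch only: checks whether another URL may be crawled under a
# crawl_limit, optionally counting only text/html pages against it.
def within_crawl_limits?(crawled_count, content_type, options)
  limit = options[:crawl_limit]
  return true if limit.nil?
  # With crawl_limit_by_page, only text/html pages count toward the limit.
  return true if options[:crawl_limit_by_page] && content_type != "text/html"
  crawled_count < limit
end
```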
I like the idea of max_time though, hadn't thought of that before. I'm thinking it would set a datetime and include that date in the within_crawl_limits check to see whether it has passed, so it could also consume a stop_at datetime; max_time would just do the arithmetic for you.
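As a rough sketch of that idea — option names follow the discussion, this is not Cobweb's actual code — max_time is just arithmetic producing a stop_at deadline, which the limit check then compares against the current time:

```ruby
# Sketch: turn :max_time (seconds from now) or an explicit :stop_at
# into a single deadline, then check it during the crawl.
def resolve_stop_at(options, now = Time.now)
  return options[:stop_at] if options[:stop_at]
  return now + options[:max_time] if options[:max_time]
  nil
end

def past_deadline?(stop_at, now = Time.now)
  !stop_at.nil? && now >= stop_at
end
```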
Yes, I think raising the error, breaking, or returning should stop the crawl as the default.
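A hypothetical sketch of that raise-to-stop behaviour, with a clock lambda standing in for Time.now so the example is deterministic (none of these names are Cobweb's):

```ruby
# Sketch: an exception raised mid-crawl aborts the remaining queue,
# keeping whatever was crawled before the deadline passed.
class CrawlStopped < StandardError; end

def crawl_each(urls, max_time:, clock:)
  deadline = clock.call + max_time
  visited = []
  urls.each do |url|
    raise CrawlStopped if clock.call >= deadline
    visited << url # stand-in for fetching/processing the page
  end
  visited
rescue CrawlStopped
  visited # the whole crawl stops; partial results remain
end
```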
Wasn't aware of crawl_limit – will check that out, thank you. As for max_time, I'm thinking that would probably be an integer, whereas something like stop_at could be a datetime.
Hello -- this looks like a great crawler, but when crawling I need a way to cap crawl times on a per-URL basis.
Because of that, I recommend two features:
1. Actually raise exceptions. This would allow me to decide any arbitrary conditions upon which to stop crawling.
2. Encode crawl stop options. This would be a higher-level way of enshrining these as features, and would be a lot cleaner overall.
Ideally :max_time would accept DateTime, Time or Integer objects, where the integer would represent seconds. I'm totally new to this project, so feel free to let me know if these are crazy requests. I'm happy to help make this too, if you can give me a pointer as to where to start.
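For what it's worth, coercing those three accepted types into a single deadline could look something like this — a sketch under the assumption above (Integer means seconds from now), with an illustrative method name:

```ruby
require 'date'

# Sketch: normalize a :max_time value (DateTime, Time, or Integer
# seconds) into one Time deadline the crawler can compare against.
def coerce_stop_at(value, now = Time.now)
  case value
  when Integer  then now + value
  when Time     then value
  when DateTime then value.to_time
  else raise ArgumentError, "unsupported :max_time value: #{value.inspect}"
  end
end
```

Note the case order matters only in that DateTime is checked explicitly; a DateTime is not a Time, so it needs its own branch and a to_time conversion.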