spatie / crawler

An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.
https://freek.dev/308-building-a-crawler-in-php
MIT License

Allow for chunked crawls by not running the queue empty (fixes #325) #331

Closed - spekulatius closed this 3 years ago

spekulatius commented 4 years ago

Hello @freekmurze,

as described in #325, I've noticed some limitations when crawling larger sites in chunks. I've decided to look into what it takes to solve this. This is the outcome.

It introduces one major change:

Would be awesome if you could review this and let me know if you foresee any issues or want changes. Thank you in advance!

Cheers, Peter

Redominus commented 4 years ago

There is no need to modify the core Crawler to achieve this logic. You could implement your own queue that has a limit, and when the crawler calls hasPendingUrls() after reaching that limit you just return false.
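Roughly like this (a sketch only, extending the package's ArrayCrawlQueue; class namespaces and method signatures may differ between versions):

```php
<?php

use Spatie\Crawler\CrawlQueue\ArrayCrawlQueue; // namespace differs between major versions
use Spatie\Crawler\CrawlUrl;

// Hypothetical queue that pretends to be empty once a limit is reached:
// hasPendingUrls() returns false, so the crawler stops without any change
// to the crawler class itself.
class LimitedCrawlQueue extends ArrayCrawlQueue
{
    protected int $limit;

    protected int $handedOut = 0;

    public function __construct(int $limit)
    {
        $this->limit = $limit;
    }

    public function getFirstPendingUrl(): ?CrawlUrl
    {
        $url = parent::getFirstPendingUrl();

        // Count only URLs that were actually handed to the crawler.
        if ($url !== null) {
            $this->handedOut++;
        }

        return $url;
    }

    public function hasPendingUrls(): bool
    {
        return $this->handedOut < $this->limit && parent::hasPendingUrls();
    }
}
```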

spekulatius commented 4 years ago

Hey @Redominus

yeah, I see what you mean. I find it odd to manipulate the crawling behaviour through the queue, which I see more as a helper/adapter to storage/persistence layers - not control logic. The max count setting on the crawler wouldn't make much sense then either, IMHO. Just my impression from how I would approach it.

Cheers, Peter

Redominus commented 4 years ago

Right now the only place to stop the crawling is the CrawlQueue. You could inject the repository into the constructor of the CrawlQueue class and let it be a sort of manager of the crawling process, e.g. limit by the number of URLs already crawled or already added, stop crawling gracefully by capturing POSIX signals, or read the status from the proxy system to stop crawling under certain circumstances. If we start adding all the possible cases to the crawler, it will become bloated.

spekulatius commented 4 years ago

Yes, I see what you mean - that's what my PR is about: tweaking the core behavior.

20 lines coming in (including some comments and readme) while 18 lines go out isn't too bad I think. I also think it makes the crawler more logical and structured than stopping the crawl indirectly from the queue.

Redominus commented 4 years ago

Oh my.. I misunderstood the changes you are pulling. You are right. Sorry.

freekmurze commented 4 years ago

The crawl count limit will be checked while the requests are generated and processed, instead of when adding new URLs. This avoids the queue "running out": the crawl queue keeps being filled with the discovered pages, but no new requests are generated once the max crawl count is reached. This allows to reuse the crawl queue directly.

Could you clarify this with a concrete example? I don't understand "This avoids the queue "running out" and "This allows to reuse the crawl queue directly."

spekulatius commented 3 years ago

Hey @freekmurze,

Thank you for taking time to look at this, highly appreciated!

To your questions:

Could you clarify this with a concrete example? I don't understand "This avoids the queue "running out" and "This allows to reuse the crawl queue directly."

This avoids the queue "running out"

Currently, the crawler stops filling the queue once the maxCrawlCount is reached. As no new links are added, this stops the crawler indirectly by running the queue empty. The only way to crawl large sites (without having days-long running processes) is by manipulating the queue behaviour, as @Redominus pointed out. This moves part of the control logic into the queue driver. It's a bit like stopping a car by cutting off the fuel supply.

The suggested change moves the maximum count checks into the crawler, allowing a limit for the current crawl to be defined via maxCrawlCount on the crawler itself. It avoids re-implementing the logic in the queue and keeps the queue driver single-purpose. Swapping out queue drivers without having to implement any limit logic becomes possible (e.g. Redis locally and AWS SQS in production).
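As a minimal sketch (using the current method and namespace names, which this PR may still change):

```php
<?php

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueue\ArrayCrawlQueue; // or any other CrawlQueue implementation

// With the limit enforced by the crawler itself, the queue driver stays
// single-purpose: swap in Redis, SQS, etc. without re-implementing limits.
Crawler::create()
    ->setCrawlQueue(new ArrayCrawlQueue())
    ->setMaximumCrawlCount(100) // the limit lives in the crawler, not the queue
    ->startCrawling('https://example.com');
```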

This allows to reuse the crawl queue directly.

This assumes some form of batched processing (e.g. Laravel batches) to avoid a single long-running PHP process, with the queue stored in between runs. At the moment, we would have to manipulate the queue each time to allow for another round of crawling (e.g. by resetting an internal count, or by taking the information and instantiating a new copy of the queue).

If the crawling limits are controlled by the crawler itself, that last step wouldn't be needed: we could pass the queue to the crawler repeatedly and simply continue crawling where it left off last time.
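Something along these lines (a sketch only; setCurrentCrawlLimit stands for the per-run limit proposed here, the final name isn't settled):

```php
<?php

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueue\ArrayCrawlQueue;

// Sketch: the same queue instance is handed to the crawler repeatedly;
// each run processes the next chunk and picks up where the last one stopped.
$queue = new ArrayCrawlQueue();

for ($run = 1; $run <= 3; $run++) {
    Crawler::create()
        ->setCrawlQueue($queue)
        ->setCurrentCrawlLimit(50) // hypothetical per-run limit proposed in this PR
        ->startCrawling('https://example.com');
}
```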

Side Note on Crawling Limits

I can see how we might have slightly different interpretations of the maxCrawlCount. I assume you see it as a total crawl count, while I read it as a limit for the current execution of the crawler. It might make sense to separate it into two limits to control the crawler more granularly and allow for batched crawling.
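With hypothetical names, just to make the distinction concrete:

```php
use Spatie\Crawler\Crawler;

// Hypothetical method names for the two limits, for illustration only.
Crawler::create()
    ->setTotalCrawlLimit(500)   // never crawl more than 500 URLs in total
    ->setCurrentCrawlLimit(50)  // crawl at most 50 URLs in this execution
    ->startCrawling('https://example.com');
```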

Let me know what you think about the points above.

Cheers, Peter

freekmurze commented 3 years ago

@spekulatius Thanks for explaining this.

It makes sense to me now that this change improves the behaviour. I do think this is a breaking change, so we need to tag a new major version, which I'm willing to do.

The reason you want to change the behaviour is to make it easier to stop and restart the crawler with its current state, right? Do you see anything else that could be improved to make this even easier? Breaking changes are allowed. This would make the crawler easier to use in time-restricted environments such as AWS Lambda.

Regarding "Side Note on Crawling Limits", we could indeed have two separate settings for this. They need to be named well.

I highly value good docs, so anything we add or change must have top notch docs.

@spekulatius feel free to continue working on this in this PR.

spekulatius commented 3 years ago

Hey @freekmurze

Awesome, happy to hear you like where it's going! Yes, the change would alter the API and functionality of the package and therefore require a major release.

Exactly, the goal is to simplify starting and stopping the crawler. I think an example will help to show how it works. I would suggest the following enhancements for this PR:

Going further: another idea to consider would be a unified configuration object or array. Then only two things would be passed to the crawler: the queue driver and the configuration. In the long run this could be taken further and form the base for a URL-based configuration approach for distributed crawlers. Imagine a number of crawler nodes "waiting" as Docker containers, each triggered using the config URL as a parameter. This might exceed the scope of the package though, and would probably be better done separately, especially as you would likely want to include something like Serverless or Bref to configure this automatically. The thinking on this isn't concluded, as you can see. What is your feeling about having a configuration array for now?
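A purely hypothetical shape for such an array, just to illustrate the idea (neither these keys nor this create() signature exist in the package):

```php
<?php

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueue\ArrayCrawlQueue;

// Hypothetical: every key would map to an existing setter on the crawler.
$config = [
    'maximumDepth'      => 3,
    'concurrency'       => 10,
    'currentCrawlLimit' => 50,
];

Crawler::create($config) // hypothetical signature for passing crawler configuration
    ->setCrawlQueue(new ArrayCrawlQueue())
    ->startCrawling('https://example.com');
```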

Let me know if you think the two points above don't cover it or need tweaking.

Cheers, Peter

Redominus commented 3 years ago

@spekulatius

If the idea is to pass a configuration object to the crawler, a URL-based configuration is just one implementation of a crawler configuration repository. Someone else might want to load it from a DB or a file. In my opinion, the configuration hydration should be external to the project. Am I understanding it correctly?

Redominus commented 3 years ago

@spekulatius One more idea: passing a KeepCrawlingService to the crawler. The counting could be done in an observer. It would allow more complex logic for stopping the crawler.
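Something like this, purely as a sketch (nothing like it exists in the package; all names below are hypothetical):

```php
<?php

// Hypothetical: the crawler would consult this service before sending each
// request, so stop conditions (counters, time limits, POSIX signals, ...)
// live outside the crawler itself.
interface KeepCrawlingService
{
    public function shouldKeepCrawling(): bool;
}

class CountingKeepCrawlingService implements KeepCrawlingService
{
    protected int $limit;

    protected int $crawled = 0;

    public function __construct(int $limit)
    {
        $this->limit = $limit;
    }

    // Could be called from a crawl observer whenever a URL has been crawled.
    public function recordCrawledUrl(): void
    {
        $this->crawled++;
    }

    public function shouldKeepCrawling(): bool
    {
        return $this->crawled < $this->limit;
    }
}
```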

spekulatius commented 3 years ago

@spekulatius

If the idea is to pass a configuration object to the crawler, a URL-based configuration is just one implementation of a crawler configuration repository. Someone else might want to load it from a DB or a file. In my opinion, the configuration hydration should be external to the project. Am I understanding it correctly?

True - the config-by-URL is just one possible way. Passing an array to the constructor would be enough to allow easier configuration of the crawler for now. The detailed implementation of how and where the array comes from is something I would leave out too (as hinted at with the "out of scope" part at the end).

@spekulatius One more idea: passing a KeepCrawlingService to the crawler. The counting could be done in an observer. It would allow more complex logic for stopping the crawler.

Yeah, a service to handle the logic is something to consider. It would take some logic out of the crawler - which isn't too bad at the moment, I would say.

spekulatius commented 3 years ago

Hey @freekmurze

as discussed, I've added support for the total and current crawl limits. It would be great if you could let me know whether this works for you.

Cheers, Peter

freekmurze commented 3 years ago

Thanks for updating the readme. I think the difference between the two limits is too vague for newcomers. The notion that this lays the base for chunked crawling could be confusing too.

Could you try to improve that by adding a few examples?

If you're up to it, you may also implement the logic needed to start the crawler using a certain state (continuing from a past, incomplete crawl), and add an example in the readme for that as well.

spekulatius commented 3 years ago

Thanks for updating the readme. I think the difference between the two limits is too vague for newcomers. The notion that this lays the base for chunked crawling could be confusing too.

Okay, I've removed the parts about chunked crawling and reworded a few sections.

Could you try to improve that by adding a few examples?

I've added basic examples, do you think they need to be expanded on?

If you're up to it, you may also implement the logic needed to start the crawler using a certain state (continuing from a past, incomplete crawl), and add an example in the readme for that as well.

It's already implemented and working (as far as I know). The last of the three examples I've added should demonstrate it.

Please let me know if you find any issues or something isn't working for you @freekmurze

freekmurze commented 3 years ago

It's already implemented and working (as far as I know). The last of the three examples I've added should demonstrate it.

Nice! Right now, the example only shows how to continue crawling in the same request. Maybe we should also have an example of how to continue crawling in another request. I guess this would mean somehow saving/loading the state of the queue.

Since we're going to tag a new major version, breaking code changes may be made to make saving/loading the state of the queue easy. If you want to tackle this, let me know. Otherwise, I'll try to make some time in the next couple of weeks.

spekulatius commented 3 years ago

Hey @freekmurze

Thanks for your feedback. The example implies that you store the queue in between - it's not planned to run within the same request. I could either add a comment (for example // Store the queue and reload it later...), add a serialization step as an example, or build a complete example project using Laravel (in a separate repo). The latter would probably be most informative and also help test it properly.
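A rough sketch of the serialization variant (assuming the queue implementation is serializable, which isn't guaranteed for every driver, and using the per-run limit from this PR):

```php
<?php

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueue\ArrayCrawlQueue;

// Request 1: crawl a chunk, then store the queue somewhere persistent.
$queue = new ArrayCrawlQueue();

Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(50) // per-run limit added in this PR
    ->startCrawling('https://example.com');

file_put_contents('/tmp/crawl-queue', serialize($queue));

// Request 2 (later): reload the queue and continue where we left off.
$queue = unserialize(file_get_contents('/tmp/crawl-queue'));

Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(50)
    ->startCrawling('https://example.com');
```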

Regarding further breaking changes: we could pass an optional crawler configuration array (discussed above) to the create method, then run all the setter calls on the newly created object before returning it. This way, the work to configure the crawler could be reduced and standardized. How does this sound?

Cheers, Peter

freekmurze commented 3 years ago

I could either add a comment (for example // Store the queue and reload it later...), add a serialization step as an example, or build a complete example project using Laravel (in a separate repo).

Let's do both: an example in this readme and a simple example project.

Regarding further breaking changes: we could pass an optional crawler configuration array (discussed above) to the create method, then run all the setter calls on the newly created object before returning it. This way, the work to configure the crawler could be reduced and standardized. How does this sound?

Send a separate PR for that for me to review.

Thanks for your work on this! 👍

spekulatius commented 3 years ago

I could either add a comment (for example // Store the queue and reload it later...), add a serialization step as an example, or build a complete example project using Laravel (in a separate repo).

Let's do both: an example in this readme and a simple example project.

Okay, sounds good. I'll get started and get back to you asap.

Regarding further breaking changes: we could pass an optional crawler configuration array (discussed above) to the create method, then run all the setter calls on the newly created object before returning it. This way, the work to configure the crawler could be reduced and standardized. How does this sound?

Send a separate PR for that for me to review.

Okay, no problem.

Thanks for your work on this! 👍

No worries, it's helping the project/community as well as myself :)

freekmurze commented 3 years ago

Thanks @spekulatius. I've created a branch v6. Please send your PRs to that branch 🙂

spekulatius commented 3 years ago

Thanks @spekulatius. I've created a branch v6. Please send your PRs to that branch 🙂

Thank you, the PR is already pointing to the new branch. I'll send any follow up PRs to it as well.

spekulatius commented 3 years ago

Hey @freekmurze

summarizing the recent changes: as you already noticed, I've added the example for crawling across requests 🎉

I've also built a super simple application demonstrating crawling in separate requests. This should also help in testing this change - it's Docker-based and doesn't require further configuration: build and run the container and you are good to go. I'm not a Docker expert, so there are probably better ways to do it, but it works (for me 😆). As this is only an example and shouldn't be used by itself anyway, that should be fine.

The example uses a package with several classes linking Laravel and the crawler more closely together. I started putting this together while using the crawler; the goal is to make using the crawler easier and increase reusability. I'm planning to add a few more things as I continue working with the crawler. Naturally, it uses my fork of this PR at the moment. Once the changes here are released, I'll switch it over to the new major release. Until then, it shouldn't be considered stable.

As for the suggested settings-by-array, I'll start working on it in a few days and open a new PR once this one is completed. We can hold off on releasing v6 until that's done.

Besides some minor OCD tidy-ups here and there, I think that's mostly it. As far as I can see, the PR should be ready to be properly reviewed and tested. I'm looking forward to hearing what you think. Please let me know if you think changes are needed!

Cheers, Peter

spekulatius commented 3 years ago

Hello @freekmurze

I've noticed while working with the crawler (using concurrency) that the queue driver sometimes gets asked for another URL while the previous one hasn't been marked as done yet (because it's still being written to the cache, for example). To fix this, I've started returning a random URL from the pending list instead of the first one. Do you think it would be okay to rename getFirstPendingUrl to getPendingUrl to reflect this properly?
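For illustration, a hypothetical version of such a method (assuming the queue keeps its pending URLs in a protected array, as ArrayCrawlQueue does; details may differ):

```php
<?php

use Spatie\Crawler\CrawlQueue\ArrayCrawlQueue;
use Spatie\Crawler\CrawlUrl;

// Hypothetical sketch: hand out a random pending URL instead of always the
// first one, so concurrent workers are less likely to grab the same entry
// before it has been marked as processed.
class RandomPickCrawlQueue extends ArrayCrawlQueue
{
    public function getPendingUrl(): ?CrawlUrl
    {
        if (empty($this->pendingUrls)) {
            return null;
        }

        // array_rand() picks a random key from the pending list.
        return $this->pendingUrls[array_rand($this->pendingUrls)];
    }
}
```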

Cheers, Peter

freekmurze commented 3 years ago

That's ok! I'll review it all soon.

spekulatius commented 3 years ago

Thanks, I'll add it to this PR shortly.

freekmurze commented 3 years ago

Looks good, I'm going to polish the readme a little bit more.

I'll probably release this next week.

Feel free to PR more improvements to the v6 branch

spekulatius commented 3 years ago

Thank you for reviewing @freekmurze!

Let me know if you find anything that needs more work. I'm going to prepare the configuration array as a parameter to the create call over the weekend.

freekmurze commented 3 years ago

Great, ping me when that's ready and then I'll review everything for v6

spekulatius commented 3 years ago

Hey @freekmurze

I started the implementation but wasn't too happy with what it turned the crawler into - lots of checks and calls just to avoid some additional configuration from the outside.

This is roughly how it would go. I'm less excited now :D What's your take - finish or drop?

Cheers, Peter

freekmurze commented 3 years ago

That's not too pretty... Let's drop it.

spekulatius commented 3 years ago

Yeah, agreed @freekmurze. Let's drop it. With this the crawler should be ready for review.

spekulatius commented 3 years ago

Thank you for reviewing and releasing v6 @freekmurze! If any questions come up, feel free to tag me 💪