spatie / crawler

An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
https://freek.dev/308-building-a-crawler-in-php
MIT License
2.49k stars 356 forks

Cookies in clientOptions #465

Closed. davidgeisler1998 closed this 5 days ago

davidgeisler1998 commented 3 months ago

Hey there :)

I'm using the Spatie crawler to collect cookies from a website. To receive all cookies, I need to "accept" the cookie consent. Therefore, I would like to set a cookie before the crawler visits the website.

I found out that the crawler swaps the "RequestOptions::COOKIES => true" option for an empty CookieJar. After that, I tried to instantiate a CookieJar and pass it as the value instead of true, but that didn't work.

Another attempt was to set a header option like this: RequestOptions::HEADERS => [ 'User-Agent' => Crawler::DEFAULT_USER_AGENT, 'Set-Cookie' => '{testName=testValue; path=/; secure; HttpOnly}', ]

But this still doesn't work. Do you have an idea how I can set cookies before the crawler visits a website?

Redominus commented 3 months ago

Hi, RequestOptions is part of the GuzzleHttp library, and so is the cookie jar. You can check here https://github.com/guzzle/guzzle/blob/429cb6702659329819fb40c9487eac3132bdd80b/src/Client.php#L260 where the cookies option is converted from true to a CookieJar. You can pass an instantiated CookieJar instead of true. CookieJar has a static function that helps instantiate it. The SetCookie class also has a static method that helps instantiate a cookie, so you can create a new CookieJar from an array of SetCookie objects. I personally prefer the latter, as it allows me to configure each cookie. You can read more about cookies in Guzzle here https://docs.guzzlephp.org/en/stable/quickstart.html?highlight=cookie#cookies
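
As a rough sketch of both approaches, assuming guzzlehttp/guzzle and spatie/crawler are installed through Composer: the cookie name and value are taken from the question, while example.com is only a placeholder domain. CookieJar::fromArray() and SetCookie::fromString() are the static helpers I mean.

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Cookie\CookieJar;
use GuzzleHttp\Cookie\SetCookie;
use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

// Option 1: build a jar from a name => value map for one domain.
// "example.com" is a placeholder; use the domain you actually crawl.
$simpleJar = CookieJar::fromArray(['testName' => 'testValue'], 'example.com');

// Option 2: configure each cookie individually, then pass the
// SetCookie objects to the CookieJar constructor.
$consent = SetCookie::fromString(
    'testName=testValue; Domain=example.com; Path=/; Secure; HttpOnly'
);
$jar = new CookieJar(false, [$consent]);

// Hand the pre-filled jar to the crawler instead of `true`.
$crawler = Crawler::create([
    RequestOptions::COOKIES => $jar,
]);
```

Keep in mind that the jar only sends a cookie when its domain and path match the request, and a cookie marked Secure is only sent over HTTPS.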

Finally, Set-Cookie is a server response header (see the MDN doc). On a request you should use the Cookie header instead (see the MDN doc).
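
If you prefer the header route, it could look roughly like this. It reuses the options from your snippet and only swaps Set-Cookie for Cookie, whose value is plain name=value pairs without attributes such as path or secure. Again this assumes a Composer install of both packages.

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

// The Cookie request header carries only "name=value" pairs,
// separated by "; " when there is more than one.
$clientOptions = [
    RequestOptions::HEADERS => [
        'User-Agent' => Crawler::DEFAULT_USER_AGENT,
        'Cookie' => 'testName=testValue',
    ],
];

$crawler = Crawler::create($clientOptions);
```

Unlike a cookie jar, a fixed header is not scoped to a domain, so it is sent with every request the client makes.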

Regards

davidgeisler1998 commented 3 months ago

Hi again :) Thanks for your answer.

Now I have another task: I want to click something with the Browsershot::click() method.

Is it possible for the click to be executed only once, at the start of the crawl, and not again for the pages that follow?

E.g. I want to accept the cookie consent once so that it stays accepted, but my problem is that none of the other pages know about the cookie consent.