Open berkantay opened 3 months ago
Hey! For one specific site, you can log in manually beforehand, take note of your cookies, and pass pageOptions.headers.Cookie
on a crawl request to specify cookies for us to use.
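For reference, a minimal sketch of what that suggestion might look like against the v0 scrape endpoint — the `fc-YOUR-KEY` value and the cookie names are placeholders, and the request is guarded so the sketch runs without a real key:

```python
import requests

API_KEY = "fc-YOUR-KEY"  # placeholder; substitute your real Firecrawl key

# Cookie string copied from your browser's devtools after logging in manually
payload = {
    "url": "https://example.com/protected-page",
    "pageOptions": {
        "headers": {
            "Cookie": "session_id=your_session_id_here; auth_token=your_auth_token_here",
        }
    },
}
api_headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

if API_KEY != "fc-YOUR-KEY":  # only fire the request once a real key is filled in
    response = requests.post(
        "https://api.firecrawl.dev/v0/scrape",
        headers=api_headers,
        json=payload,
    )
    print(response.text)
```

Note the two separate header locations: `api_headers` authenticates you to the Firecrawl API itself, while `pageOptions.headers.Cookie` is the value Firecrawl is being asked to forward to the target site.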
Hey @mogery can you please elaborate on that? I need to implement it.
I guess he meant you need to manually log in to the website, inspect it, find its cookies, then copy them back to Firecrawl.
You can copy this github thread to chatgpt. It will help you out.
Really?!
To be honest, I started my questioning on ChatGPT, then I came here for help, and now I have to go back to ChatGPT... My question is simple: how do you add a cookie to a Firecrawl API call?
@MAS-CreativeLabs As I understand it, this project uses the requests library. In that case, cookies are typically passed through the cookies param or the headers param.
Using the cookies param:
import requests
# Cookies as a dictionary
cookies = {
'session_id': 'your_session_id_here',
'auth_token': 'your_auth_token_here',
}
response = requests.get('https://example.com/protected-page', cookies=cookies)
print(response.text)
Using the Cookie header:
import requests
# Construct the cookies as a single header string
cookies_string = 'session_id=your_session_id_here; auth_token=your_auth_token_here'
# Add the cookies string to the headers
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
'Cookie': cookies_string,
}
response = requests.get('https://example.com/protected-page', headers=headers)
print(response.text)
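If the login itself is scriptable, one way to collapse a requests session's cookie jar into the single Cookie header string shown above (the cookie names here are illustrative, and the login call is simulated):

```python
import requests

session = requests.Session()
# In practice you would do session.post(login_url, data=credentials) first;
# here we set illustrative cookies directly to keep the sketch self-contained.
session.cookies.set("session_id", "your_session_id_here")
session.cookies.set("auth_token", "your_auth_token_here")

# Collapse the jar into a single "name=value; name=value" header string
cookie_string = "; ".join(
    f"{name}={value}" for name, value in session.cookies.get_dict().items()
)
print(cookie_string)
```

The resulting string can then be dropped into a `Cookie` header wherever one is expected.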
Currently, in this library the requests are created in the following fashion:
response = requests.post(
f'{self.api_url}{endpoint}',
headers=headers,
json=scrape_params,
)
The headers are obtained from the _prepare_headers method, which doesn't consider cookies.
My point is: currently, there doesn't seem to be a way in the API to pass cookies. Am I wrong about this analysis, @taowang1993 @mogery? I'd be happy to submit a PR to solve this issue if you think it might be valuable. Cheers!
Edit: I now get the point about the pageOptions param:
One of the tests does the following:
def test_successful_response_with_valid_api_key_and_include_html():
app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY, version='v0')
response = app.scrape_url('https://roastmywebsite.ai', {'pageOptions': {'includeHtml': True}})
assert response is not None
....
Theoretically you could pass pageOptions.headers.Cookie, but looking at the request preparation logic I still don't see how that value gets passed to the headers. Probably I'm missing something.
Yeah I'm having issues passing in cookies as well. Is this not the correct format?
url = "https://api.firecrawl.dev/v0/scrape"
payload = {
"url": "your-url-here",
"pageOptions": {
'headers': {
'Cookie': 'your_cookie_string_here'
}
}
}
headers = {
"Authorization": "API-KEY",
"Content-Type": "application/json"
}
response = requests.request("POST", url, json=payload, headers=headers, verify=False)
print(response.text)
That's the point: I might be wrong, but I think requests can't pass cookies through the JSON payload.
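That is indeed how requests behaves: a value inside the json= payload is just body bytes and never becomes a Cookie header on the HTTP request to the API. A quick way to confirm, using requests' own request preparation (no network needed):

```python
import requests

payload = {
    "url": "https://example.com",
    "pageOptions": {"headers": {"Cookie": "session=abc"}},
}
prepared = requests.Request(
    "POST", "https://api.firecrawl.dev/v0/scrape", json=payload
).prepare()

print("Cookie" in prepared.headers)     # the API call itself carries no Cookie header
print(b"session=abc" in prepared.body)  # the cookie rides inside the JSON body
```

So whether the cookie ever reaches the target site depends entirely on the Firecrawl backend honoring pageOptions.headers server-side; the client library cannot do it through the transport layer.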
Hi @nickscamara,
I reported this issue here; after your fix it was working for a couple of days, maybe a week, but since then it has broken again. I believe it would be beneficial to create an integration test to monitor the working state of this feature. I tested with ngrok and netcat: on the v0 API only custom headers work; cookies and the user agent do not, despite what the documentation says.
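For an integration test along those lines, a self-contained alternative to ngrok+netcat is a tiny local echo server that records the headers it receives. Below, a plain requests call stands in for the Firecrawl scraper so the sketch runs offline; in a real test you would expose the server (e.g. via ngrok) and point a Firecrawl scrape at it instead:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

received = {}  # headers the "target site" actually saw

class EchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record headers with lowercased names for easy lookup
        received.update({k.lower(): v for k, v in self.headers.items()})
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Stand-in for the scraper hitting the exposed URL
requests.get(
    f"http://127.0.0.1:{server.server_port}/",
    headers={"Cookie": "randomvalue=foo;", "randomheader": "randomvalue"},
)
server.shutdown()

print(received.get("cookie"), received.get("randomheader"))
```

Checking `received` after the scrape tells you exactly which of the requested headers survived the round trip.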
This request works; it passes the custom header through:
curl -X POST https://api.firecrawl.dev/v0/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-71...78' \
-d '{"url":"https://699f-143-298-38-113.ngrok-free.app", "pageOptions":{"headers": {"randomheader":"randomvalue"}}}' -s | grep randomvalue
The same request with a cookie header doesn't work:
curl -X POST https://api.firecrawl.dev/v0/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-71...78' \
-d '{"url":"https://699f-143-298-38-113.ngrok-free.app", "pageOptions":{"headers":{"cookie":"randomvalue=foo;"}}}' -s | grep randomvalue
With the v1 API it's not possible to send any headers at all:
curl -X POST https://api.firecrawl.dev/v1/scrape \
-H 'Authorization: Bearer fc-71...78' \
-H 'Content-type: application/json' \
-d '{"url": "https://699f-143-298-38-113.ngrok-free.app", "headers":{"cookie": "foo=randomvalue;"}}' -s | grep randomvalue
What am I doing wrong, and how can I fix it?
I used the v0 API and it doesn't work either.
Added a tracking item for this on our internal board, will look into it. Very weird.
I have a use case where I need to extract all the content from a website after logging in, and then convert the products on that site into structured data.
Questions: