mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[Question] Do you support crawling pages that require login? #546

Open berkantay opened 3 months ago

berkantay commented 3 months ago

I have a use case where I need to extract all the content from a website after logging in, and then convert the products on that site into structured data.

Questions:

  1. Does your tool/library support automated login by passing user credentials to access protected pages?
  2. Or is the functionality limited to extracting data from publicly accessible pages only?

mogery commented 3 months ago

Hey! For one specific site, you can log in manually beforehand, take note of your cookies, and pass pageOptions.headers.Cookie on a crawl request to specify cookies for us to use.
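
For reference, here is a minimal sketch of the request body this suggestion describes. It assumes the cookies go under `pageOptions.headers.Cookie` exactly as named above; the helper names `cookie_header` and `build_crawl_payload`, the example URL, and the cookie names are hypothetical placeholders. It only builds the payload and does not show whether the service actually forwards the header.

```python
import json


def cookie_header(cookies: dict[str, str]) -> str:
    """Join cookie name/value pairs into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def build_crawl_payload(url: str, cookies: dict[str, str]) -> dict:
    """Build a request body carrying the cookies under pageOptions.headers.Cookie."""
    return {
        "url": url,
        "pageOptions": {
            "headers": {"Cookie": cookie_header(cookies)},
        },
    }


# Placeholder values (assumptions); copy the real ones from your browser
# after logging in manually.
payload = build_crawl_payload(
    "https://example.com/protected-page",
    {"session_id": "your_session_id_here", "auth_token": "your_auth_token_here"},
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary would be sent as the JSON body of the crawl request, with the API key in the `Authorization` header of that request.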

MAS-CreativeLabs commented 2 months ago

Hey! For one specific site, you can log in manually beforehand, take note of your cookies, and pass pageOptions.headers.Cookie on a crawl request to specify cookies for us to use.

Hey @mogery can you please elaborate on that? I need to implement it.

MAS-CreativeLabs commented 2 months ago

I guess he meant you need to log in to the website manually, inspect the page to find its cookies, then copy them into Firecrawl.

You can copy this github thread to chatgpt. It will help you out.

Really?!

To be honest, I started my questioning on ChatGPT, then I came here for help, and now I have to go back to ChatGPT ... My question is simple: how do you add a cookie to a Firecrawl API call?

AdolfoVillalobos commented 2 months ago

@MAS-CreativeLabs As I understand it, this project uses the requests library. In that case, cookies are typically passed through the cookies param or the headers param:

Using cookies

import requests

# Cookies as a dictionary
cookies = {
    'session_id': 'your_session_id_here',
    'auth_token': 'your_auth_token_here',
}

response = requests.get('https://example.com/protected-page', cookies=cookies)

print(response.text)

Using headers

import requests

# Construct the cookies as a single header string
cookies_string = 'session_id=your_session_id_here; auth_token=your_auth_token_here'

# Add the cookies string to the headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Cookie': cookies_string,
}

response = requests.get('https://example.com/protected-page', headers=headers)

print(response.text)

Currently, in this library the requests are created in the following fashion:

response = requests.post(
    f'{self.api_url}{endpoint}',
    headers=headers,
    json=scrape_params,
)

The headers are obtained from the _prepare_headers method, which doesn't consider cookies.

My point is: currently, there doesn't seem to be a way in the API to pass cookies, even if you have them. Am I wrong with this analysis, @taowang1993 @mogery? I would be happy to submit a PR to solve this issue if you think it might be valuable. Cheers!

AdolfoVillalobos commented 2 months ago

Edit: I now get the point about the pageOptions param:

One of the tests does the following:


def test_successful_response_with_valid_api_key_and_include_html():
    app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY, version='v0')
    response = app.scrape_url('https://roastmywebsite.ai', {'pageOptions': {'includeHtml': True}})
    assert response is not None
    ....

Theoretically you could pass pageOptions.headers.Cookie, but looking at the request preparation logic I still don't see how that value gets passed to the headers. I'm probably missing something.
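
One possible reading, sketched below as a toy model and not taken from Firecrawl's code: there are two separate sets of headers. The HTTP headers of the call to the API carry the API key, while `pageOptions.headers` travels inside the JSON body, and the service would have to apply it when it fetches the target page. Both function names are hypothetical, and `outbound_headers` is only a stand-in for whatever the service does on its side.

```python
def api_request_parts(api_key: str, url: str, page_headers: dict[str, str]) -> tuple[dict, dict]:
    """Return (HTTP headers, JSON body) for the call made to the API."""
    http_headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    # The cookie is not an HTTP header of this call; it sits in the JSON body.
    body = {"url": url, "pageOptions": {"headers": page_headers}}
    return http_headers, body


def outbound_headers(body: dict) -> dict[str, str]:
    """Toy stand-in (assumption) for the service: headers it would use for body['url']."""
    return dict(body.get("pageOptions", {}).get("headers", {}))


http_headers, body = api_request_parts(
    "API-KEY",
    "https://example.com/protected-page",
    {"Cookie": "session_id=your_session_id_here"},
)
print("Cookie" in http_headers)          # header layer of the API call
print(outbound_headers(body)["Cookie"])  # header layer of the page fetch
```

Under this reading, the client-side request preparation never needs to know about cookies, which would explain why nothing in it mentions them.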

ChinoOragwam commented 2 months ago

Yeah I'm having issues passing in cookies as well. Is this not the correct format?

import requests

url = "https://api.firecrawl.dev/v0/scrape"

# Cookies for the target site go inside the JSON body, under pageOptions.headers
payload = {
    "url": "your-url-here",
    "pageOptions": {
        "headers": {
            "Cookie": "your_cookie_string_here"
        }
    }
}

# Headers for the API call itself; the key is sent as a Bearer token
headers = {
    "Authorization": "Bearer API-KEY",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.text)

AdolfoVillalobos commented 2 months ago

That's the point: I might be wrong, but I think requests can't pass cookies through the JSON payload.

ksbomj commented 2 months ago

Hi @nickscamara,

I reported this issue here. After your fix it worked for a couple of days, maybe a week, but it has since broken again. I believe it would be beneficial to create an integration test to monitor the working state of this feature. I tested with ngrok and netcat: on the v0 API only a custom header gets through, while the cookie and user agent headers do not, contrary to what the documentation says.
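
A rough sketch of the target side of such an integration test, using only the Python standard library: a small server that echoes back the request headers it receives, so a test can assert which headers arrived. The class and function names are hypothetical. Here the server is fetched directly over localhost as a stand-in; a real test would expose it publicly (for example through a tunnel) and fetch it through the scrape API instead.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeaders(BaseHTTPRequestHandler):
    """Reply to any GET with the request headers that were received, as JSON."""

    def do_GET(self):
        body = json.dumps(dict(self.headers.items())).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_echo_server() -> HTTPServer:
    """Start the echo server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), EchoHeaders)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def received_headers(url: str, headers: dict[str, str]) -> dict[str, str]:
    """Fetch url with the given headers; return what the server saw (lower-cased names)."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # bypass proxies
    request = urllib.request.Request(url, headers=headers)
    with opener.open(request) as response:
        seen = json.loads(response.read())
    return {name.lower(): value for name, value in seen.items()}


server = start_echo_server()
url = f"http://127.0.0.1:{server.server_port}/"
seen = received_headers(url, {"Cookie": "randomvalue=foo", "randomheader": "randomvalue"})
cookie_arrived = seen.get("cookie") == "randomvalue=foo"
custom_arrived = seen.get("randomheader") == "randomvalue"
server.shutdown()
server.server_close()
print(cookie_arrived, custom_arrived)
```

Comparing header names case-insensitively matters here, because HTTP clients are free to change the capitalization of header names in transit.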

This request works; it passes the custom header:

curl -X POST https://api.firecrawl.dev/v0/scrape \
   -H 'Content-Type: application/json' \
   -H 'Authorization: Bearer fc-71...78' \
   -d '{"url":"https://699f-143-298-38-113.ngrok-free.app", "pageOptions":{"headers": {"randomheader":"randomvalue"}}}' -s | grep randomvalue

The same request with a cookie header doesn't work:

curl -X POST https://api.firecrawl.dev/v0/scrape \
   -H 'Content-Type: application/json' \
   -H 'Authorization: Bearer fc-71...78' \
   -d '{"url":"https://699f-143-298-38-113.ngrok-free.app", "pageOptions":{"headers":{"cookie":"randomvalue=foo;"}}}' -s | grep randomvalue

With the v1 API it is not possible to send any headers:

curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Authorization: Bearer fc-71...78' \
  -H 'Content-type: application/json' \
  -d '{"url": "https://699f-143-298-38-113.ngrok-free.app", "headers":{"cookie": "foo=randomvalue;"}}' -s | grep randomvalue

What am I doing wrong, and how can I fix it?

suyu15 commented 2 weeks ago

I used the v0 API and it doesn't work either.

mogery commented 2 weeks ago

Added a tracking item for this on our internal board, will look into it. Very weird.