unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
Apache License 2.0

How to use crawl4ai to crawl page content that needs login first? #262

Closed · kksasa closed this 1 week ago

kksasa commented 1 week ago

Hello,

I can't find the right way to handle this case: the site I need to crawl redirects to a login page first, and only shows the content page after logging in.

Here is my test code, but it doesn't work. Could you give me a hand?


from playwright.async_api import async_playwright
from crawl4ai import AsyncWebCrawler
import asyncio

async def login_and_get_page(browser):
    context = await browser.new_context(viewport={'width': 1920, 'height': 1080})
    page = await context.new_page()

    await page.goto('https://sample.com')
    await page.fill('input[name="USER"]', '123')
    await page.fill('input[name="PASSWORD"]', '123')
    await page.click('input[type="submit"]')

    title = await page.title() 
    print(f"title: {title}")
    print("login pass") 
    return page

async def main():
    print("[HOOK] on_browser_created")

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)  # headless=False so the browser is visible
        page = await login_and_get_page(browser)

        # Grab the cookies from the logged-in page
        cookies = await page.context.cookies()
        print(cookies)
        # Crawl the page content with Crawl4AI
        async with AsyncWebCrawler(verbose=True, cookies=cookies, headless=False, magic=True) as crawler:
            result = await crawler.arun(
                url="https://sample.com",
            )
        print(result.markdown)

asyncio.run(main())
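
For context, plain Playwright can persist a logged-in session to disk and reload it later via storage_state; a minimal sketch, reusing the poster's placeholder URL (the file name auth_state.json is arbitrary):

# Save the session right after the login succeeds in login_and_get_page():
await page.context.storage_state(path="auth_state.json")

# Later, open a fresh context that is already logged in:
context = await browser.new_context(storage_state="auth_state.json")
page = await context.new_page()
await page.goto("https://sample.com")  # lands on the content page if the session is still valid
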
kksasa commented 1 week ago

I found a solution by adding my login code into

async_crawler_strategy.py --> async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:

after page = await context.new_page()


# Inserted into AsyncPlaywrightCrawlerStrategy.crawl(), right after
# page = await context.new_page(), so every crawl logs in first.
await page.goto(url)
await page.fill('input[name="USER"]', 'nnn')
await page.fill('input[name="PASSWORD"]', 'xx')
await page.click('input[type="submit"]')
title = await page.title()
print(f"PAGE title: {title}")
print("login pass")
unclecode commented 6 days ago

@kksasa Thanks for using Crawl4ai. In short:

I believe you're using our library with Playwright in an unintended way. The library is designed to simplify exactly these tasks, and you can achieve most of what you're doing without raw Playwright. We have features like hooks, page access, and JavaScript execution that can help.

For example, you can use our "Managed Browsers" feature (coming in a new version) to create a browser session with a pre-logged-in user. Alternatively, you can use our hooks to run code before crawling, such as filling out a login form and waiting for elements to load.

I'd be happy to provide a simple example of how to use the library effectively. I'll add a demo for this to my backlog and explain how you can do it. Please stay tuned.
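
A minimal sketch of the hooks approach, assuming the set_hook API and the on_browser_created hook name used by this release (hook names and signatures vary between versions, so check the Hooks & Auth docs for your install); the URL, selectors, and credentials are the poster's placeholders:

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

async def on_browser_created(browser):
    # Log in once, in the same browser the crawler will use, so the
    # session cookies are already set when arun() navigates.
    context = browser.contexts[0] if browser.contexts else await browser.new_context()
    page = await context.new_page()
    await page.goto("https://sample.com")             # placeholder login URL
    await page.fill('input[name="USER"]', '123')      # placeholder credentials
    await page.fill('input[name="PASSWORD"]', '123')
    await page.click('input[type="submit"]')
    await page.wait_for_load_state("networkidle")
    await page.close()

async def main():
    strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
    strategy.set_hook("on_browser_created", on_browser_created)
    async with AsyncWebCrawler(crawler_strategy=strategy, verbose=True) as crawler:
        result = await crawler.arun(url="https://sample.com")
        print(result.markdown)

asyncio.run(main())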

rdvo commented 5 days ago

Can we pass cookies through to the crawl REST API endpoint so it can be logged in?

unclecode commented 2 days ago

@rdvo Are you referring to using Crawl4ai through the running Docker server? Is that what you mean by the 'REST API endpoint'? If so, yes, you can definitely pass cookies. For example:

request = {
    "urls": "https://www.nbcnews.com/business",
    "priority": 8,
    "js_code": [
        "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
    ],
    "wait_for": "article.tease-card:nth-child(10)",
    "crawler_params": {
        "headless": True,
        "cookies": [{...}, ...]
    }
}
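
As for the shape of each cookie entry: assuming crawler_params forwards them to Playwright's context.add_cookies() (an assumption about the server's plumbing), each entry would be a dict like the following, with either "url" or "domain" plus "path" required; the names and values here are hypothetical:

"cookies": [
    {
        "name": "sessionid",       # hypothetical cookie name
        "value": "abc123",         # hypothetical value
        "domain": ".example.com",  # or pass "url" instead of domain + path
        "path": "/",
    }
]
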
rdvo commented 2 days ago

Yes. What are the params there? Is it in JSON format like the EditThisCookie Chrome addon? What format do we pass them in?

Does it get passed as crawler_params options?

Thanks!