raznem / parsera

Lightweight library for scraping web-sites with LLMs
https://parsera.org
GNU General Public License v2.0
722 stars, 47 forks

Add support for websites that require a login #6

Closed: XD-coder closed this issue 4 days ago

XD-coder commented 4 weeks ago

Websites that require a login are a huge pain in the @ss. I think it would be a good idea to use an LLM to find where to enter the user details, or which HTTP request to send in order to log in.

mikebgrep commented 4 weeks ago

Playwright supports logins.

raznem commented 2 weeks ago

Hi @XD-coder, in v0.1.7 a new class, ParseraScript, has been added. It allows executing custom Playwright scripts during scraping. For example, you can log in to parsera.org and get your number of credits with the following code:

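from playwright.async_api import Page
from parsera import ParseraScript  # import path assumed; check your installed version

# EMAIL, PASSWORD, and model are assumed to be defined before this point

# Define the script to execute once, during session creation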
async def initial_script(page: Page) -> Page:
    await page.goto("https://parsera.org/auth/sign-in")
    await page.wait_for_load_state("networkidle")
    await page.get_by_label("Email").fill(EMAIL)
    await page.get_by_label("Password").fill(PASSWORD)
    await page.get_by_role("button", name="Sign In", exact=True).click()
    await page.wait_for_selector("text=Playground")
    return page

# This script is executed each time, after the URL is opened
async def repeating_script(page: Page) -> Page:
    await page.wait_for_timeout(1000)  # Wait one second for page to load
    return page

# The await below needs an async context (an async function or a notebook)
parsera = ParseraScript(model=model, initial_script=initial_script)
result = await parsera.arun(
    url="https://parsera.org/app",
    elements={
        "credits": "number of credits",
    },
    playwright_script=repeating_script,
)
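
The snippet above uses EMAIL and PASSWORD without defining them and awaits at the top level. Below is a minimal sketch of one way to fill both gaps: credentials are read from environment variables, and the scraping step is driven by asyncio.run. The variable names PARSERA_EMAIL and PARSERA_PASSWORD, the helper load_credentials, and the placeholder fake_scrape are all hypothetical and not part of parsera. In real use, fake_scrape would be replaced by a coroutine that builds ParseraScript and awaits arun() as shown above.

```python
import asyncio
import os


def load_credentials(env=os.environ):
    """Return (email, password) from the environment, failing early if unset."""
    # PARSERA_EMAIL and PARSERA_PASSWORD are hypothetical variable names.
    try:
        return env["PARSERA_EMAIL"], env["PARSERA_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(
            f"Set {missing.args[0]} before running the scraper"
        ) from None


async def main(scrape, env=os.environ):
    """Load credentials, then await the scraping coroutine function passed in."""
    email, password = load_credentials(env)
    # `scrape` stands in for the code that creates ParseraScript and awaits arun().
    return await scrape(email, password)


async def fake_scrape(email, password):
    # Placeholder so the sketch runs without parsera or a browser installed.
    return {"user": email}


if __name__ == "__main__":
    demo_env = {"PARSERA_EMAIL": "me@example.com", "PARSERA_PASSWORD": "secret"}
    print(asyncio.run(main(fake_scrape, demo_env)))
```

Keeping the credentials out of the source file also means the login script can be shared or committed without leaking them.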