raznem / parsera

Lightweight library for scraping web-sites with LLMs
https://parsera.org
GNU General Public License v2.0

Easy access to playwright instance. #14

Closed: SaifullahUsmani693 closed 2 weeks ago

SaifullahUsmani693 commented 3 weeks ago

After loading a URL, it would be amazing to be able to perform clicks (for example on a "load more" button), scroll down to trigger automatic pagination, fill in forms or captchas, click "read more" on text that is hidden by default, etc. That would be possible if I could get the Playwright instance back (other tools such as Selenium or Scrapy could also be offered as optional scraping backends).

That way, once the HTML is loaded to the state I need, passing it along with the elements (the prompt describing what to scrape) would make this tool extremely powerful and dynamic.

It would make it a complete web scraping solution.

raznem commented 3 weeks ago

Hi @SaifullahUsmani693, check out the fresh version v0.1.5. The page is now fetched via the PageLoader class, which lets you access the Playwright instance. For example:

import asyncio

from parsera.engine.model import GPT4oMiniModel
from parsera.engine.simple_extractor import TabularExtractor
from parsera.page import PageLoader

# Placeholder inputs: replace with your own URL and element descriptions
url = "https://example.com/"
elements = {"Title": "Article title"}


async def main():
    loader = PageLoader()
    await loader.load_content(url=url)

    # loader.page is the Playwright page, so you can perform actions on it
    # (Python uses get_by_role; getByRole is the JavaScript spelling)
    await loader.page.get_by_role("button").click()

    # Grab the final HTML once the page is in the state you need
    content = await loader.page.content()

    # Then run the extraction
    model = GPT4oMiniModel()
    extractor = TabularExtractor(elements=elements, model=model, content=content)
    return await extractor.run()


result = asyncio.run(main())
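
For the "load more" case from the original request, a minimal sketch might look like the following. It assumes `page` is a Playwright async page such as `loader.page`, and that the button's accessible name is "Load more". The helper name `click_until_gone`, the label, and the pause length are all hypothetical choices for illustration, not part of parsera.

```python
import asyncio


async def click_until_gone(page, name="Load more", max_clicks=10, pause_ms=500):
    """Click the button with the given accessible name until none is left.

    `page` is assumed to be a Playwright async Page (e.g. loader.page).
    Returns the number of clicks performed; max_clicks guards against
    pages where the button never goes away.
    """
    clicks = 0
    while clicks < max_clicks:
        button = page.get_by_role("button", name=name)
        # count() reports how many elements currently match the locator
        if await button.count() == 0:
            break
        await button.first.click()
        clicks += 1
        # Give the newly requested content a moment to render
        await page.wait_for_timeout(pause_ms)
    return clicks
```

You would call `await click_until_gone(loader.page)` after `load_content` and before `loader.page.content()`, so the HTML handed to the extractor already contains the extra items.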
SaifullahUsmani693 commented 2 weeks ago

Amazing, Misha! That was really fast, and now this software is going to be my go-to tool for web scraping. I'll definitely contribute to this project and promote it in my projects and among my clients.

raznem commented 2 weeks ago

@SaifullahUsmani693 I'd appreciate your contributions, thanks!