Closed SaifullahUsmani693 closed 2 weeks ago
Hi @SaifullahUsmani693, check out the fresh version v0.1.5. The page is now fetched via the PageLoader class, which gives you access to the Playwright instance. For example:
from parsera.engine.simple_extractor import TabularExtractor
from parsera.page import PageLoader
from parsera.engine.model import GPT4oMiniModel
loader = PageLoader()
await loader.load_content(url=url)
## Then you can use the loader's attributes to perform actions, for example, on the Playwright page
await loader.page.get_by_role("button").click()
## Grab the page content once the interactions are done
content = await loader.page.content()
## Next, you can run the extraction process
model = GPT4oMiniModel()
extractor = TabularExtractor(
    elements=elements, model=model, content=content
)
result = await extractor.run()
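The snippet above uses top-level `await`, which works in a notebook or async REPL but not in a plain script. Here is a minimal sketch of the same flow wrapped in a coroutine. The parsera calls are the ones shown in the reply; the names `scrape`, `interact`, and `click_first_button` are made up for illustration, and it is an assumption that `loader.page` is a Playwright async page.

```python
import asyncio


async def scrape(url, elements, interact=None):
    """Load a page, optionally interact with it, then extract `elements`."""
    # Imported lazily so this module can be loaded without parsera installed.
    from parsera.engine.model import GPT4oMiniModel
    from parsera.engine.simple_extractor import TabularExtractor
    from parsera.page import PageLoader

    loader = PageLoader()
    await loader.load_content(url=url)

    # Hypothetical hook: any coroutine that takes the Playwright page.
    if interact is not None:
        await interact(loader.page)

    # Read the HTML only after the interactions have finished.
    content = await loader.page.content()
    extractor = TabularExtractor(
        elements=elements, model=GPT4oMiniModel(), content=content
    )
    return await extractor.run()


async def click_first_button(page):
    """Example interaction: click the first element with the button role."""
    await page.get_by_role("button").first.click()
```

From a script this could be driven with something like `asyncio.run(scrape(url, elements, interact=click_first_button))`, where `url` and `elements` are whatever you already pass to parsera.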
Amazing, Misha! That was really fast, and now this software is going to be my go-to tool for web scraping. I'll definitely contribute to this project and promote it in my projects and among my clients.
@SaifullahUsmani693 I'd appreciate your contribution, thanks!
After entering the URL, it would be amazing to be able to perform clicks (like on a "load more" button), scroll down for auto pagination, fill in forms or captchas, click "read more" on text that is hidden by default, etc. That can be done if I get back the Playwright instance (other packages such as Selenium, Scrapy, etc. could also be returned as optional scraping tools).
This way, when the HTML is 100% loaded as per the needs, then passing that HTML with the elements (prompt telling what to scrape) will make this tool extremely powerful and dynamic.
It'll make it a complete web scraping solution.
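To make the request concrete, here is a rough sketch of two of the interactions mentioned above ("load more" clicks and scrolling for auto pagination), written against a Playwright async page. The helper names, the click/round caps, and the pause lengths are all hypothetical; only `locator`, `is_visible`, `click`, `wait_for_timeout`, and `evaluate` come from Playwright's Python API.

```python
import asyncio


async def click_while_visible(page, selector, max_clicks=20, pause_ms=500):
    """Click the element matching `selector` until it is no longer visible.

    Returns the number of clicks performed. `max_clicks` is a safety cap
    so a button that never disappears cannot loop forever.
    """
    button = page.locator(selector).first
    clicks = 0
    while clicks < max_clicks and await button.is_visible():
        await button.click()
        # Give the page a moment to append the newly loaded items.
        await page.wait_for_timeout(pause_ms)
        clicks += 1
    return clicks


async def scroll_to_bottom(page, max_rounds=20, pause_ms=500):
    """Scroll down until the document height stops growing.

    Returns the number of scroll rounds performed.
    """
    last_height = await page.evaluate("document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(pause_ms)
        rounds += 1
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so assume we reached the end
        last_height = new_height
    return rounds
```

Once helpers like these have run, the fully expanded HTML can be read with `await page.content()` and handed to the extractor together with the elements. A fixed pause is the simplest option here; pages that load slowly may need a longer pause or an explicit wait on the new items.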