Closed shamspias closed 1 year ago
Does SeleniumURLLoader work for you? From my experimentation, it seems to use a headless browser to load a webpage.
Example:
from langchain.document_loaders import SeleniumURLLoader
from langchain.text_splitter import NLTKTextSplitter

def __load_url(url_strings):
    # Load each URL in a Selenium-driven browser so JavaScript-rendered content is included.
    loader = SeleniumURLLoader(urls=url_strings)
    pages = loader.load()
    # Split the loaded pages into overlapping chunks.
    text_splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=100)
    docs = text_splitter.split_documents(pages)
    return docs
It doesn't. I tried this, and I also tried a customized version, but it's still not working :/
Hi, @shamspias! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, you requested a feature to add an option in WebBaseLoader to handle dynamic content loaded via JavaScript when scraping webpages. You provided a code snippet using pyppeteer as an alternative to Selenium, but it seems that it is not working for you. Sommohapatra suggested using SeleniumURLLoader as a possible solution, but you confirmed that it is not working either.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your understanding and contribution to the LangChain project!
Feature request
When you request a webpage using a library like requests or aiohttp, you get only the initial HTML of the page; any content that JavaScript loads afterwards is not included. That's why you might see template tags like {{(item.price)}} taka instead of the actual values. Those tags are placeholders that JavaScript fills in with real data after the page loads.
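One quick way to confirm this is what's happening is to scan the raw HTML for leftover template tags. The snippet below is only a rough sketch: the function name, the regular expression, and the sample HTML are made up for illustration, and it assumes the page uses double-curly-brace placeholders.

```python
import re
from typing import List

# Matches double-curly-brace placeholders such as {{ item.price }}.
PLACEHOLDER_RE = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def find_unrendered_placeholders(html: str) -> List[str]:
    """Return the template placeholders still present in raw HTML."""
    return PLACEHOLDER_RE.findall(html)


# Hypothetical example of what requests/aiohttp might return
# for a page whose prices are filled in by JavaScript.
raw_html = '<span class="price">{{(item.price)}} taka</span>'

leftover = find_unrendered_placeholders(raw_html)
if leftover:
    print("Page probably needs JavaScript rendering:", leftover)
```

If the list is non-empty, the fetched HTML was most likely never rendered by a browser.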
To handle this, you need a library that can execute JavaScript. A commonly used one is Selenium, but it's heavier than requests or aiohttp because it requires running an actual web browser. Is there another option that doesn't need a full web browser, or that can be used in LangChain without a graphical interface, such as a headless-browser tool like pyppeteer (a Python port of Puppeteer)? Please look into this and add a feature like this. Thanks in advance.
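For reference, fetching a rendered page with pyppeteer on its own (outside LangChain) could look roughly like the sketch below. This is not the WebBaseLoader API; the helper names and the timeout default are assumptions for illustration. Note that pyppeteer still drives a real Chromium binary (it downloads one on first use), it just runs it without a window.

```python
import asyncio


async def fetch_rendered_html(url: str, timeout_ms: int = 30000) -> str:
    """Load `url` in headless Chromium and return the HTML after scripts have run."""
    # Imported lazily so this module can still be imported where pyppeteer is not installed.
    from pyppeteer import launch

    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        # "networkidle0" waits until there are no open network connections,
        # which gives JavaScript a chance to fill in the dynamic content.
        await page.goto(url, {"waitUntil": "networkidle0", "timeout": timeout_ms})
        return await page.content()
    finally:
        # Always shut the browser down, even if navigation fails.
        await browser.close()


def load_rendered_html(url: str) -> str:
    """Synchronous wrapper (hypothetical helper) for use in non-async code."""
    return asyncio.run(fetch_rendered_html(url))
```

Calling load_rendered_html("https://example.com") should return the page's HTML after rendering. asyncio.run cannot be called from inside an already running event loop (for example in a notebook), so there you would await fetch_rendered_html directly.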
Motivation
To get dynamic content from a webpage while scraping text from a website or webpage.
Your contribution
On my side, I rewrote the _fetch method in your WebBaseLoader class to use pyppeteer instead of aiohttp. It's still not working, but I think it might help a little. Here is my code, where I override the class
and install these two libraries