langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Issue: getting not implemented error when using async chromium loader #12763

Closed HRUSHI1212 closed 2 months ago

HRUSHI1212 commented 10 months ago

Issue you'd like to raise.

I tried web scraping with LangChain and used the AsyncChromiumLoader as described in the LangChain documentation, but I got an asyncio.run() error. I made some changes that I found on the web, but it is still not working. Below is the code:

import asyncio
import playwright
from langchain.document_loaders import AsyncChromiumLoader
from langchain.document_transformers import BeautifulSoupTransformer
import nest_asyncio

nest_asyncio.apply()

async def main():
    # Create an instance of AsyncChromiumLoader with a list of URLs
    urls = ["https://www.hdfcbank.com/personal/borrow/popular-loans/personal-loan/eligibility"]  # Replace with your desired URLs
    loader = AsyncChromiumLoader(urls)

    # Load the URLs
    html = await loader.load()

    # Print the HTML content
    for url, content in html.items():
        print(f"URL: {url}\nHTML Content: {content}\n")

if __name__ == "__main__":
    asyncio.run(main())

Suggestion:

No response

dosubot[bot] commented 10 months ago

🤖

Hello @HRUSHI1212,

I'm here to assist you with your issue. I'm an AI bot designed to help solve bugs, answer questions, and guide you in becoming a contributor while you're waiting for a human maintainer.

Based on the information you've provided, it seems like you're encountering a 'not implemented' error when using the async chromium loader for web scraping with LangChain. This error is likely due to the use of asyncio.run() in the lazy_load() method of the AsyncChromiumLoader class. The asyncio.run() function is used to execute a coroutine and return the result, but it cannot be used when the event loop is already running.
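The restriction on asyncio.run() described above can be reproduced with the standard library alone. The following is a minimal sketch, not taken from LangChain: the names inner and outer are hypothetical, and outer simply stands in for any code that already runs inside an event loop, such as a Jupyter cell.

```python
import asyncio


async def inner() -> str:
    # Hypothetical coroutine standing in for the real scraping work.
    return "done"


async def outer() -> str:
    # This coroutine is executing inside a running event loop,
    # so a nested asyncio.run() call is rejected.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

The printed message should mention that asyncio.run() cannot be called from a running event loop, which matches the RuntimeError reported later in this thread.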

To fix this issue, you can use the run_until_complete() method of the event loop instead of asyncio.run(). Here's how you can modify the lazy_load() method:

def lazy_load(self) -> Iterator[Document]:
    """
    Lazily load text content from the provided URLs.

    This method yields Documents one at a time as they're scraped,
    instead of waiting to scrape all URLs before returning.

    Yields:
        Document: The scraped content encapsulated within a Document object.
    """
    loop = asyncio.get_event_loop()
    for url in self.urls:
        html_content = loop.run_until_complete(self.ascrape_playwright(url))
        metadata = {"source": url}
        yield Document(page_content=html_content, metadata=metadata)

In this modified version, asyncio.get_event_loop() is used to get the current event loop, and then run_until_complete() is used to run the ascrape_playwright() coroutine until it is complete. This should resolve the 'not implemented' error.
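One caveat worth checking before relying on this change: run_until_complete() is also rejected when the loop it is called on is already running, which would be the case in a notebook unless something like nest_asyncio has patched the loop. The sketch below uses only the standard library; fetch and caller are hypothetical names and are not part of AsyncChromiumLoader.

```python
import asyncio


async def fetch() -> str:
    # Hypothetical stand-in for a scraping coroutine.
    return "<html></html>"


async def caller() -> str:
    # Get the loop that is currently running this coroutine.
    loop = asyncio.get_running_loop()
    coro = fetch()
    try:
        # Calling run_until_complete() on a running loop raises RuntimeError.
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        return str(exc)
    finally:
        coro.close()  # the coroutine was never scheduled; close it cleanly
    return "no error"


message = asyncio.run(caller())
print(message)
```

In a plain .py script no loop is running when lazy_load() is called, so this caveat applies mainly to environments that start their own loop.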

If you're still encountering issues, it might be due to using an outdated version of LangChain or the 'playwright' module. In such a case, upgrading to the latest versions is recommended. You can install the 'playwright' module in the Python environment by executing the command pip install playwright in the terminal. After installing the 'playwright' module, run the command playwright install to download the necessary browser binaries that 'playwright' needs to function correctly.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

Best, Dosu


HRUSHI1212 commented 10 months ago

This is also not working.

anudeepadi commented 8 months ago

@HRUSHI1212 Did you try running pip install playwright and then playwright install? It worked for me with these.

jerry1900 commented 7 months ago

I hit the same problem, "RuntimeError: asyncio.run() cannot be called from a running event loop", when using Jupyter, as shown below: [image]

Interestingly, when I switched to another development environment, PyCharm, and ran the same code, it worked, as shown below: [image]

I strongly suspect that Jupyter, which runs in a browser, conflicts with the async methods that drive another browser (Chromium). So if you hit this problem, try a different development environment.
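A likely contributing factor is that Jupyter already runs its own event loop, so any synchronous API that calls asyncio.run() internally is rejected there but works in a plain script. One possible workaround, sketched below under stated assumptions, is to run the blocking call in a worker thread, where no loop is running yet. The functions scrape, blocking_load, and notebook_cell are hypothetical stand-ins, not LangChain APIs, and the thread does not establish whether this also resolves the NotImplementedError seen on some setups.

```python
import asyncio


async def scrape(url: str) -> str:
    # Hypothetical stand-in for the real scraping coroutine.
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


def blocking_load(url: str) -> str:
    # Mirrors a synchronous API that calls asyncio.run() internally.
    return asyncio.run(scrape(url))


async def notebook_cell() -> str:
    # Simulates code inside an already-running loop (for example a Jupyter cell).
    # asyncio.to_thread (Python 3.9+) runs the blocking call in a worker thread,
    # which has no running loop, so the inner asyncio.run() is allowed.
    return await asyncio.to_thread(blocking_load, "https://example.com")


result = asyncio.run(notebook_cell())
print(result)
```

In an actual notebook the last line would instead be a top-level await of notebook_cell(), since asyncio.run() itself cannot be called from the cell.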

IAmStoxe commented 6 months ago

To run asyncio in jupyter, add this bad boy to the top:

https://gist.github.com/IAmStoxe/fe3ec1f4420163cb80a974b95e7962cb

# This import is required only for jupyter notebooks, since they have their own eventloop
import nest_asyncio

nest_asyncio.apply()

karan842 commented 5 months ago

@jerry1900 thanks for sharing this tip about the environment :)

I was using a Jupyter Notebook inside VSCode and still got a NotImplementedError.

Then I moved to a .py script and ran the exact same code.

It worked for me.

IAmStoxe commented 5 months ago

That's precisely how I do it. Maybe restart the kernel.
