jupyter-naas / awesome-notebooks

A powerful data & AI notebook templates catalog: prompts, plugins, models, workflow automation, analytics, code snippets - following the IMO framework to be searchable and reusable in any context.
https://naas.ai/search
BSD 3-Clause "New" or "Revised" License
2.65k stars 446 forks

LangChain - Perform Web scraping #2267

Open FlorentLvr opened 11 months ago

FlorentLvr commented 11 months ago

This notebook performs web scraping to gather content from the web and runs an LLM over it. It is useful for organizations to break through and achieve their goals.

FlorentLvr commented 11 months ago

🚀 Branch and template have been created and pushed. You should work on:

Mohitraut07 commented 11 months ago

I am a developer and I would love to work on this issue. Please assign it to me.

FlorentLvr commented 11 months ago

Hi @Mohitraut07, I am glad you want to contribute! Please follow the instructions in the awesome-notebooks README.md to start contributing: https://github.com/jupyter-naas/awesome-notebooks/blob/master/README.md#how-to-contribute Let me know if you have any questions! 🙏 Cheers!

FlorentLvr commented 11 months ago

@Mohitraut07, just checking in! I didn't receive your application: https://bit.ly/3F8Jsjr Let me know if you have any questions.

FlorentLvr commented 11 months ago

@Mohitraut07, just checking in, is everything okay?

hope205 commented 11 months ago

Hi @FlorentLvr, I want to work on this issue. Can you please assign it to me?

FlorentLvr commented 11 months ago

@hope205! Sure, let us know if you have any questions :) @srini047

srini047 commented 11 months ago

Awesome @hope205, feel free to reach out in case you need anything. I can assist you further. Looking forward to the contribution.

hope205 commented 11 months ago

When I cloned this repo, I couldn't find the LangChain web scraping notebook.

FlorentLvr commented 11 months ago

Did you switch to the right branch? I can see the template on GitHub:

[image]

srini047 commented 11 months ago

@hope205, make sure you are on the right branch and go to the directory suggested by @FlorentLvr. [image]

hope205 commented 11 months ago

Thanks @srini047. I found it and have started working on it.

hope205 commented 11 months ago

Hello @FlorentLvr, I have been working on the notebook, but I am running into errors that come from the LangChain framework itself. The AsyncChromiumLoader class has some internal issues:

from langchain.document_loaders import AsyncChromiumLoader
from langchain.document_transformers import BeautifulSoupTransformer

# Load HTML
loader = AsyncChromiumLoader([url])
html = loader.load()

It fails at this point with the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [3], in <cell line: 3>()
      1 # Load HTML
      2 loader = AsyncChromiumLoader([url])
----> 3 html = loader.load()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\document_loaders\chromium.py:90, in AsyncChromiumLoader.load(self)
     81 def load(self) -> List[Document]:
     82     """
     83     Load and return all Documents from the provided URLs.
     84 
   (...)
     88 
     89     """
---> 90     return list(self.lazy_load())

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\document_loaders\chromium.py:77, in AsyncChromiumLoader.lazy_load(self)
     66 """
     67 Lazily load text content from the provided URLs.
     68 
   (...)
     74 
     75 """
     76 for url in self.urls:
---> 77     html_content = asyncio.run(self.ascrape_playwright(url))
     78     metadata = {"source": url}
     79     yield Document(page_content=html_content, metadata=metadata)

File ~\AppData\Local\Programs\Python\Python310\lib\asyncio\runners.py:33, in run(main, debug)
      9 """Execute the coroutine and return the result.
     10 
     11 This function runs the passed coroutine, taking care of
   (...)
     30     asyncio.run(main())
     31 """
     32 if events._get_running_loop() is not None:
---> 33     raise RuntimeError(
     34         "asyncio.run() cannot be called from a running event loop")
     36 if not coroutines.iscoroutine(main):
     37     raise ValueError("a coroutine was expected, got {!r}".format(main))

RuntimeError: asyncio.run() cannot be called from a running event loop

srini047 commented 11 months ago

Hey @hope205 !

Sorry for the delay in responding. Did you install Playwright, and are you trying it in Naas Lab? That can be a problem, as the browser drivers may not be able to run in cloud-based Jupyter environments.
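
Separately from the driver question, the last frame of the traceback points at a likely cause: `AsyncChromiumLoader.lazy_load` calls `asyncio.run()`, and Jupyter executes cells inside an event loop that is already running, so the call is rejected. Below is a minimal, standard-library-only sketch of that failure and of one possible workaround, which is to run the blocking call in a worker thread where no loop is running. `blocking_load` is a hypothetical stand-in for `loader.load()`, and `call_outside_running_loop` and `notebook_cell` are illustrative names, not part of LangChain.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def blocking_load():
    """Hypothetical stand-in for loader.load(): a sync function that calls asyncio.run()."""
    async def fetch():
        await asyncio.sleep(0)
        return "<html>stub</html>"

    coro = fetch()
    try:
        return asyncio.run(coro)
    finally:
        # Avoids a "never awaited" warning when asyncio.run() refuses to start.
        coro.close()


def call_outside_running_loop(func, *args):
    """Run func in a worker thread, which has no running event loop of its own."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(func, *args).result()


async def notebook_cell():
    """Simulates a Jupyter cell: this body executes inside a running event loop."""
    try:
        blocking_load()
        direct = "ok"
    except RuntimeError as exc:
        direct = str(exc)  # same failure mode as in the traceback
    via_thread = call_outside_running_loop(blocking_load)
    return direct, via_thread


direct_result, thread_result = asyncio.run(notebook_cell())
print(direct_result)
print(thread_result)
```

In the notebook, the same helper could be tried as `html = call_outside_running_loop(loader.load)`. Another commonly used option is the third-party `nest_asyncio` package (`nest_asyncio.apply()`), which patches asyncio to allow nested calls. Whether either approach works with Playwright on a given machine would need to be tested.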

hope205 commented 11 months ago

No problem @srini047. I installed Playwright, but I am not using naas.ai labs. I am running it on my PC.

FlorentLvr commented 11 months ago

Hey @hope205! Just checking in, did you make some progress? 🙏

hope205 commented 11 months ago

I am still working on it.