unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
Apache License 2.0

Support extraction of media files from <source> tag in HTML #71

Closed aravindkarnam closed 3 weeks ago

aravindkarnam commented 1 month ago

Source tags pointing to multiple media file formats are used inside other media tags like audio and video, e.g.:

<video controls>
  <source src="foo.webm" type="video/webm" />
  <source src="foo.ogg" type="video/ogg" />
  <source src="foo.mov" type="video/quicktime" />
  I'm sorry; your browser doesn't support HTML video.
</video>

This way browsers can pick the first format they support. Currently we only extract media based on the src attribute of tags like audio, video, etc., so files referenced from nested source tags are missed.
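
As a rough illustration, here is a minimal sketch of how the nested sources could be collected, assuming a BeautifulSoup-based parse (the project's actual extraction code may be organised differently):

from bs4 import BeautifulSoup  # sketch only; crawl4ai's real parser may differ

def extract_media_sources(html: str) -> dict:
    """Collect media URLs from <audio>/<video>, including nested <source> children."""
    soup = BeautifulSoup(html, "html.parser")
    media = {"videos": [], "audios": []}
    for tag_name, key in (("video", "videos"), ("audio", "audios")):
        for tag in soup.find_all(tag_name):
            if tag.get("src"):                     # direct src attribute (already handled today)
                media[key].append(tag["src"])
            for source in tag.find_all("source"):  # nested <source> tags (the missing case)
                if source.get("src"):
                    media[key].append(source["src"])
    return media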

For example, the following site has several high-quality videos, but they are not getting extracted.

@unclecode I'll try to work on it this weekend. Will raise a PR once completed.

unclecode commented 1 month ago

@aravindkarnam Oh you’re absolutely right—great job noticing that! Please go ahead and let me know the results. Also, consider what other similar tags we might be missing. For example:

<picture>
  <source media="(min-width:650px)" srcset="img_pink_flowers.jpg">
  <source media="(min-width:465px)" srcset="img_white_flower.jpg">
  <img src="img_orange_flowers.jpg" alt="Flowers" style="width:auto;">
</picture>
unclecode commented 1 month ago

@aravindkarnam Hi, I hope you are doing well. Did you find time to work on this?

aravindkarnam commented 1 month ago

@unclecode Hey! Doing well. I've added support for extracting media from source tags inside both audio and video tags. Added a commit here

As for the picture tag, I've read in the docs that

The <img> element is required as the last child of the <picture> element, as a fallback option if none of the source tags matches.

Since the img tag is a required child of the <picture> tag, the existing code for image extraction will catch it, so I did not add any separate handling for it.
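
To illustrate that reasoning (a sketch only, again assuming a BeautifulSoup-style traversal rather than the project's actual code), a plain image query already descends into <picture> and picks up the fallback <img>:

from bs4 import BeautifulSoup

html = """
<picture>
  <source media="(min-width:650px)" srcset="img_pink_flowers.jpg">
  <img src="img_orange_flowers.jpg" alt="Flowers">
</picture>
"""
soup = BeautifulSoup(html, "html.parser")
# find_all("img") searches all descendants, so the fallback image inside <picture>
# is returned by the existing image extraction without any <picture>-specific code.
print([img["src"] for img in soup.find_all("img")])   # ['img_orange_flowers.jpg']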

While I was at it, I've extended the extraction of a text description from the nearest parent element to video and audio as well. The results look splendid. I believe we don't need any filtering for videos and audio similar to images, because there usually aren't as many on a website unless it's a video site like YouTube.
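
A minimal sketch of the nearest-parent description idea, assuming a BeautifulSoup tree (the function name and the three-level limit here are illustrative, not the actual implementation):

from bs4 import BeautifulSoup

def nearest_parent_text(tag, max_levels: int = 3) -> str:
    """Walk up from a media tag and return the first ancestor text found,
    used as a rough description for the video/audio element."""
    parent = tag.parent
    for _ in range(max_levels):
        if parent is None:
            break
        text = parent.get_text(" ", strip=True)
        if text:
            return text
        parent = parent.parent
    return ""

soup = BeautifulSoup(
    '<figure><figcaption>Launch recap</figcaption><video src="launch.mp4"></video></figure>',
    "html.parser",
)
print(nearest_parent_text(soup.find("video")))   # "Launch recap"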

Please create a staging branch for the next release; I can send a PR to it.

unclecode commented 1 month ago

Hello @aravindkarnam, it’s been a while! I just returned from a short business trip. I created the staging branch and pushed updates to main, including some fixes. Please update and send the PR. Additionally, I think it’s a good time to work on integration with Langchain and Llamaindex. Let me know if you are interested in managing it.

aravindkarnam commented 1 month ago

@unclecode Hi! Raised a PR to staging.

I think it’s a good time to work on integration with Langchain and Llamaindex. Let me know if you are interested in managing it.

I've been working with Langchain for the past few months on my own project, so yeah, I can pick this one up. It's a good opportunity for me to explore a little deeper as well. Since I'm already exposed to Langchain, I can pick that up first.

Looks like we have to build it as a document loader component. On preliminary inspection, it looks like these document loaders are implemented either as an API or as a package.

If implemented as an API, users will have to host crawl4ai as a service somewhere and perhaps use variables like the API endpoint, auth headers, etc. to connect to that service from wherever they call the document loader. I can see that Firecrawl is integrated this way as well.

If implemented as a package, crawl4ai will be used as a dependency directly within, say, a Flask or Django app. I can see that some popular Python packages are integrated this way. This will also be ideal for notebooks.

What are your thoughts on both approaches? We can do either one or both. I think the package will be better for users who build apps as a monolith, experimenters with notebooks, etc., while the API approach will be better for developers building apps as microservices. I'm building my own app with a microservices architecture.
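
For the package approach, a document loader could look roughly like the sketch below. It assumes LangChain's BaseLoader/Document interfaces and a synchronous crawl4ai WebCrawler with a run(url=...) call returning an object with a markdown attribute; those crawl4ai names are assumptions and may not match the real API:

from typing import Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

from crawl4ai import WebCrawler   # assumed entry point; the real package API may differ


class Crawl4AILoader(BaseLoader):
    """Hypothetical LangChain loader wrapping crawl4ai as a local dependency."""

    def __init__(self, urls: list[str]):
        self.urls = urls
        self.crawler = WebCrawler()

    def lazy_load(self) -> Iterator[Document]:
        for url in self.urls:
            result = self.crawler.run(url=url)       # assumed call returning the crawled page
            yield Document(
                page_content=result.markdown,         # assumed attribute holding the page text
                metadata={"source": url},
            )


# Usage: docs = Crawl4AILoader(["https://example.com"]).load()

The API approach would keep the same Document output but swap the WebCrawler call for an HTTP request to a hosted crawl4ai service.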

Also, have you given any thought to Discord/Slack for collaborators and users? It'll be easier to manage our communications there and move faster. We can move stuff to GH once we've triangulated bugs, finalised the implementation, etc.

I'll create a GH issue for Langchain integration tomorrow and pick it up over the weekend.

unclecode commented 3 weeks ago

@aravindkarnam Hey Arvind, rough week on my end, but glad to hear you have experience with Langchain. Let's go for it. I agree with you, let's package Crawl4AI for local use instead of just as an API. If we see solid engagement, we can look into putting it online as a service later on when we’re more comfortable.

Checked your pull request, all looks good. I’ve made a few other changes to the libraries, so I’ll merge and release a new version soon. As for Discord, I can set up the channels and make you a moderator. Would appreciate your help getting it started, maybe just a few simple channels to begin with and start inviting people. Sound good?

aravindkarnam commented 3 weeks ago

@unclecode Sounds good. You can send the Discord invite to my email aravind.karanam@gmail.com. I will set aside some time every day to help you with moderation and bug smashing. In my view, this is a very important and relevant project for our present times, one that should be fairly and easily available to all builders. It feels good to be part of it in any way that helps.

Re: Langchain integration: ack on the local package approach first; we can do the API approach later based on adoption. I'll open that in a new issue and start work on it today.

unclecode commented 3 weeks ago

@aravindkarnam Great to hear you're enjoying this, Arvind! I'm excited you're in, and the fact that you're having fun makes it even better. I know it's early, but I have long-term plans for this side project, and I'm investing in it to see where it goes. For the next version, I've already added a strategy to crawl videos and audio, including YouTube, and it's going really well. I'm also planning to switch to a strategy design pattern instead of the current parser method; then we'll use a factory method to select the right crawler strategy based on the source. This way, we can handle webpages, audio, video, and easily extend it to other data sources in the future. I'll share more once it's ready for testing. Also, I'll set up a Discord and send you the link. I'm happy you'll be putting in some time daily, absolutely awesome. Thanks a lot and "Let's Crawl the Web, Fast and Thorough" 😎
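
A rough illustration of the strategy-plus-factory idea described above (class and function names here are placeholders, not the actual design):

from abc import ABC, abstractmethod


class CrawlerStrategy(ABC):
    """Common interface every crawler strategy implements."""

    @abstractmethod
    def crawl(self, url: str) -> str: ...


class WebPageStrategy(CrawlerStrategy):
    def crawl(self, url: str) -> str:
        return f"crawled web page at {url}"      # placeholder for the HTML pipeline


class MediaStrategy(CrawlerStrategy):
    def crawl(self, url: str) -> str:
        return f"transcribed media at {url}"     # placeholder for audio/video handling


def strategy_for(url: str) -> CrawlerStrategy:
    """Factory method: pick the right strategy based on the source."""
    if "youtube.com" in url or url.endswith((".mp3", ".mp4", ".wav")):
        return MediaStrategy()
    return WebPageStrategy()


url = "https://youtube.com/watch?v=abc"
print(strategy_for(url).crawl(url))              # selects MediaStrategy for a YouTube source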