unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0

Feature Request: Filtering for Small and Invisible Text #274

Open nelzomal opened 1 week ago

nelzomal commented 1 week ago

Currently, there is filtering for small, invisible, or irrelevant images. However, implementing similar filtering for small or invisible text is equally important, as such text can significantly impact content quality by introducing noise or misleading information.

I would like to know if there is any plan to implement this feature. If not, I’d be happy to contribute by working on a pull request. Could someone provide pointers to the relevant parts of the codebase that would need modification to add this functionality?
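For illustration, the kind of check being proposed could look roughly like the sketch below. It is not crawl4ai code: the function names and the 6px threshold are hypothetical, and it only inspects an element's inline `style` attribute, so text hidden through stylesheets or scripts would need computed styles from a browser instead.

```python
import re

# Hypothetical threshold; a real implementation would likely make this configurable.
MIN_FONT_SIZE_PX = 6.0


def parse_inline_style(style):
    """Parse an inline style string into a {property: value} dict (lowercased)."""
    props = {}
    for decl in (style or "").split(";"):
        name, sep, value = decl.partition(":")
        if sep:
            cleaned = value.lower().replace("!important", "").strip()
            props[name.strip().lower()] = cleaned
    return props


def is_hidden_or_tiny(style, min_font_px=MIN_FONT_SIZE_PX):
    """Return True if the inline style suggests the text is invisible or very small."""
    props = parse_inline_style(style)

    if props.get("display") == "none":
        return True
    if props.get("visibility") in ("hidden", "collapse"):
        return True

    opacity = props.get("opacity")
    if opacity is not None:
        try:
            if float(opacity) == 0:
                return True
        except ValueError:
            pass  # Non-numeric opacity (e.g. "inherit"): cannot decide here.

    # Only absolute pixel sizes are handled; em/rem/% depend on the parent element.
    match = re.fullmatch(r"(\d*\.?\d+)px", props.get("font-size", ""))
    if match and float(match.group(1)) < min_font_px:
        return True

    return False
```

A scraper could call `is_hidden_or_tiny(tag.get("style"))` on each text-bearing element and skip those that return True, assuming the parser exposes the attribute that way.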

unclecode commented 5 days ago

@nelzomal Thank you so much for the suggestion; I agree with it. Please go ahead and create the pull request, and share your email address with me so I can send you a Discord invitation. I would love to have your help in moving this suggestion forward. For the relevant part of the code, check content_scraping_strategy.py::WebScrapingStrategy.score_image_for_usefulness(). Please also wait until I release the new version and then work from that version on the main branch; as of now, this function is at line 244. I appreciate your collaboration.

I also need you to pay attention to one very important thing: scraping time is crucial for me. The average is currently around 100 milliseconds, and I spent a lot of time making it that efficient. Any new step or process adds computation time, so please measure the processing time on multiple websites before and after you apply this change and make sure we are not slowing anything down. Thank you so much.
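One way to run the before/after comparison requested above is a small timing harness like the following sketch. The helper names are hypothetical, and `baseline` and `candidate` stand in for whatever callables wrap the scraping step without and with the new filter; nothing here is part of crawl4ai.

```python
import statistics
import time


def time_per_call_ms(fn, inputs, repeats=5):
    """Call fn on every input `repeats` times and summarize per-call latency in ms."""
    samples = []
    for _ in range(repeats):
        for item in inputs:
            start = time.perf_counter()
            fn(item)
            samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "n": len(samples),
    }


def compare(baseline, candidate, inputs, repeats=5):
    """Time both callables on the same inputs and report the change in mean latency."""
    before = time_per_call_ms(baseline, inputs, repeats)
    after = time_per_call_ms(candidate, inputs, repeats)
    return {
        "before": before,
        "after": after,
        "delta_mean_ms": after["mean_ms"] - before["mean_ms"],
    }
```

Feeding the harness pre-fetched HTML from several sites, rather than live URLs, keeps network latency out of the numbers, so the delta reflects only the added processing.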