unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0
16.61k stars 1.23k forks

Base64 image format not parsed #182

Open ZhangTianrong opened 1 month ago

ZhangTianrong commented 1 month ago

In WebScrappingStrategy, an image is extracted only if it scores points through one of the few checks in score_image_for_usefulness. Having a preferred format name is probably the easiest of these checks to satisfy, but base64 images never pass it because their format names are never parsed. A simple change like

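    # Data URIs look like "data:image/png;base64,...": use the MIME subtype as the format
    image_src = img.get('src', '')
    if "data:image/" in image_src:
        image_format = image_src.split(',')[0].split(';')[0].split('/')[1]
    else:
        # Otherwise fall back to the file extension (requires `import os`)
        image_format = os.path.splitext(image_src)[1].lower()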

seems enough to fix this.
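
For anyone who wants to try the idea outside the library, here is a minimal standalone sketch of the same parsing step. The helper name `get_image_format` is hypothetical and not part of crawl4ai. Returning the format without a leading dot, ignoring query strings, and trimming suffixes such as `+xml` are assumptions I made for the sketch, not behavior the library is confirmed to have.

```python
import os
from urllib.parse import urlparse


def get_image_format(image_src: str) -> str:
    """Return a lowercase format name such as 'png', or '' if none is found.

    Hypothetical helper for illustration only; not a crawl4ai function.
    """
    src = image_src.strip()
    if src.lower().startswith("data:image/"):
        # "data:image/png;base64,AAAA" -> header "data:image/png;base64"
        header = src.split(",", 1)[0]
        # Drop parameters such as ";base64" -> "data:image/png"
        mime_type = header.split(";", 1)[0]
        # Keep the MIME subtype -> "png"
        subtype = mime_type.split("/", 1)[1]
        # Assumption: trim structured suffixes, so "svg+xml" becomes "svg"
        return subtype.split("+", 1)[0].lower()
    # Regular URL or file path: take the extension of the path component,
    # so a query string like "?width=200" does not end up in the format.
    path = urlparse(src).path
    return os.path.splitext(path)[1].lower().lstrip(".")


if __name__ == "__main__":
    for example in (
        "data:image/png;base64,iVBORw0KGgo=",
        "https://example.com/a/photo.JPG?width=200",
        "images/logo.webp",
    ):
        print(example[:30], "->", get_image_format(example))
```

Checking the prefix with `startswith` rather than `in` is a small tightening: it avoids treating an ordinary URL that merely contains the text `data:image/` as a data URI.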

Background: I would like crawl4ai to be able to both process local HTML files and crawl with Playwright, like Wallabag or Omnivore. I have local HTML files downloaded with SingleFile, which keeps a WYSIWYG copy of whatever is rendered in the browser and encodes images in Base64 to keep the resulting file portable. However, crawl4ai won't extract images encoded in Base64.

unclecode commented 1 month ago

@ZhangTianrong Thanks for using our library. You're absolutely right; this is something we missed. It's a very good suggestion, and I've already added it to the library. It will ship in the new version 0.3.72, which we hope to release tonight or tomorrow. Thank you so much.