Base64 image format not parsed

In WebScrappingStrategy, images are kept or dropped according to the score they earn in score_image_for_usefulness, which awards points in only a few ways. Having a preferred format name is probably the easiest of these to satisfy, but base64 images are excluded because their format names are never parsed: a data URI usually has no file extension for os.path.splitext to find. A simple change like

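                image_src = img.get('src', '')
                if image_src.startswith('data:image/'):
                    # Data URI such as "data:image/png;base64,...": use the MIME subtype
                    image_format = image_src.split(',')[0].split(';')[0].split('/')[1]
                else:
                    # Regular URL or path: use the file extension
                    image_format = os.path.splitext(image_src)[1].lower()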

seems enough to fix this.
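
For illustration, the same idea can be written as a standalone helper. This is only a sketch: `parse_image_format` is a hypothetical name, not an existing crawl4ai function, and stripping the leading dot and the query string are my own assumptions about what the scorer would want.

```python
import os
from urllib.parse import urlparse


def parse_image_format(src: str) -> str:
    """Return a lowercase format name such as 'png', without a leading dot.

    Hypothetical helper for illustration; not part of crawl4ai.
    """
    src = src.strip()
    if src.lower().startswith("data:image/"):
        # "data:image/png;base64,AAAA" -> "data:image/png" -> "png"
        header = src.split(",", 1)[0]
        mime_type = header.split(";", 1)[0]
        return mime_type.split("/", 1)[1].lower()
    # Regular URL or file path: ignore any query string or fragment,
    # then take the extension without its dot.
    path = urlparse(src).path
    return os.path.splitext(path)[1].lower().lstrip(".")
```

Note that a data URI reports the MIME subtype, so a JPEG comes back as `jpeg` rather than `jpg`, and an SVG as `svg+xml`. If the scorer compares against extension-style names, those may need to be mapped.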

Background: I would like crawl4ai to be able to both process local HTML files and crawl with Playwright, like Wallabag or Omnivore. I have local HTML files downloaded with SingleFile, which saves a WYSIWYG copy of whatever is rendered in the browser and encodes images in Base64 to keep the resulting file portable. However, crawl4ai won't extract the Base64 images.

unclecode / crawl4ai

Base64 image format not parsed #182