Open ZhangTianrong opened 1 month ago
@ZhangTianrong Thanks for using our library. You're absolutely right; I think this is something we missed. Thank you for your suggestion. I think it's a very good suggestion. I already added it to the library, and I'm going to put it out in the new version 0.3.72 very soon. Hopefully, we'll be releasing it by tonight or tomorrow. Thank you so much.
In `WebScrappingStrategy`, images are extracted based on the very few possible ways they can get scores in `score_image_for_usefulness`. Having a preferable format name might be one of the easiest, but base64 images are excluded because their format names were never parsed. A simple change seems enough to fix this.
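To illustrate the kind of parsing the base64 case needs, here is a minimal sketch. It is not the actual crawl4ai code: `parse_image_format` is a hypothetical helper name, and it only assumes that the scorer wants a lowercase format name such as `png` or `jpg` from an image's `src`. For `data:` URIs the format is read from the MIME type; for ordinary URLs it is read from the file extension.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Matches the MIME subtype of an image data URI, e.g. "data:image/png;base64,...".
_DATA_URI_RE = re.compile(r"^data:image/([a-z0-9.+-]+)[;,]", re.IGNORECASE)


def parse_image_format(src: str) -> Optional[str]:
    """Hypothetical helper: return a lowercase image format name, or None."""
    src = src.strip()

    match = _DATA_URI_RE.match(src)
    if match:
        # Drop structured-syntax suffixes: "svg+xml" becomes "svg".
        return match.group(1).lower().split("+")[0]

    if src.lower().startswith("data:"):
        # A data URI that is not an image has no usable format name.
        return None

    # Ordinary URL: use the extension of the path, ignoring query and fragment.
    path = urlparse(src).path
    _, dot, ext = path.rpartition(".")
    if dot and ext and "/" not in ext:
        return ext.lower()
    return None
```

With a helper like this, `parse_image_format("data:image/png;base64,iVBORw0KGgo=")` yields `"png"`, so a base64 image could be scored on its format the same way as a linked one.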
Background: I would like crawl4ai to be able both to process local HTML files and to crawl with Playwright, like Wallabag or Omnivore. I have local HTML files downloaded with SingleFile, which keeps a WYSIWYG copy of whatever is rendered in the browser and encodes images in Base64 to keep the resulting file portable. However, crawl4ai won't extract images encoded in base64.