mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0
7.18k stars 521 forks

[BUG] Images show up only with svg information #288

Open AndyMik90 opened 2 weeks ago

AndyMik90 commented 2 weeks ago

Describe the Bug
When scraping sites such as valma.ai, images come back only as inline SVG placeholders: (data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%20768%20768'%3E%3C/svg%3E)

To Reproduce
Steps to reproduce the issue: scrape valma.ai.

Additional information: The only image that comes back with a correct link is the logo, which is in .webp format. The others (.png, etc.) come back only as data:image/svg+xml.
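To check how many images in a scrape result are affected, the markdown output can be scanned for inline SVG data URIs. The following is only a minimal sketch, assuming the result is available as a markdown string; the function names and the example.com logo URL are hypothetical and not part of Firecrawl.

```python
import re
from typing import List
from urllib.parse import unquote

# Matches markdown images of the form ![alt](src), capturing alt and src.
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def is_svg_placeholder(src: str) -> bool:
    """Return True if the image source is an inline SVG data URI."""
    return unquote(src).strip().lower().startswith("data:image/svg+xml")


def find_placeholder_images(markdown: str) -> List[str]:
    """Return the sources of all markdown images that are SVG placeholders."""
    return [
        src
        for _alt, src in IMAGE_PATTERN.findall(markdown)
        if is_svg_placeholder(src)
    ]


# Hypothetical sample: one real image link and one placeholder like the one reported.
sample = (
    "![logo](https://example.com/logo.webp) "
    "![hero](data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'"
    "%20viewBox='0%200%20768%20768'%3E%3C/svg%3E)"
)
print(len(find_placeholder_images(sample)), "placeholder image(s) found")
```

A non-zero count suggests the page was captured before its lazy-loaded images were swapped in.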

rafaelsideguide commented 1 week ago

Hey @AndyMik90 I just checked and it seems like valma.ai uses a WordPress plugin that renders all images as an SVG file before displaying the actual image. This probably affects SEO/performance metrics.

What I suggest is using the parameter { pageOptions: { waitFor: 1000 } } in your request, so the scraper waits for the page to fully render before extracting the data.
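As a rough illustration of the suggestion above, a request body carrying that option might be assembled like this. Only `pageOptions.waitFor` comes from this thread; the `url` field, the helper name, and the reading of 1000 as milliseconds are assumptions, and the endpoint and authentication are deliberately left out because they are not shown here.

```python
import json
from typing import Any, Dict


def build_scrape_payload(url: str, wait_for: int = 1000) -> Dict[str, Any]:
    """Build a scrape request body that asks the scraper to wait before extracting.

    Hypothetical helper: only pageOptions.waitFor is taken from the thread;
    the "url" key is an assumed field name.
    """
    if wait_for < 0:
        raise ValueError("wait_for must be non-negative")
    return {
        "url": url,
        "pageOptions": {"waitFor": wait_for},  # assumed to be milliseconds
    }


# Serialize the body as it would be sent in a JSON request.
print(json.dumps(build_scrape_payload("https://valma.ai"), indent=2))
```

If the placeholders persist, raising the wait value is a cheap first experiment before looking at interaction-triggered loading.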

AndyMik90 commented 1 week ago

Thanks. I will try it, but I'm afraid the cause may be that we use more advanced speed optimization techniques, like delayed JavaScript execution. Basically, a user interaction (click, scroll, etc.) is needed to trigger the JS to load.

This is great for speed but not for scraping.

rafaelsideguide commented 1 week ago

@AndyMik90 Awesome! If you need to scroll a specific component in the HTML, you can use { pageOptions: { scrollXPaths: string[] } } with the component's XPath. We haven't implemented a way to click yet, but we can consider adding it if it makes sense.
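A sketch of how the scroll option could be combined with a wait in one request body follows. `pageOptions.scrollXPaths` and `pageOptions.waitFor` are the names mentioned in this thread; the `url` field, the helper name, the example XPath, and the idea that both options can be sent together are assumptions.

```python
import json
from typing import Any, Dict, List, Optional


def build_scroll_payload(
    url: str,
    scroll_xpaths: List[str],
    wait_for: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a scrape request body that scrolls the given components.

    Hypothetical helper: the "url" key and combining both options are assumptions.
    """
    if not scroll_xpaths or any(not xp.strip() for xp in scroll_xpaths):
        raise ValueError("scroll_xpaths must contain at least one non-empty XPath")
    page_options: Dict[str, Any] = {"scrollXPaths": list(scroll_xpaths)}
    if wait_for is not None:
        page_options["waitFor"] = wait_for
    return {"url": url, "pageOptions": page_options}


# The XPath below is a made-up example, not taken from valma.ai.
payload = build_scroll_payload("https://valma.ai", ["//div[@id='gallery']"], wait_for=1000)
print(json.dumps(payload, indent=2))
```

Scrolling may be enough to trigger scripts that wait for a user interaction, although a site that listens only for clicks would not be covered.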