mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0
7.42k stars 541 forks

[BUG] Crawl should grab pdf files on a page by default #342

Closed calebpeffer closed 2 days ago

calebpeffer commented 6 days ago

I'm not sure if this is a bug or feature tbh.

On this webpage, https://ir.veeva.com/investors/news-and-events/events-and-presentations/default.aspx, there are a bunch of PDF files a customer wanted to grab. Scrape was able to grab all the PDF links and put them in the markdown.

Shouldn't the expected behavior be that all the PDFs are crawled and converted as pages?
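For illustration, here is a minimal sketch of how the PDF links that scrape already puts in the markdown could be collected as crawl candidates. This is not Firecrawl's implementation; the function name `pdf_links` and the regex are assumptions made for this example, and it only handles inline markdown links of the form `[text](url)`.

```python
import re
from urllib.parse import urljoin, urlparse

# Matches inline markdown links such as [text](url) and captures the URL part.
MD_LINK = re.compile(r"\[[^\]]*\]\(\s*([^)\s]+)")


def pdf_links(markdown: str, page_url: str) -> list[str]:
    """Return absolute, de-duplicated PDF URLs found in scraped markdown.

    Hypothetical helper: relative links are resolved against page_url.
    """
    seen: set[str] = set()
    found: list[str] = []
    for href in MD_LINK.findall(markdown):
        absolute = urljoin(page_url, href)
        # Check the path only, so a query string does not hide the extension.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in seen:
            seen.add(absolute)
            found.append(absolute)
    return found
```

Each returned URL could then be queued like any other page in the crawl. Whether that should be the default is the open question in this issue.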

rafaelsideguide commented 6 days ago

I thought the problem here was that the PDF links are not under the same baseURL as the initial page (which would be solved by #336), but when I tested it, the page content showed some captcha blockers. This shouldn't happen because we have proxies and bee.

...
[](https://ir.veeva.com/q4api/v4/captcha?clientId=_ctrl0_ctl18_UCCaptcha) |     |
| **Enter the code shown above.** |     |
| \\*  |     |

[Unsubscribe](/investors/resources/email-alerts/default.aspx)

[Powered By Q4 Inc. 5.128.1.2](http://q4inc.com/Powered-by-Q4/)",
...

@nickscamara I think this would be a good challenge for our new maintainers 😬
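To make the baseURL point above concrete, here is a rough sketch of how links could be split into those on the same host as the starting page and those hosted elsewhere. The names `is_same_host` and `split_by_host` are hypothetical, and comparing hosts is only an assumed reading of "same baseURL"; it does not describe what #336 actually changes.

```python
from urllib.parse import urljoin, urlparse


def is_same_host(link: str, page_url: str) -> bool:
    """True if link, resolved against page_url, stays on the same host."""
    absolute = urljoin(page_url, link)
    return urlparse(absolute).netloc.lower() == urlparse(page_url).netloc.lower()


def split_by_host(links: list[str], page_url: str) -> tuple[list[str], list[str]]:
    """Split links into (same-host, other-host) lists of absolute URLs."""
    internal: list[str] = []
    external: list[str] = []
    for link in links:
        absolute = urljoin(page_url, link)
        if is_same_host(link, page_url):
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

A crawler that only follows the first list would skip PDFs served from a separate file host, which is the behavior suspected here before the captcha turned up.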

calebpeffer commented 3 days ago

Following up on this issue. Also bringing in @emrek823 (the customer who brought this issue up) to observe.