[BUG] Crawl should grab pdf files on a page by default

calebpeffer commented 6 days ago

I'm not sure if this is a bug or feature tbh.

On this webpage, https://ir.veeva.com/investors/news-and-events/events-and-presentations/default.aspx, there are a bunch of PDF files a customer wanted to grab. Scrape was able to grab all the PDF links and put them in the markdown.

Shouldn't the expected behavior be that all the PDFs are crawled and converted as pages?
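
For anyone who needs the PDFs in the meantime, here is a minimal sketch of pulling the PDF links out of the scrape markdown so they can be fetched separately. It assumes the markdown is already available as a string. `pdf_links` and `MD_LINK` are hypothetical names for illustration, not part of the Firecrawl API, and the regex only handles plain inline `[text](url)` links.

```python
import re
from urllib.parse import urljoin, urlparse

# Matches inline markdown links of the form [text](url) or [text](url "title").
MD_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def pdf_links(markdown: str, page_url: str) -> list[tuple[str, bool]]:
    """Return (absolute_url, same_host) for each unique PDF link in the markdown.

    Hypothetical helper: same_host is True when the PDF is served from the
    same host as the page that was scraped.
    """
    page_host = urlparse(page_url).netloc.lower()
    seen = set()
    results = []
    for target in MD_LINK.findall(markdown):
        # Resolve relative links against the scraped page's URL.
        absolute = urljoin(page_url, target)
        parsed = urlparse(absolute)
        # Check the path only, so query strings do not hide the extension.
        if not parsed.path.lower().endswith(".pdf"):
            continue
        if absolute in seen:
            continue
        seen.add(absolute)
        results.append((absolute, parsed.netloc.lower() == page_host))
    return results
```

The `same_host` flag is there because a crawl that is restricted to the starting host would skip PDFs served from elsewhere, so it helps to see which links fall into that group.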

rafaelsideguide commented 6 days ago

I thought the problem here was that the PDFs have links that are not in the same baseURL as the initial page (which would be solved by #336), but when I tested it, the page content showed some captcha blockers. This shouldn't happen because we have proxies and bee.

...
[](https://ir.veeva.com/q4api/v4/captcha?clientId=_ctrl0_ctl18_UCCaptcha) |     |
| **Enter the code shown above.** |     |
| \\*  |     |

[Unsubscribe](/investors/resources/email-alerts/default.aspx)

[Powered By Q4 Inc. 5.128.1.2](http://q4inc.com/Powered-by-Q4/)",
...

@nickscamara I think this would be a good challenge for our new maintainers 😬

calebpeffer commented 3 days ago

Following up on this issue. Also bringing in @emrek823 to observe (the customer who brought this issue up).

mendableai / firecrawl

[BUG] Crawl should grab pdf files on a page by default #342