Open rkroelin opened 4 weeks ago
@rkroelin I quickly checked the page, and it seems like the links you're interested in are not children of the original page. For example, https://spp.org/western-services-documents?id=296577#:~:text=Markets%2B%20Governance%20Nominations is not a child link of https://spp.org/western-services-documents/?id=370783.
Therefore, you might need to use `crawlerOptions.allowBackwardCrawling` to capture these links. Additionally, try using `pageOptions.replaceAllPathsWithAbsolutePaths` to achieve the expected behavior for the `?id=370794` ones.
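To illustrate the "not a child" point, here is a minimal sketch of a path-prefix check, assuming a crawler treats a link as a child only when it is on the same host and its path sits under the start URL's path. The `is_child_path` helper is hypothetical and is not Firecrawl's actual rule; it only shows why the two example URLs above might not be related as parent and child.

```python
from urllib.parse import urlsplit


def is_child_path(start_url: str, link: str) -> bool:
    """Hypothetical rule: same host, and the link's path starts with the start path."""
    start, target = urlsplit(start_url), urlsplit(link)
    return target.netloc == start.netloc and target.path.startswith(start.path)


start = "https://spp.org/western-services-documents/?id=370783"
link = (
    "https://spp.org/western-services-documents?id=296577"
    "#:~:text=Markets%2B%20Governance%20Nominations"
)

# The start path ends in "/", the link's path does not, so the prefix test fails
# and the link is not treated as a child under this assumed rule.
print(is_child_path(start, link))
```

Under this assumed rule, the query string (`?id=...`) plays no part in the comparison, so pages that differ only by `id` are siblings rather than children.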
You can also check all options on our docs.
Let me know if this works!
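For reference, the two options mentioned above could be combined in a single request body along these lines. This is only a sketch: the option names come from the comment above, but the top-level `url` key, the `build_crawl_request` helper, and the overall request shape are assumptions, so please check them against the docs before relying on them.

```python
import json


def build_crawl_request(start_url: str) -> dict:
    """Assemble a crawl request body (shape assumed, not taken from the docs)."""
    return {
        "url": start_url,  # assumed key name
        "crawlerOptions": {
            # Follow links that are not children of the start URL.
            "allowBackwardCrawling": True,
        },
        "pageOptions": {
            # Rewrite relative paths such as "?id=..." to absolute URLs.
            "replaceAllPathsWithAbsolutePaths": True,
        },
    }


payload = build_crawl_request("https://spp.org/western-services-documents/?id=370794")
print(json.dumps(payload, indent=2))
```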
Thanks for the info! Will take a look and report back.
Describe the Bug
It appears the tool cannot follow certain types of links. I'm not a dev or an HTML junkie, so apologies if I've missed anything here. Thanks for taking a look. Great tool!
To Reproduce
Steps to reproduce the issue:
https://spp.org/western-services-documents/?id=370794
The results look like:
Inspecting the page, I see the HTML is using some sort of short reference, and I think this is fouling up the crawl.

![image](https://github.com/mendableai/firecrawl/assets/53322426/b085760f-a330-4db0-8fa2-b5865eca3027)
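If the "short reference" is a query-only relative link, it can be resolved against the page URL as sketched below. The `href` value here is a hypothetical stand-in for what the screenshot shows, not copied from the page source.

```python
from urllib.parse import urljoin

page_url = "https://spp.org/western-services-documents/?id=370794"
href = "?id=296577"  # hypothetical query-only href, as it might appear in the HTML

# A query-only reference keeps the base path and replaces just the query string.
resolved = urljoin(page_url, href)
print(resolved)  # https://spp.org/western-services-documents/?id=296577
```

A crawler that does not resolve such references against the page URL would have no absolute address to follow, which would match the behavior described.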
Expected Behavior
I expected the tool to follow the links down the tree. At the bottom of the tree are documents (PDFs, etc.) that I would like to gather en masse for reference.

![image](https://github.com/mendableai/firecrawl/assets/53322426/24892210-1d73-421b-8f07-14c779601850)
The tool works fine if I run it on the exact page which has the documents linked, but there are many folders and manually browsing defeats the purpose.