mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[BUG] Some types of links not followed #377

Open rkroelin opened 4 weeks ago

rkroelin commented 4 weeks ago

Describe the Bug It appears the tool cannot follow certain types of links. I'm not a dev or an HTML junkie, so apologies if I've missed anything here. Thanks for taking a look. Great tool!

To Reproduce Steps to reproduce the issue:

  1. When I attempt to use the crawl feature on the website below, the tool does not seem to follow certain links.

https://spp.org/western-services-documents/?id=370794

The results look like this: [screenshot]

Inspecting the page, I see the HTML is using some sort of short reference, and I think this is fouling up the crawl. [screenshot]
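For what it's worth, those "short references" look like ordinary relative links: a browser resolves them against the current page's URL, and a crawler has to do the same. The resolution rules can be demonstrated with Python's standard `urllib.parse.urljoin` (the URLs below are from this page; the snippet is just an illustration, not part of the script):

```python
from urllib.parse import urljoin

base = "https://spp.org/western-services-documents/?id=370783"

# A query-only reference like '?id=296577' keeps the scheme, host, and
# path of the base URL and replaces only the query string.
print(urljoin(base, "?id=296577"))
# -> https://spp.org/western-services-documents/?id=296577

# A root-relative reference replaces the whole path instead.
print(urljoin(base, "/documents/file.pdf"))
# -> https://spp.org/documents/file.pdf
```

If the crawler treats the raw `?id=...` href as a literal URL instead of resolving it like this, the link would effectively be lost.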

Expected Behavior I expected the tool to follow the links down; ultimately, at the bottom of the tree, there are documents (.pdf files, etc.) I would like to gather en masse for reference. [screenshot]

The tool works fine if I run it on the exact page which has the documents linked, but there are many folders and manually browsing defeats the purpose.

Additional Context Here is the script I'm running:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="<YOUR_API_KEY>")

# crawlerOptions and pageOptions both belong in the single params dict
crawl_result = app.crawl_url(
    "https://spp.org/western-services-documents/?id=370783",
    {
        "crawlerOptions": {
            # "includes": ["western-services-documents/*"],
            "maxDepth": 50,
            # "returnOnlyUrls": True,
        },
        "pageOptions": {
            "onlyMainContent": True,
        },
    },
)

# # Check the structure of crawl_result
# print(crawl_result)

# Write the markdown to a file
with open('results.md', 'w') as f:
    for result in crawl_result:
        # Ensure result is a dictionary and has the key 'markdown'
        if isinstance(result, dict) and 'markdown' in result:
            f.write(result['markdown'] + "\n")
rafaelsideguide commented 3 weeks ago

@rkroelin I quickly checked the page, and it seems like the links you're interested in are not children of the original page. For example, https://spp.org/western-services-documents?id=296577#:~:text=Markets%2B%20Governance%20Nominations is not a child link of https://spp.org/western-services-documents/?id=370783.

Therefore, you might need to use crawlerOptions.allowBackwardCrawling to capture these links. Additionally, try using the pageOptions.replaceAllPathsWithAbsolutePaths to achieve the expected behavior for the ?id=370794 ones.
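Putting those two suggestions together, the params dict from the script above would look roughly like this (a sketch only, using the option names from this comment; not a verified run):

```python
# Options suggested above, merged into one params dict
# (v0 firecrawl-py style, where crawlerOptions and pageOptions
# travel together in a single dict).
params = {
    "crawlerOptions": {
        "maxDepth": 50,
        # follow links that are not children of the start URL
        "allowBackwardCrawling": True,
    },
    "pageOptions": {
        "onlyMainContent": True,
        # rewrite relative hrefs such as '?id=370794' into absolute URLs
        "replaceAllPathsWithAbsolutePaths": True,
    },
}

# Then, as in the original script:
# app = FirecrawlApp(api_key="<YOUR_API_KEY>")
# crawl_result = app.crawl_url(
#     "https://spp.org/western-services-documents/?id=370783", params
# )
```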

You can also check all options on our docs.

Let me know if this works!

rkroelin commented 3 weeks ago

Thanks for the info! Will take a look and report back.