mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[Bug] `formats: ["links"]` doesn't respect `excludeTags` #701

Open mogery opened 1 week ago

mogery commented 1 week ago

Discord thread

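import FirecrawlApp from "@mendable/firecrawl-js"

// Placeholder key (an assumption; substitute your own)
const app = new FirecrawlApp({ apiKey: "fc-YOUR_API_KEY" })

const INPUT_URL = "https://www.jndla.com/cases/class-action-administration"

// Request only the links format while excluding every <a> element
const response = await app.scrapeUrl(INPUT_URL, {
  formats: ["links"],
  excludeTags: ["a"],
})

console.log(response.links)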
// this still has links, even though we excluded the <a> tags

baraich commented 1 week ago

I was unable to reproduce the error. You can cross-check on the official website demo.

mogery commented 1 week ago

> I was unable to reproduce the error. You can cross-check on the official website demo.

The demo has the error. The correct behaviour would be for the links array to be empty.

baraich commented 1 week ago

Oh. So when we exclude the `a` tags, `links` should be `[]`.
Let's look at it from another perspective. What if I need to exclude `a` tags from the markup but still need them in the `links` array? I excluded the tags because I don't want them to show up in the markup.
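
To make the two readings concrete, here is a rough sketch of both behaviours side by side. It assumes nothing about Firecrawl's internals: `stripTags`, `extractLinks`, `scrapeSketch` and the `linksRespectExcludeTags` option are hypothetical names made up for illustration, and the regex-based HTML handling is a simplification that ignores nested tags.

```javascript
// Illustrative only: none of these names belong to the Firecrawl SDK,
// and regex-based HTML handling is a deliberate simplification.

// Remove every element whose tag name is listed in excludeTags
// (handles the simple, non-nested case only).
function stripTags(html, excludeTags) {
  return excludeTags.reduce((out, tag) => {
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, "gi");
    return out.replace(pattern, "");
  }, html);
}

// Collect the href value of every anchor tag.
function extractLinks(html) {
  return [...html.matchAll(/<a\b[^>]*\bhref="([^"]*)"/gi)].map((m) => m[1]);
}

// Hypothetical flag: should the links array be built from the filtered
// markup (the behaviour the issue asks for) or from the original markup?
function scrapeSketch(html, { excludeTags = [], linksRespectExcludeTags = true } = {}) {
  const filtered = stripTags(html, excludeTags);
  return {
    html: filtered,
    links: extractLinks(linksRespectExcludeTags ? filtered : html),
  };
}

const page = '<p>Intro <a href="https://example.com/one">one</a></p>';

const strict = scrapeSketch(page, { excludeTags: ["a"] });
const lenient = scrapeSketch(page, { excludeTags: ["a"], linksRespectExcludeTags: false });

console.log(strict.links);  // []
console.log(lenient.links); // [ 'https://example.com/one' ]
```

In both cases the returned markup has the anchors removed; only the source of the `links` array differs.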

mogery commented 1 week ago

Is this a hypothetical scenario or do you actually need this functionality?

baraich commented 1 week ago

Actually, I haven't used the library before and I'm not sure how it works. To answer your question, it's a hypothetical situation that I thought might be useful.

What you can do is explicitly set `links` to `[]`.
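
A minimal sketch of that workaround, applied on the caller's side after the scrape returns. `applyLinkExclusion` is a hypothetical helper, not part of the SDK, and the `scraped` object below is a stand-in for a real response.

```javascript
// Hypothetical helper: clears the links array when <a> tags were excluded.
// It does not call Firecrawl; it only post-processes a response object.
function applyLinkExclusion(response, excludeTags = []) {
  const excludesAnchors = excludeTags.some(
    (tag) => tag.trim().toLowerCase() === "a"
  );
  if (!excludesAnchors) return response;
  // Return a copy so the original response stays untouched.
  return { ...response, links: [] };
}

// Stand-in for what a scrape with formats: ["links"] might return.
const scraped = { links: ["https://example.com/a", "https://example.com/b"] };

const cleaned = applyLinkExclusion(scraped, ["a"]);
console.log(cleaned.links); // []
```

This only hides the symptom for the `a` tag; it does not filter links that sit inside other excluded elements.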