mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0
19.08k stars 1.48k forks

[Feat] Improved includeMain content #440

Open calebpeffer opened 4 months ago

calebpeffer commented 4 months ago

p3nnywh1stl3 on Discord had a great suggestion for the tags and selectors to exclude to get tidy content from a website:

["script", "style", "nav", "header", "footer", ".advertisement", ".sidebar", ".nav", ".menu", "#comments", "img", "a"]

It would be interesting if we added these to the tags that onlyMainContent excludes in V1.
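
To make the suggestion concrete, here is a minimal sketch of what applying such an exclusion list could look like. It is not Firecrawl's implementation: the `El` tree type and the `matches`, `prune`, and `textOf` helpers are hypothetical, and the matcher only understands the three simple selector forms that appear in the list (tag, `.class`, `#id`).

```typescript
// Hypothetical minimal element tree, used only for illustration.
// A real scraper would work on parsed HTML instead.
interface El {
  tag: string;
  id?: string;
  classes?: string[];
  text?: string;
  children?: El[];
}

// The selectors suggested in this issue.
const suggestedExcludes: string[] = [
  "script", "style", "nav", "header", "footer", ".advertisement",
  ".sidebar", ".nav", ".menu", "#comments", "img", "a",
];

// Supports only plain tag names, ".class" and "#id" selectors.
function matches(el: El, selector: string): boolean {
  if (selector.startsWith("#")) return el.id === selector.slice(1);
  if (selector.startsWith(".")) return (el.classes ?? []).includes(selector.slice(1));
  return el.tag === selector;
}

// Returns a copy of the tree with every matching element (and its subtree) removed.
function prune(el: El, excludes: string[]): El | null {
  if (excludes.some((s) => matches(el, s))) return null;
  const children = (el.children ?? [])
    .map((c) => prune(c, excludes))
    .filter((c): c is El => c !== null);
  return { ...el, children };
}

// Concatenates the text of the remaining elements in document order.
function textOf(el: El): string {
  const own = el.text ? [el.text] : [];
  const nested = (el.children ?? []).map(textOf).filter((t) => t.length > 0);
  return [...own, ...nested].join(" ");
}

const page: El = {
  tag: "body",
  children: [
    { tag: "header", text: "Site title" },
    { tag: "div", classes: ["sidebar"], text: "Related links" },
    { tag: "p", text: "Main story.", children: [{ tag: "a", text: "read more" }] },
    { tag: "div", id: "comments", text: "First!" },
  ],
};

const cleaned = prune(page, suggestedExcludes);
console.log(cleaned ? textOf(cleaned) : ""); // prints "Main story."
```

Note that excluding "a" drops the anchor text together with the link, as the "read more" link in the sample shows, so that entry may be more aggressive than intended for pages whose body text relies on inline links.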

juienpro commented 2 months ago

Why not a whitelist as well?

I find it strange that Firecrawl entirely ignores the 'article' tag when extracting the main content of news websites.
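
A whitelist-first approach could be sketched roughly as follows: look for a preferred container first and fall back to the whole document only when none is found. Everything here is an assumption made for illustration. The `El` type, `findFirst`, and `pickContentRoot` are hypothetical names, and including "main" next to "article" in the whitelist is an added guess, not something proposed in this thread.

```typescript
// Hypothetical minimal element tree, used only for illustration.
interface El {
  tag: string;
  text?: string;
  children?: El[];
}

// Hypothetical whitelist: "article" comes from the comment above,
// "main" is an extra assumption.
const preferredRoots: string[] = ["article", "main"];

// Depth-first search for the first element with the given tag name.
function findFirst(el: El, tag: string): El | null {
  if (el.tag === tag) return el;
  for (const child of el.children ?? []) {
    const hit = findFirst(child, tag);
    if (hit) return hit;
  }
  return null;
}

// Try each whitelisted tag in priority order; fall back to the whole document.
function pickContentRoot(doc: El, whitelist: string[]): El {
  for (const tag of whitelist) {
    const hit = findFirst(doc, tag);
    if (hit) return hit;
  }
  return doc;
}

const newsPage: El = {
  tag: "body",
  children: [
    { tag: "nav", text: "Home | World | Sport" },
    {
      tag: "article",
      children: [
        { tag: "h1", text: "Headline" },
        { tag: "p", text: "Story text." },
      ],
    },
    { tag: "footer", text: "Copyright notice" },
  ],
};

console.log(pickContentRoot(newsPage, preferredRoots).tag); // prints "article"
```

The blacklist could still be applied inside the chosen root afterwards, so the two approaches are complementary rather than mutually exclusive.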

rafaelsideguide commented 1 month ago

Currently excluded tags:

  "header",
  "footer",
  "nav",
  "aside",
  ".top",
  ".navbar",
  ".footer",
  ".bottom",
  "#footer",
  ".sidebar",
  ".side",
  ".aside",
  "#sidebar",
  ".modal",
  ".popup",
  "#modal",
  ".overlay",
  ".ad",
  ".ads",
  ".advert",
  "#ad",
  ".lang-selector",
  ".language",
  "#language-selector",
  ".social",
  ".social-media",
  ".social-links",
  "#social",
  ".menu",
  ".navigation",
  "#nav",
  ".breadcrumbs",
  "#breadcrumbs",
  "#search-form",
  ".search",
  "#search",
  ".share",
  "#share",
  ".cookie",
  "#cookie"

https://github.com/mendableai/firecrawl/blob/79e65f31ef1d7a4172870471d81501ee2e8aef22/apps/api/src/scraper/WebScraper/utils/excludeTags.ts
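
For reference, the two lists quoted in this thread can be compared directly. The sketch below only diffs the strings as they appear above; the variable names are hypothetical, and whether other parts of the scraper already handle elements such as "script" or "style" separately is not stated in this thread.

```typescript
// Selectors suggested in the first comment of this issue.
const suggested: string[] = [
  "script", "style", "nav", "header", "footer", ".advertisement",
  ".sidebar", ".nav", ".menu", "#comments", "img", "a",
];

// Currently excluded selectors, as quoted in the comment above.
const currentlyExcluded = new Set<string>([
  "header", "footer", "nav", "aside",
  ".top", ".navbar", ".footer", ".bottom", "#footer",
  ".sidebar", ".side", ".aside", "#sidebar",
  ".modal", ".popup", "#modal", ".overlay",
  ".ad", ".ads", ".advert", "#ad",
  ".lang-selector", ".language", "#language-selector",
  ".social", ".social-media", ".social-links", "#social",
  ".menu", ".navigation", "#nav",
  ".breadcrumbs", "#breadcrumbs",
  "#search-form", ".search", "#search",
  ".share", "#share",
  ".cookie", "#cookie",
]);

// Suggested selectors that do not appear in the current list.
const missing = suggested.filter((s) => !currentlyExcluded.has(s));
console.log(missing);
// prints [ 'script', 'style', '.advertisement', '.nav', '#comments', 'img', 'a' ]
```

Of the twelve suggested selectors, five ("nav", "header", "footer", ".sidebar", ".menu") are already covered by the quoted list, and the remaining seven are not.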