Open calebpeffer opened 4 months ago
Why not also a whitelist?
I find it strange that firecrawl entirely ignores the 'article' tag when extracting the main content of news websites.
Currently excluded tags and selectors:
"header",
"footer",
"nav",
"aside",
".top",
".navbar",
".footer",
".bottom",
"#footer",
".sidebar",
".side",
".aside",
"#sidebar",
".modal",
".popup",
"#modal",
".overlay",
".ad",
".ads",
".advert",
"#ad",
".lang-selector",
".language",
"#language-selector",
".social",
".social-media",
".social-links",
"#social",
".menu",
".navigation",
"#nav",
".breadcrumbs",
"#breadcrumbs",
"#search-form",
".search",
"#search",
".share",
"#share",
".cookie",
"#cookie"
p3nnywh1stl3 on Discord had a great suggestion for the tags to exclude to get tidy content from a website:
["script", "style", "nav", "header", "footer", ".advertisement", ".sidebar", ".nav", ".menu", "#comments", "img", "a"]
It would be interesting to add these tags to the onlyMainContent exclusions in V1.
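
To make the whitelist idea concrete, here is a rough sketch of how an allow list could sit on top of the exclude list. This is not firecrawl's actual implementation: `extract_main_text`, `matches`, and the `include` parameter are hypothetical names, it uses only Python's standard-library `html.parser`, and it supports only plain tag, `.class`, and `#id` selectors. If an `article` element exists, only text inside it is kept (still minus the excluded selectors); otherwise it falls back to exclusion only.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not tracked on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def matches(selector, tag, attrs):
    """Match a simple selector: 'tag', '.class' or '#id' (hypothetical helper)."""
    if selector.startswith("."):
        return selector[1:] in (attrs.get("class") or "").split()
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    return tag == selector


class _TextCollector(HTMLParser):
    def __init__(self, exclude, include):
        super().__init__()
        self.exclude = exclude
        self.include = include  # tag names; empty means "keep the whole page"
        self.stack = []         # one (tag, excluded, included) entry per open element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attr_map = dict(attrs)
        excluded = any(matches(s, tag, attr_map) for s in self.exclude)
        self.stack.append((tag, excluded, tag in self.include))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if any(excluded for _, excluded, _ in self.stack):
            return  # inside an excluded element
        if self.include and not any(inc for _, _, inc in self.stack):
            return  # whitelist active, but we are outside every whitelisted element
        self.chunks.append(text)


def extract_main_text(html, exclude, include=("article",)):
    """Prefer whitelisted elements; fall back to exclusion only if they yield nothing."""
    def run(include_tags):
        parser = _TextCollector(exclude, include_tags)
        parser.feed(html)
        parser.close()
        return " ".join(parser.chunks)

    return run(tuple(include)) or run(())


if __name__ == "__main__":
    page = ('<body><header>Site</header><article><h1>Title</h1>'
            '<p>Body text.</p><div class="share">Tweet</div></article>'
            '<div class="sidebar">Related</div><footer>Copyright</footer></body>')
    print(extract_main_text(page, ["script", "style", "header", "footer", ".sidebar", ".share"]))
```

One caveat with this sketch: excluding `header` globally also drops a `<header>` nested inside the `article`, so the two lists would probably need some tuning against real news pages.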