Closed QuangTQV closed 1 month ago
@QuangTQV There are multiple flags and parameters you can use to control how much data cleaning is applied. Please share your URL and I'll give you an example tailored to it. In the meantime, here is one example that I think will answer your question:
import asyncio
import os

from crawl4ai import AsyncWebCrawler

# Output directory; this path is an assumption, adjust it to your project layout.
__data = os.path.join(os.getcwd(), "data")
os.makedirs(__data, exist_ok=True)

async def main():
    async with AsyncWebCrawler(headless = True, sleep_on_close = True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            # bypass_cache=True,
            word_count_threshold = 10,
            excluded_tags = ['form'], # Optional - Default is None, this adds more control over the content extraction for markdown
            exclude_external_links = False, # Default is True
            exclude_social_media_links = True, # Default is True
            exclude_external_images = True, # Default is False
            # social_media_domains = ["facebook.com", "twitter.com", "instagram.com", ...] Here you can add more domains, default supported domains are in config.py
            html2text = {
                "escape_dot": False,
                # Add more options here
            }
        )
        # Save markdown to file
        with open(os.path.join(__data, "mexico_places.md"), "w") as f:
            f.write(result.markdown)
        print("Done")

asyncio.run(main())
The flags and parameters in this example let you focus on the content relevant to what you're searching for. The site is about places to visit in Mexico, and the call shows the excluded_tags list you can pass to drop specific HTML tags. There are also parameters to exclude or include external links, and the same goes for social media links such as Facebook, Pinterest, and so on. You can pass extra domains that you want excluded from the links, and the same applies to images. Our markdown generator uses Html2Text, and you can override some of its standard parameters for more advanced use. That said, please share your links and I'll provide customized code for your specific case.
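To make the idea behind excluded_tags more concrete, here is a minimal standalone sketch of tag-based exclusion using only Python's standard library. It is not crawl4ai's implementation; the names `TagExcluder` and `strip_excluded` are hypothetical, and the sketch assumes reasonably well-formed HTML where every non-void tag is closed.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagExcluder(HTMLParser):
    """Collect text, skipping anything nested inside an excluded tag.

    Hypothetical helper for illustration; assumes well-formed HTML.
    """

    def __init__(self, excluded_tags):
        super().__init__()
        self.excluded = set(excluded_tags)
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Enter an excluded element, or go one level deeper inside one.
        if self.skip_depth or tag in self.excluded:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def strip_excluded(html, excluded_tags):
    """Return the text of `html` with excluded elements removed."""
    parser = TagExcluder(excluded_tags)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        "<header>Site menu</header>"
        "<main><p>Visit Guanajuato.</p>"
        "<form><input name='q'>Search</form></main>"
        "<footer>Copyright</footer>"
    )
    print(strip_excluded(sample, ["header", "footer", "form"]))
```

In the crawler call, passing something like excluded_tags = ['header', 'footer', 'form'] is the analogous step, though how well it removes advertisements depends on how the target page marks them up.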
I'm closing this issue; please follow up at https://github.com/unclecode/crawl4ai/issues/181
How can I extract only the cleaned markdown? Right now it contains headers, footers, advertisements, etc.