unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0
16.37k stars 1.2k forks

Incorrect scraped content (another page's content is scraped) #267

Open jtha opened 1 week ago

jtha commented 1 week ago

I noticed some strange behaviour during retrieval: the content returned for the URL I provided belongs to a different page. I have reproduced this a few times, and so far it looks like it is triggered by setting magic=True. My guess is that simulating user behaviour inadvertently clicks a link on the page.

Turning magic off and enabling the other protection methods, leaving out only simulate_user=True, seems to make it behave as intended, at least as far as I can see. For reference, this was happening on Weaviate's documentation page, which has many links in the nav bar, side bar, and main content area; basically links everywhere.
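
A minimal sketch of the workaround described above, for anyone who wants to compare the two configurations. It assumes the per-call keyword options `magic`, `simulate_user`, and `override_navigator` on `AsyncWebCrawler.arun`. Which "protection methods" were actually enabled is not stated in the report, so `override_navigator=True` is an assumption, and the helper names `build_crawl_kwargs` and `crawl_page` are hypothetical.

```python
from typing import Any


def build_crawl_kwargs(simulate_user: bool = False) -> dict[str, Any]:
    """Per-call options for the workaround: magic mode off, simulate_user opt-in."""
    return {
        # magic=True is the setting that appeared to trigger the wrong-page content.
        "magic": False,
        # Left off by default, matching the workaround in the report.
        "simulate_user": simulate_user,
        # Assumption: one of the protection options that stays enabled.
        "override_navigator": True,
    }


async def crawl_page(url: str, simulate_user: bool = False) -> str:
    """Crawl one URL with the workaround settings and return its markdown."""
    # Imported lazily so build_crawl_kwargs can be used without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, **build_crawl_kwargs(simulate_user))
    return result.markdown
```

To try it, run `asyncio.run(crawl_page("https://example.com/docs"))` with the placeholder URL replaced by the page under test, then repeat with `simulate_user=True` and compare the two outputs to see whether the content changes.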

unclecode commented 2 days ago

@jtha Thanks for using our library. Let me work on this and see what is going on.