unclecode / crawl4ai

🔥🕷️ Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
https://crawl4ai.com
Apache License 2.0
17.01k stars 1.26k forks

Please respect robots.txt #107

Closed Joshix-1 closed 1 month ago

Joshix-1 commented 2 months ago

When crawling a website, the robots.txt should be respected.

BradKML commented 2 months ago

Sometimes circumventing censorship requires flexibility. It should be an option tho

archer-321 commented 2 months ago

> Sometimes circumventing censorship requires flexibility. It should be an option tho

robots.txt is not a means to censor anyone. Censorship means that you're prevented from expressing ideas or other content, and you're not expressing yourself when scraping a website.

If a website blocks a user agent in its robots.txt file, it means that the providers ask you not to scrape their website. Even wget respects this preference (by default), so it's only fair to ask a general-purpose scraping tool to at least do the same by default.
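For illustration, here is a minimal sketch of what "respect robots.txt by default" could look like, using only Python's standard-library `urllib.robotparser`. The function name `is_allowed`, the `respect_robots` flag, the `ExampleBot` user agent, and the sample rules are all assumptions made up for this example; none of it is Crawl4AI's actual API.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this example.
EXAMPLE_ROBOTS = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str,
               respect_robots: bool = True) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`.

    `respect_robots` is a hypothetical opt-out switch: the check is on
    by default, and the caller has to disable it explicitly.
    """
    if not respect_robots:
        return True
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/private/page"
    print(is_allowed(EXAMPLE_ROBOTS, "ExampleBot", url))
    print(is_allowed(EXAMPLE_ROBOTS, "ExampleBot", url, respect_robots=False))
```

In a real crawler the robots.txt text would be fetched (and cached) per host before the first request; that part is left out to keep the sketch self-contained.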

memoryhash commented 2 months ago

@Joshix-1 lol. if you don't want people to access things, then don't put them on the internet.

aravindkarnam commented 1 month ago

@Joshix-1 This is on the roadmap and it will be configurable.

BradKML commented 1 month ago

@memoryhash there is a difference between open access (telling Disallow to get bent) and spamming a server with requests (which respecting Crawl-delay out of courtesy avoids), but people mix the latter with the former, and that is very unfortunate. For @archer-321: censorship is not just blocking the freedom to express opinions, but also stopping the freedom to archive for historical purposes (looking at the Internet Archive). People will throw lawyers at it just to try to memory-hole the public record.
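The two directives mentioned here are independent in robots.txt, so a crawler can treat them separately. As a rough sketch (again using the standard-library `urllib.robotparser`, with a made-up `polite_delay` helper, a made-up `AnyBot` user agent, and invented sample rules), the `Crawl-delay` value can be read without consulting `Disallow` at all:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this example.
ROBOTS_WITH_DELAY = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""


def polite_delay(robots_txt: str, user_agent: str,
                 fallback: float = 1.0) -> float:
    """Return the seconds to wait between requests for `user_agent`.

    Uses the Crawl-delay declared in `robots_txt` if there is one,
    otherwise `fallback` (an arbitrary default chosen for this sketch).
    Disallow rules play no part in this decision.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when not declared
    return float(delay) if delay is not None else fallback


if __name__ == "__main__":
    print(polite_delay(ROBOTS_WITH_DELAY, "AnyBot"))
```

A crawler would then sleep for that many seconds between requests to the same host, whatever it decides to do about `Disallow`.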

memoryhash commented 1 month ago

@BradKML There is no point in having soft boundaries in this world. This is the internet. If server operators care about or take issue with something, they can implement rate limits, client fingerprinting, user accounts and all sorts. It is naive and silly to expect people to abide by unenforceable soft boundaries. And even then, that's all pointless against anyone who actually knows what they are doing; it just gets rid of those doing small-time work and those with less experience and knowledge. Such is life.
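For context on the "rate limits" mentioned above, one common server-side approach is a token bucket kept per client. The sketch below is a generic illustration and not taken from any particular server; the class name, the parameters, and the injectable clock are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: int,
                 now: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._now = now                  # injectable clock, handy for testing
        self._tokens = float(capacity)   # start with a full bucket
        self._last = now()

    def allow(self) -> bool:
        """Return True and spend one token if a request may proceed."""
        current = self._now()
        elapsed = current - self._last
        self._last = current
        # Refill according to elapsed time, never beyond capacity.
        self._tokens = min(float(self.capacity),
                           self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=1.0, capacity=2)
    print([bucket.allow() for _ in range(3)])
```

A server would typically keep one such bucket per client key (IP address, API token, or similar) and answer with HTTP 429 when `allow()` returns False.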