Closed: Joshix-1 closed this issue 1 month ago
Sometimes circumventing censorship requires flexibility. It should be an option tho
`robots.txt` is not a means to censor anyone. Censorship means that you're prevented from expressing ideas or other content, and you're not expressing yourself when scraping a website. If a website blocks a user agent in its `robots.txt` file, it means that the providers ask you not to scrape their website. Even wget respects this preference (by default), so it's only fair to ask a general-purpose scraping tool to at least do the same by default.
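For readers unfamiliar with the mechanism, here is a minimal sketch of what "respecting robots.txt" looks like in practice, using Python's standard `urllib.robotparser`; the site, URL, and user agent below are placeholders for illustration, not anything from this project:

```python
# Minimal sketch: consult robots.txt before fetching, the way wget does by default.
# The site, URL, and user agent are placeholders, not from this project.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt

user_agent = "example-scraper"
url = "https://example.com/some/page.html"

if rp.can_fetch(user_agent, url):
    print("Allowed: no Disallow rule matches, go ahead and fetch", url)
else:
    print("Disallowed: the site operator asks this user agent not to fetch", url)
```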
@Joshix-1 lol. If you don't want people to access things, then don't put them on the internet.
@Joshix-1 This is on the roadmap and it will be configurable.
@memoryhash there is a difference between open access (telling `Disallow` to get bent) vs spamming a server with requests (respecting `Crawl-delay` out of courtesy), but people mix the latter with the former and that is very unfortunate.
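To make that distinction concrete, here is a rough sketch (again with Python's standard `urllib.robotparser` and placeholder names) of honoring `Crawl-delay` as a pacing hint, which is a separate question from whether `Disallow` is obeyed:

```python
# Sketch: read Crawl-delay / Request-rate from robots.txt and pace requests accordingly.
# Placeholder site and user agent; fetch() stands in for whatever HTTP client is used.
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

agent = "example-scraper"
delay = rp.crawl_delay(agent)    # Crawl-delay in seconds, or None if absent
rate = rp.request_rate(agent)    # RequestRate(requests, seconds), or None

# Fall back to a modest one-second pause if the site specifies nothing.
pause = delay or (rate.seconds / rate.requests if rate else 1.0)

for url in ("https://example.com/a", "https://example.com/b"):
    # fetch(url) would go here
    time.sleep(pause)  # space out requests instead of hammering the server
```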
For @archer-321: censorship is not just blocking the freedom to express opinions, but also stopping the freedom to archive for historical purposes (look at the Internet Archive). People will throw lawyers at it just to try to memory-hole the public record.
@BradKML There is no point in having soft boundaries in this world. This is the internet. If server operators care or take issue with things, they can implement rate limits, client fingerprinting, user accounts, and all sorts of other measures. It is naive and silly to expect people to abide by unenforceable soft boundaries. And even then, all of that is pointless against anyone who actually knows what they are doing; it just weeds out those doing small-time work and those with less experience and knowledge. Such is life.
When crawling a website, `robots.txt` should be respected.