scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

More documentation needed about the robots.txt protocol #6288

Closed josegicar closed 1 month ago

josegicar commented 1 month ago

Summary

In the file "robotstxt.py", I recommend adding more comments about how the protocol works, since it is not clear to some users.

Motivation

This suggestion was created so that people who read the robotstxt.py file know how it works and what the robots exclusion standard does. I took into account issue #6244, where "mery16q" did not fully understand the robots protocol.

Describe alternatives you've considered

I opened pull request #6287, where I added some comments at the start of the file at the path "scrapy/downloadermiddlewares/robotstxt.py". There you can read how robots.txt works.
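For readers unfamiliar with what the robots exclusion standard actually does, here is a minimal sketch using Python's standard-library `urllib.robotparser` (not Scrapy's middleware; the robots.txt body and URLs below are made up for illustration):

```python
from urllib import robotparser

# A hypothetical robots.txt policy: forbid every user agent
# from the /private/ path prefix, allow everything else.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) answers: may this agent crawl this URL?
print(parser.can_fetch("mybot", "https://example.com/public/page"))   # True
print(parser.can_fetch("mybot", "https://example.com/private/data"))  # False
```

Scrapy's RobotsTxtMiddleware applies the same kind of check before each request when `ROBOTSTXT_OBEY` is enabled.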

wRAR commented 1 month ago

The docs already link to https://www.robotstxt.org/ , is that not enough?

josegicar commented 1 month ago

It could be better for users to read it in the comments at the start of said file, rather than just following a URL :)

wRAR commented 1 month ago

The canonical place for the middleware documentation is the documentation, not the module docstrings.