libo26 / feedparser

Automatically exported from code.google.com/p/feedparser

Respect robots.txt #153

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
It would be nice if the parser could check for robots.txt on the website serving the feed and, of course, respect the site's rules.

This could easily be achieved with the standard library's robotparser module:
<http://docs.python.org/library/robotparser.html>
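As a minimal sketch of that suggestion (not part of feedparser itself; the wrapper name and user-agent string are placeholders, and the module is called urllib.robotparser in Python 3):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import feedparser

def fetch_if_allowed(feed_url, user_agent="my-feed-reader"):
    """Parse the feed only if the site's robots.txt permits it."""
    parts = urlparse(feed_url)
    robots = RobotFileParser()
    robots.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
    robots.read()  # fetches and parses the site's robots.txt
    if not robots.can_fetch(user_agent, feed_url):
        raise PermissionError("robots.txt disallows fetching " + feed_url)
    return feedparser.parse(feed_url)
```

Note that `robots.read()` performs a network fetch of robots.txt on every call, so a real application would likely cache one parser per host.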

-- 

Best regards, Mikkel

Original issue reported on code.google.com by 3...@detfalskested.dk on 21 Jan 2009 at 6:46

GoogleCodeExporter commented 9 years ago
Well, I just made it do it. The original file (feedparser.py from the official feedparser-4.1.zip download) and my modified file are included in the attached archive.

It messes with bozo and bozo_exception. I don't know if that's the right way to do it, but at least the dirty work is done, and it should (hopefully) be easy to adapt to how you'd like it to fit in.
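The attached archive is not preserved in this export, so the following is only a hypothetical sketch of the bozo convention the comment describes; the RobotsTxtDisallowed class and helper name are invented here, not taken from the patch:

```python
import feedparser

class RobotsTxtDisallowed(Exception):
    """Hypothetical error type; the patch's actual name is not preserved."""

def robots_blocked_result(feed_url):
    # Instead of raising, mirror feedparser's usual error convention:
    # hand back an empty result with bozo set and the error recorded
    # in bozo_exception.
    result = feedparser.FeedParserDict()
    result["bozo"] = 1
    result["bozo_exception"] = RobotsTxtDisallowed(
        "robots.txt disallows fetching " + feed_url)
    result["feed"] = feedparser.FeedParserDict()
    result["entries"] = []
    return result
```

An application could return such a result from its own robots.txt check, keeping the shape of feedparser's normal error reporting.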

-- 

Best regards, Mikkel

Original comment by 3...@detfalskested.dk on 21 Jan 2009 at 8:04


GoogleCodeExporter commented 9 years ago
Please close this bug as invalid.

Feedparser doesn't spider webpages; it's a library, not an application. 
Respecting robots.txt is something that must be done at the application level.

Original comment by kurtmckee on 4 Dec 2010 at 4:17

GoogleCodeExporter commented 9 years ago

Original comment by adewale on 4 Dec 2010 at 10:41