taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

Dofollow links #1

Closed hendricius closed 10 years ago

hendricius commented 10 years ago

This adjusts some settings for cases where pages should sometimes not be indexed.

taganaka commented 10 years ago

I like the overall idea, but I would prefer it to be configurable rather than "hardcoded".

This could become part of my to-do item to add support for robots.txt.

What do you think?

hendricius commented 10 years ago

Yep, it would be a good idea to have robots.txt support as well. I'll have a look at the anemone project.

hendricius commented 10 years ago

Yep, seems like we could just use this: https://github.com/chriskite/robotex

Let's open a branch for robots.txt support then. This will be helpful.

Would you prefer this in the core or as a plugin?

taganaka commented 10 years ago

A plugin sounds perfect for this use case. Thanks for jumping into this!
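
For readers following the thread, here is a rough sketch of the kind of configurable robots.txt check discussed above. It is not code from Polipus or robotex: the `SimpleRobots` class, its `user_agent:` option, and the sample rules are all hypothetical, and it understands only `User-agent` and `Disallow` lines with plain prefix matching (no `Allow`, wildcards, or crawl delays). How such a check would be wired into a Polipus plugin is not shown.

```ruby
require "uri"

# Hypothetical helper for illustration only (not part of Polipus or robotex).
# Parses robots.txt text and answers whether a URL may be crawled.
class SimpleRobots
  def initialize(robots_txt, user_agent: "*")
    @disallowed = []
    applies = false            # does the current rule group apply to us?
    previous_was_agent = false # were we just reading User-agent lines?

    robots_txt.each_line do |line|
      line = line.sub(/#.*/, "").strip # drop comments and whitespace
      next if line.empty?

      field, value = line.split(":", 2).map(&:strip)
      next if value.nil? # not a "field: value" line

      if field.downcase == "user-agent"
        matches = value == "*" || user_agent.downcase.include?(value.downcase)
        # Consecutive User-agent lines share one rule group.
        applies = previous_was_agent ? (applies || matches) : matches
        previous_was_agent = true
      else
        previous_was_agent = false
        if field.downcase == "disallow" && applies && !value.empty?
          @disallowed << value
        end
      end
    end
  end

  # True when no collected Disallow prefix matches the URL's path.
  def allowed?(url)
    path = URI.parse(url).path
    path = "/" if path.nil? || path.empty?
    @disallowed.none? { |prefix| path.start_with?(prefix) }
  end
end

# Sample rules, made up for this example.
SAMPLE_ROBOTS_TXT = <<~TXT
  User-agent: *
  Disallow: /private/
TXT

robots = SimpleRobots.new(SAMPLE_ROBOTS_TXT, user_agent: "polipus")
puts robots.allowed?("http://example.com/private/page") # prints "false"
puts robots.allowed?("http://example.com/public")       # prints "true"
```

Taking the user agent as a constructor option keeps the behaviour configurable rather than hardcoded, which is the preference expressed earlier in the thread. A real plugin would more likely delegate parsing to a maintained library such as the robotex gem linked above.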