sebsauvage / Shaarli

The personal, minimalist, super-fast, no-database delicious clone.
http://sebsauvage.net/wiki/doku.php?id=php:shaarli

Functionality to give crawlers the possibility to know some options #96

Open Nabellaleen opened 11 years ago

Nabellaleen commented 11 years ago

In many projects around Shaarli, we talk about crawlers fetching data from a list of Shaarli instances.

To avoid a large number of requests hitting our Shaarli instances from services we don't want, we could imagine a sort of "robots.txt", read by the crawlers, telling them whether or not they are allowed to do their job.

Possible solutions are:

With these solutions, the robot finds a list of keywords; if the keyword corresponding to its functionality is present, it is authorized to do its job.

Nabellaleen commented 11 years ago

It could be fine to use options.php to define the list of keywords.

ghost commented 11 years ago

A good idea is to use index.php?do=robot which will parse options.php to find specific options dedicated to robots.

Sample for options.php

$GLOBALS['config']['ROBOT']['SHARE_CSS'] = true;
$GLOBALS['config']['ROBOT']['AUTHORIZE_PLANET'] = false;

etc ... :smile:
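
A minimal sketch of what such a ?do=robot endpoint could return, assuming the ROBOT options above; the endpoint and the allow/deny output format are hypothetical, not an existing Shaarli feature:

<?php
// Hypothetical handler for index.php?do=robot (sketch only).
// It reads the ROBOT options from options.php and returns one
// "KEYWORD: allow|deny" line per option, so a crawler can check
// which features it is allowed to use on this instance.

$GLOBALS['config']['ROBOT']['SHARE_CSS'] = true;        // as in the sample above
$GLOBALS['config']['ROBOT']['AUTHORIZE_PLANET'] = false;

if (isset($_GET['do']) && $_GET['do'] === 'robot') {
    header('Content-Type: text/plain; charset=utf-8');
    foreach ($GLOBALS['config']['ROBOT'] as $keyword => $allowed) {
        echo $keyword . ': ' . ($allowed ? 'allow' : 'deny') . "\n";
    }
    exit;
}

A crawler would then fetch index.php?do=robot before anything else and skip the instance if its keyword is denied.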

nodiscc commented 9 years ago

robots.txt works well for that kind of thing and is rather standard (http://www.robotstxt.org/orig.html). Crawlers should announce themselves with a proper user-agent and respect robots.txt if they choose to (for example, disallow access to ?do=rss for User-agent: myshaarlicrawler; it is then up to myshaarlicrawler to respect or ignore this). You could also add meta tags specific to Shaarli (eg. <META NAME="SHAARLI" CONTENT="NORSS, NOSHARECSS">), but again, the crawler would have to implement this.
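
For reference, the robots.txt for that example (the crawler name and path come straight from the example above) would simply be:

# Ask the hypothetical "myshaarlicrawler" bot to stay away from the RSS feed.
User-agent: myshaarlicrawler
Disallow: /?do=rss

# Everything stays open to all other crawlers (empty Disallow = allow all).
User-agent: *
Disallow:

Whether this is honoured is entirely up to the crawler, as noted above.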

@Nabellaleen @tsyr2ko I suggest we close this issue as it's not directly related to Shaarli. Or do you know a crawler/planet that already implements this?

RSS (and HTML exports) already have what we need for filtering (by tag/private/public/search). There is a very early draft of a Shaarli backup/data extraction tool here: https://github.com/nodiscc/shaarchiver (it only extracts selected info from HTML exports for now, but RSS support is planned).