Nabellaleen opened this issue 11 years ago
It could be fine to use options.php to define the list of keywords.
A good idea would be to use index.php?do=robot, which would parse options.php to find the specific options dedicated to robots.
Sample for options.php:

```php
$GLOBALS['config']['ROBOT']['SHARE_CSS'] = true;
$GLOBALS['config']['ROBOT']['AUTHORIZE_PLANET'] = false;
```
etc ... :smile:
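A minimal sketch of what the `?do=robot` handler could look like, assuming the ROBOT options above are defined in options.php (this is hypothetical code, not an existing Shaarli feature):

```php
// Hypothetical sketch: answer ?do=robot by exposing the ROBOT options
// defined in options.php, so crawlers can check what they are allowed to do.
if (isset($_GET['do']) && $_GET['do'] === 'robot') {
    require_once 'options.php';

    $robotOptions = isset($GLOBALS['config']['ROBOT'])
        ? $GLOBALS['config']['ROBOT']
        : array();

    // Return the keyword list as JSON for the crawler to parse.
    header('Content-Type: application/json');
    echo json_encode($robotOptions);
    exit;
}
```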
robots.txt works well for that kind of thing and is rather standard (http://www.robotstxt.org/orig.html). Crawlers are supposed to announce themselves with a proper user-agent and respect robots.txt if they want (for example, disallow access to `?do=rss` for `User-agent: myshaarlicrawler`; it is up to myshaarlicrawler to respect or ignore this). You could also add meta tags specific to Shaarli (e.g. `<META NAME="SHAARLI" CONTENT="NORSS, NOSHARECSS">`), but again, the crawler would also have to implement this.
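As a rough illustration, assuming a Shaarli instance served at the web root and a crawler identifying itself as `myshaarlicrawler` (both hypothetical, and assuming the crawler matches query strings in Disallow rules), the robots.txt could look like:

```
# Forbid myshaarlicrawler from fetching the RSS feed,
# while leaving the rest of the site open to every crawler.
User-agent: myshaarlicrawler
Disallow: /?do=rss

User-agent: *
Disallow:
```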
@Nabellaleen @tsyr2ko I suggest we close this issue as it's not directly related to Shaarli. Or do you know a crawler/planet that already implements this?
RSS (and HTML exports) already have what we need for filtering (by tag/private/public/search). There is a very early draft of a Shaarli backup/data extraction tool here: https://github.com/nodiscc/shaarchiver (it only extracts selected info from HTML exports for now, but RSS support is planned).
In many projects around Shaarli, we talk about crawlers that fetch data from a list of Shaarli instances.
To avoid a large number of requests hitting our Shaarli from services we don't want, we could imagine a sort of "robots.txt", read by the crawlers, telling them whether or not they have the right to do their job.
Solutions are:
In these solutions, the robot finds a list of keywords, and if the one corresponding to its functionality is present, it is authorized to do its job.
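As a sketch of the crawler side, assuming the target Shaarli exposes its keyword list as JSON at `?do=robot` as suggested above (the endpoint, URL and keyword names are illustrative, not an existing API):

```php
<?php
// Hypothetical crawler-side check against a Shaarli "robot keyword" list.
$response = file_get_contents('https://links.example.org/?do=robot');
$keywords = ($response !== false) ? json_decode($response, true) : array();

// Only crawl the feed if the instance authorizes planet aggregation.
if (!empty($keywords['AUTHORIZE_PLANET'])) {
    $rss = file_get_contents('https://links.example.org/?do=rss');
    // ... aggregate $rss into the planet ...
} else {
    // The instance refused this kind of crawling: leave it alone.
}
```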