tkuhn / nanopub-server

A simple server to publish nanopublications
MIT License
11 stars 11 forks

Add a default robots.txt to nanopub-server #5

Open amalic opened 6 years ago

amalic commented 6 years ago

e.g.:

User-agent: Applebot
Allow: /

User-agent: baiduspider
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Facebot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: msnbot
Allow: /

User-agent: Naverbot
Allow: /

User-agent: seznambot
Allow: /

User-agent: Slurp
Allow: /

User-agent: teoma
Allow: /

User-agent: Twitterbot
Allow: /

User-agent: Yandex
Allow: /

User-agent: Yeti
Allow: /

User-agent: *
Disallow: /

tkuhn commented 6 years ago

Why do you think we need that?

And wouldn't the last two lines prevent nanopub servers from fetching nanopubs from each other?

amalic commented 6 years ago

The rationale is to prevent irrelevant web crawlers from generating tons of traffic.

The nanopub-server would not be affected, since it does not process robots.txt files. The file is aimed at web crawlers, at least the well-behaved ones that respect robots.txt.

The above file is just an example (I think it came from facebook.com/robots.txt).

tkuhn commented 6 years ago

OK, I see, but I think we should define a robots.txt that we ourselves also respect. And it should still be allowed to write scripts that retrieve nanopubs, for example; I wouldn't want to declare that illegitimate.
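One middle-ground sketch along these lines would throttle bulk crawling without forbidding programmatic access (note that Crawl-delay is a non-standard directive: Bing and Yandex honor it, Google ignores it):

```
# Hypothetical policy: allow all clients, but ask crawlers to pace themselves
User-agent: *
Crawl-delay: 10
```

Scripts that fetch individual nanopubs would remain unaffected, since robots.txt only applies to clients that choose to honor it.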

Have you already experienced this problem of traffic from irrelevant web crawlers with your server, or is this more of a preventive measure for the future?

amalic commented 6 years ago

Let's do this. The nanopub server currently identifies itself as Apache-HttpClient/4.3.4 (java 1.5). Shall we simply use "Nanopubs/1.0"?
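For illustration, setting such a User-Agent with the JDK's built-in HttpURLConnection could look like the sketch below (the server itself uses Apache HttpClient, where HttpClientBuilder's setUserAgent method serves the same purpose; the agent string and URL here are just placeholders from this thread):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class NanopubUserAgent {
    // Hypothetical agent string, as proposed above
    public static final String AGENT = "Nanopubs/1.0";

    /** Opens a connection with the custom User-Agent set (no request is sent yet). */
    public static HttpURLConnection open(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        // Override Java's default "Java/<version>" User-Agent header
        conn.setRequestProperty("User-Agent", AGENT);
        return conn;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(open("http://example.org/np/").getRequestProperty("User-Agent"));
    }
}
```

A recognizable agent string also makes it easy for server operators to whitelist nanopub traffic explicitly in their robots.txt or server logs.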