lumeland / lume

🔥 Static site generator for Deno 🦕
https://lume.land
MIT License
1.92k stars 92 forks

robots.txt plugin #543

Closed by oscarotero 9 months ago

oscarotero commented 10 months ago

It would be useful to block AI agents, for example: https://darkvisitors.com/robots-txt-builder

kwaa commented 10 months ago

For the Options type, perhaps we can refer to nuxt-modules/robots. (Maybe convert it to snake_case / camelCase?)

// PascalCase
site.use(robots([
  {
    UserAgent: 'ChatGPT-User',
    Disallow: '/',
  },
  {
    Comment: 'Sitemap',
    Sitemap: 'https://lume.land/sitemap.xml',
  }
]))

// snake_case
site.use(robots([
  {
    user_agent: 'ChatGPT-User',
    disallow: '/',
  },
  {
    comment: 'Sitemap',
    sitemap: 'https://lume.land/sitemap.xml',
  }
]))

This would generate:

User-agent: ChatGPT-User
Disallow: /

# Sitemap
Sitemap: https://lume.land/sitemap.xml
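The mapping from these rule objects to the robots.txt output above could be sketched like this. This is only an illustration of the proposed camelCase shape; the names `RobotsRule` and `serializeRules` are hypothetical, not Lume's actual API.

```typescript
// Hypothetical sketch: turning kwaa's camelCase rule objects into
// robots.txt text. Each object becomes one block of directives,
// blocks are separated by a blank line.
interface RobotsRule {
  comment?: string;
  userAgent?: string;
  allow?: string;
  disallow?: string;
  sitemap?: string;
}

function serializeRules(rules: RobotsRule[]): string {
  return rules
    .map((rule) => {
      const lines: string[] = [];
      if (rule.comment) lines.push(`# ${rule.comment}`);
      if (rule.userAgent) lines.push(`User-agent: ${rule.userAgent}`);
      if (rule.allow) lines.push(`Allow: ${rule.allow}`);
      if (rule.disallow) lines.push(`Disallow: ${rule.disallow}`);
      if (rule.sitemap) lines.push(`Sitemap: ${rule.sitemap}`);
      return lines.join("\n");
    })
    .join("\n\n");
}
```

Feeding in the two objects from the example above reproduces the `User-agent: ChatGPT-User` / `# Sitemap` output shown.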

oscarotero commented 10 months ago

That's a good reference, thanks! But Lume plugins always use objects for options, so maybe this structure fits better:

site.use(robots({
  agents: [
    {
      name: "ChatGPT-User",
      disallow: "/",
    },
  ],
  sitemap: "https://lume.land/sitemap.xml"
}));

I'd like to include some shortcuts to make it more ergonomic:

site.use(robots({
  agents: [
    "ChatGPT-User", // shortcut for "disallow: /"
    "AI", // shortcut for all AI agents
  ]
}));
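The shortcut idea could be normalized into full agent objects roughly like this. A hedged sketch only: the `Agent` type, `normalizeAgents` helper, and the contents of `AI_AGENTS` are assumptions for illustration; a real list of AI crawlers could come from a source like darkvisitors.com.

```typescript
// Hypothetical sketch of the shortcut normalization: a plain string
// expands to { name, disallow: "/" }, and the special string "AI"
// fans out to every known AI crawler.
interface Agent {
  name: string;
  allow?: string;
  disallow?: string;
}

// Illustrative subset only; not a maintained list.
const AI_AGENTS = ["GPTBot", "ChatGPT-User", "CCBot"];

function normalizeAgents(agents: (string | Agent)[]): Agent[] {
  return agents.flatMap((agent) => {
    if (typeof agent !== "string") return [agent];
    const names = agent === "AI" ? AI_AGENTS : [agent];
    return names.map((name) => ({ name, disallow: "/" }));
  });
}
```

With this, `normalizeAgents(["ChatGPT-User", "AI"])` and the fully spelled-out object form produce the same internal structure, so both styles can share one serializer.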

kwaa commented 10 months ago

> That's a good reference, thanks! But Lume plugins always use objects for options, so maybe this structure fits better:

Perhaps it would be more appropriate to use the disallow keyword rather than agents.

site.use(robots({
  disallow: ['ChatGPT-User'],
  rules: [{
    userAgent: '*',
    allow: '/'
  }],
  sitemap: 'https://lume.land/sitemap.xml',
}))

Also, I don't think maintaining an AI agents list is quite necessary.

oscarotero commented 10 months ago

Thinking of privacy and good defaults, maybe the plugin should disable access by default and only grant access to bots that are explicitly defined. For example:

site.use(robots({
  allow: ["Google", "Bing", "Yahoo", "ChatGPT"],
  paths: "/"
}));

This would generate this file:

User-Agent: *
Disallow: /

User-Agent: Googlebot
Allow: /

User-Agent: Bingbot
Allow: /

User-Agent: Yahoo-MMCrawler
Allow: /

User-Agent: ChatGPT-User
Allow: /
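A generator for this default-deny output might look like the sketch below. The `NAME_MAP` aliases (Google to Googlebot, etc.) simply mirror the example above; whether such a mapping should exist at all is debated later in the thread, and `denyByDefault` is a hypothetical name, not Lume's API.

```typescript
// Hypothetical sketch: deny everything by default, then emit an
// Allow block for each explicitly listed bot.
const NAME_MAP: Record<string, string> = {
  Google: "Googlebot",
  Bing: "Bingbot",
  Yahoo: "Yahoo-MMCrawler",
  ChatGPT: "ChatGPT-User",
};

function denyByDefault(allow: string[], paths = "/"): string {
  const blocks = [`User-Agent: *\nDisallow: ${paths}`];
  for (const name of allow) {
    // Fall back to the given name when no alias is known.
    blocks.push(`User-Agent: ${NAME_MAP[name] ?? name}\nAllow: ${paths}`);
  }
  return blocks.join("\n\n");
}
```

Note that a `User-Agent: *` / `Disallow: /` block only blocks crawlers that honor robots.txt; it is a politeness mechanism, not access control.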

kwaa commented 10 months ago

> Thinking of privacy and good defaults, maybe the plugin should disable access by default and only grant access to bots that are explicitly defined.

Ideally, users could choose between blacklist and whitelist mode:

site.use(robots({
  whitelist: true,
  allow: ['ChatGPT-User']
}))

site.use(robots({
  // whitelist: false, (default)
  disallow: ['ChatGPT-User']
}))
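The two modes could be handled with one option flag, roughly as below. This is a sketch of the idea under discussion, not the plugin's real interface; `Options` and `buildRobots` are names invented for illustration.

```typescript
// Hypothetical sketch of the whitelist toggle: in whitelist mode
// everything is disallowed except the listed agents; in blacklist
// mode (the default) everything is allowed except the listed agents.
interface Options {
  whitelist?: boolean;
  allow?: string[];
  disallow?: string[];
}

function buildRobots(
  { whitelist = false, allow = [], disallow = [] }: Options,
): string {
  const blocks: string[] = [];
  if (whitelist) {
    blocks.push("User-agent: *\nDisallow: /");
    for (const name of allow) blocks.push(`User-agent: ${name}\nAllow: /`);
  } else {
    blocks.push("User-agent: *\nAllow: /");
    for (const name of disallow) blocks.push(`User-agent: ${name}\nDisallow: /`);
  }
  return blocks.join("\n\n");
}
```

One quirk of this shape is that `allow` is only meaningful in whitelist mode and `disallow` only in blacklist mode, which may argue for two distinct option shapes instead of a boolean flag.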

kwaa commented 10 months ago

Given the myriad possible User-Agent values, I likewise think it's best not to manage name conversions (like Google => Googlebot or ChatGPT => ChatGPT-User).

https://darkvisitors.com/agents