Do not add pages marked with `noindex` to sitemap

iamvishnusankar / next-sitemap

Sitemap generator for next.js. Generate sitemap(s) and robots.txt for all static/pre-rendered/dynamic/server-side pages.

https://next-sitemap.iamvishnusankar.com

MIT License

3.24k stars 124 forks

Do not add pages marked with `noindex` to sitemap #182

Closed fedeci closed 2 years ago

fedeci commented 3 years ago

Is your feature request related to a problem? Please describe.

Pages with `<meta content="noindex, follow" name="robots" />` should not be added to the sitemap. I am not sure how this library works, but I don't think it actually reads the content of the files, so it may be hard to detect that meta tag.

eduncan911 commented 3 years ago

With other static site generators (Hugo, Octopress, Gatsby, etc.), you typically mark the post as noindex in the YAML frontmatter, or in some programmatic way for pages, etc. This is because those generators follow a specific format for posts and pages.

With NextJS, you are free to do whatever you want. So, you have to write a bit of code to match the condition you set.

IMO, good sitemap generation tools that are part of established formats usually honor all three:

Frontmatter config
Programmable API via the SDK (node module)
Config file exclusion list

From what I see, this package supports the config file exclusions out of the box (see config file options).

As for frontmatter parsing or some other way of triggering a "skip me!!" option on a per-page basis, you can do that in the transformations. Just test for your condition and return null to skip, as the example transformation shows.

fedeci commented 3 years ago

Thanks @eduncan911, I am already doing it in the transformations, however it would be great if it was possible to integrate it directly in the lib. I'll probably fork it and open a PR as soon as possible.

gabrielreisn commented 2 years ago

hey, in case someone is still missing this, the solution I'm currently using within my team is a custom transform function. If that's ok I can open a PR with the fix

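// Assumes `const fs = require('fs')` at the top of next-sitemap.config.js.
transform: async (config, path) => {
  // [^>]* keeps the match inside a single <meta ...> tag, even in minified HTML
  const noIndexRegex = /<meta[^>]*noindex/i
  // Directory where the build writes the pre-rendered pages
  const basePath = '.next/serverless/pages'
  const filePath = `${basePath + path}.html`

  if (fs.existsSync(filePath)) {
    try {
      const data = await fs.promises.readFile(filePath, 'utf8')

      if (data.match(noIndexRegex)) {
        console.log('ignored file:', filePath)

        // Returning null leaves this page out of the sitemap
        return null
      }
    } catch (error) {
      // On a read error, log it and fall through to include the page
      console.error('err', error)
    }
  }

  return {
    loc: path,
    changefreq: config.changefreq,
    priority: config.priority,
    lastmod: config.autoLastmod ? new Date().toISOString() : undefined,
    alternateRefs: config.alternateRefs || [],
  }
},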

rserafim commented 2 years ago

I exclude them like this:

module.exports = {
  siteUrl: 'https://www.xxxx',
  exclude: ['/aaa/', '/xxx', '/yyyy'], // <= exclude here
}

zacharias-pavlatos commented 2 years ago

Yep, but a static exclude list like this is not dynamic... 👎

GautheyValentin commented 2 years ago

The custom transform function from @gabrielreisn above still works, but with "next": "12.2.5" the build output path has changed (maybe in an earlier version, I don't know exactly).

Before

 const basePath = '.next/serverless/pages'

After

 const basePath = '.next/server/pages'
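Since the output directory differs between Next.js versions, one option is to probe both locations instead of hard-coding one. The following is only a sketch under that assumption: `BASE_PATHS`, `hasNoindex`, and `findPrerenderedHtml` are hypothetical names, and the regex is a stricter variant that only matches a `<meta name="robots">` tag containing `noindex`.

```javascript
const fs = require('fs')
const nodePath = require('path')

// Build output directories mentioned in this thread, newest first.
const BASE_PATHS = ['.next/server/pages', '.next/serverless/pages']

// Matches a single <meta> tag that has name="robots" and contains "noindex",
// whatever the attribute order. No `g` flag, so test() keeps no state.
const ROBOTS_NOINDEX = /<meta\b(?=[^>]*\bname=["']robots["'])[^>]*\bnoindex\b[^>]*>/i

function hasNoindex(html) {
  return ROBOTS_NOINDEX.test(html)
}

// Returns the first existing pre-rendered HTML file for a route, or null.
function findPrerenderedHtml(routePath, basePaths = BASE_PATHS) {
  for (const base of basePaths) {
    const candidate = nodePath.join(base, `${routePath}.html`)
    if (fs.existsSync(candidate)) {
      return candidate
    }
  }
  return null
}

// Transform in the shape used earlier in this thread.
async function transform(config, path) {
  const file = findPrerenderedHtml(path)
  if (file && hasNoindex(await fs.promises.readFile(file, 'utf8'))) {
    return null // skip pages marked noindex
  }
  return {
    loc: path,
    changefreq: config.changefreq,
    priority: config.priority,
    lastmod: config.autoLastmod ? new Date().toISOString() : undefined,
    alternateRefs: config.alternateRefs || [],
  }
}
```

Like the original snippet, this only sees pages that were pre-rendered to an `.html` file at build time. The root route `/` is probably written as `index.html` and may need special handling, which is not covered here.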

VicDosya commented 1 month ago

Thank you @GautheyValentin & @gabrielreisn !! this works great.