vercel / next.js

The React Framework
https://nextjs.org
MIT License

initialCanonicalUrl is not taking into account basePath from config #59971

Closed zlwaterfield closed 5 months ago

zlwaterfield commented 11 months ago

Link to the code that reproduces this issue

https://github.com/zlwaterfield/initial-canonical-url-bug

To Reproduce

  1. Create a Next.js app that uses a basePath (for example '/g') and the App Router (or use the example repo).
  2. Create and load a page in the application (/path/testing-page in the example repo).
  3. Inspect the page and find the script containing self.__next_f.push(....
  4. Find the initialCanonicalUrl; it will be missing the base path (/path in the example repo).
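
Steps 3 and 4 can be checked programmatically. The following is a minimal sketch, not part of the example repo: `findInitialCanonicalUrl` and `includesBasePath` are hypothetical helpers, the sample markup is hand-written rather than copied from a real build, and the regular expression assumes the field is serialized as `"initialCanonicalUrl":"/..."` with quotes that may be backslash-escaped.

```typescript
// Assumption: the payload serializes the field as
// "initialCanonicalUrl":"/some/path", with quotes possibly backslash-escaped
// because the payload sits inside a JavaScript string literal.
const CANONICAL_PATTERN = /\\?"initialCanonicalUrl\\?":\\?"([^"\\]*)/;

// Hypothetical helper: return the first initialCanonicalUrl found in the
// page HTML, or null when the field is absent.
function findInitialCanonicalUrl(html: string): string | null {
  const match = CANONICAL_PATTERN.exec(html);
  return match?.[1] ?? null;
}

// Hypothetical helper: true when the URL is the base path itself or lies under it.
function includesBasePath(url: string, basePath: string): boolean {
  return url === basePath || url.startsWith(`${basePath}/`);
}

// Hand-written sample imitating the markup described in the steps above.
const sampleHtml =
  '<script>self.__next_f.push([1,"{\\"initialCanonicalUrl\\":\\"/testing-page\\"}"])</script>';

const found = findInitialCanonicalUrl(sampleHtml);
if (found !== null && !includesBasePath(found, '/path')) {
  console.log(`initialCanonicalUrl ${found} is missing basePath /path`);
}
```

In practice the HTML would come from fetching the deployed page; the sample string only stands in for that response.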

Also see https://github.com/vercel/next.js/issues/53274 for more information.

Current vs. Expected behavior

Current: the initialCanonicalUrl omits the basePath.

Expected: the initialCanonicalUrl should include the basePath.

Verify canary release

Provide environment information

Operating System:
  Platform: darwin
  Arch: arm64
  Version: Darwin Kernel Version 23.1.0: Mon Oct  9 21:27:24 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6000
Binaries:
  Node: 18.17.0
  npm: 9.6.7
  Yarn: 1.22.19
  pnpm: 8.12.1
Relevant Packages:
  next: 14.0.4
  eslint-config-next: N/A
  react: 18.2.0
  react-dom: 18.2.0
  typescript: N/A
Next.js Config:
  output: N/A

Which area(s) are affected? (Select all that apply)

App Router, Metadata (metadata, generateMetadata, next/head), Script optimization (next/script)

Additional context

This is causing SEO issues because crawlers see the URL and treat it as a valid link. From what I gather, there is currently no way to set it properly. We are getting 404s from this in the Google Search Console.

My only idea for a fix right now is to rewrite the URL in our Cloudflare worker until a fix ships in Next.js.
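
A rough sketch of what such a rewrite could look like, as a pure string transform that a worker might apply to the HTML body. `prefixInitialCanonicalUrl` is a hypothetical helper, and the regular expression assumes the field appears as `"initialCanonicalUrl":"/..."` with quotes that may be backslash-escaped. This has not been tested against a real Next.js build, and changing a value that the client router reads may have side effects.

```typescript
// Assumption: the field is serialized as "initialCanonicalUrl":"/some/path",
// with quotes possibly backslash-escaped inside a JavaScript string literal.
const CANONICAL_FIELD = /(\\?"initialCanonicalUrl\\?":\\?")(\/[^"\\]*)/g;

// Hypothetical helper: prepend basePath to every initialCanonicalUrl that
// does not already start with it. Other parts of the HTML are left untouched.
function prefixInitialCanonicalUrl(html: string, basePath: string): string {
  return html.replace(
    CANONICAL_FIELD,
    (match: string, field: string, url: string): string => {
      const alreadyPrefixed = url === basePath || url.startsWith(`${basePath}/`);
      return alreadyPrefixed ? match : `${field}${basePath}${url}`;
    },
  );
}

// Hand-written sample payload; '/g' is the example basePath from the steps above.
const before = 'self.__next_f.push([1,"{\\"initialCanonicalUrl\\":\\"/testing-page\\"}"])';
const after = prefixInitialCanonicalUrl(before, '/g');
console.log(after);
```

The check for an existing prefix makes the transform safe to apply more than once to the same response.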

omerman commented 5 months ago

Why is this not addressed? :/ I have tons of 404 URLs in my Search Console because of it 😓

huozhi commented 5 months ago

Google won't pick up the initialCanonicalUrl in the HTML response for SEO; that value is only internal state. The canonical URL should be configured through the Metadata API via alternates.canonical (https://nextjs.org/docs/app/api-reference/functions/generate-metadata), so that Google can pick it up properly.
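
As a minimal sketch of that suggestion: the object below has the shape of a page's `metadata` export with `alternates.canonical` set. The local `CanonicalMetadata` type stands in for `Metadata` from `next` so the snippet is self-contained, and `BASE_PATH` and `withBasePath` are hypothetical names. The thread does not confirm whether Next.js prefixes basePath to the canonical URL on its own, so the sketch writes it explicitly; check the rendered `<link rel="canonical">` before relying on it.

```typescript
// Local stand-in for the part of the Next.js `Metadata` type used here.
// In a real app: import type { Metadata } from 'next'.
type CanonicalMetadata = {
  alternates: { canonical: string };
};

// Hypothetical constant; it should mirror `basePath` in next.config.js.
const BASE_PATH = '/g';

// Hypothetical helper: join the base path and a route with exactly one slash.
function withBasePath(route: string, basePath: string = BASE_PATH): string {
  const base = basePath.replace(/\/+$/, '');
  const path = route.startsWith('/') ? route : `/${route}`;
  return `${base}${path}`;
}

// In app/path/testing-page/page.tsx this object would be exported as `metadata`.
const metadata: CanonicalMetadata = {
  alternates: { canonical: withBasePath('/path/testing-page') },
};

console.log(metadata.alternates.canonical);
```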

c0b41 commented 5 months ago

@huozhi can you stop closing the issue? Everyone has the same problem: the crawler is picking up everything inside self.__next_f, and I have so many 404 URLs.

For reference: #53274, #40143, #41433

huozhi commented 5 months ago

@c0b41 If the assumption is that the Google crawler reads that content and parses it as a canonical URL, I'd expect a much wider impact. It could also be Search Console having issues with a specific app. #40143 contains only screenshots, so there is nothing available to investigate.

omerman commented 5 months ago

I'd be happy to set up a Google Meet and show you in my own Search Console how thousands of URLs are reported as 404 by Google because of initialCanonicalUrl. Moreover, in another project I had to go over the 40k static pages I have and add a script that modifies this variable so that Google won't complain about it 🤷‍♂️

arun-kambhammettu commented 5 months ago

This shouldn't be closed. We are seeing 404s and 308s; Google is picking up initialCanonicalUrl.

huozhi commented 5 months ago

I wonder if it's related to this fix (#67135): when you have a static not-found page that is missing noindex, Google still indexes it even though it should be ignored.

omerman commented 5 months ago

@huozhi To be honest, I don't think so. My site is not statically generated, and I see a noindex tag in the 404 pages. My site's version does not include the #67135 fix yet.

I think it's much simpler than that. I think Google simply inspects the content of the page (just like a plain view-source), recognizes values that match the pattern of links (e.g. they contain slashes), and treats them as "links" coming from the page. That's my theory. I think so because whenever I have a page in a folder like [...paths], the paths variable, which is also embedded in the inline content of the page, is also treated as links by Google. See #40143.

github-actions[bot] commented 4 months ago

This closed issue has been automatically locked because it had no new activity for 2 weeks. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.