vercel / next.js

The React Framework
https://nextjs.org
MIT License

OG Image tags don't get scraped by FB for dynamic routes (appdir) #44470

Open saarnav890 opened 1 year ago

saarnav890 commented 1 year ago

Verify canary release

Provide environment information

Operating System:
  Platform: darwin
  Arch: arm64
  Version: Darwin Kernel Version 22.2.0: Fri Nov 11 02:03:51 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T6000
Binaries:
  Node: 16.14.0
  npm: 8.3.1
  Yarn: 1.22.19
  pnpm: N/A
Relevant packages:
  next: 13.1.1-canary.1
  eslint-config-next: N/A
  react: 18.2.0
  react-dom: 18.2.0

Which area(s) of Next.js are affected? (leave empty if unsure)

Head component/file (next/head / head.js)

Link to the code that reproduces this issue

https://github.com/saarnav890/ogImageIssue

To Reproduce

To reproduce, try to run the FB sharing debugger, https://developers.facebook.com/tools/debug/?q=https%3A%2F%2Fog-image-issue.vercel.app%2F. (results in a good image)

Then, run the same debugger but with anything else as the /[slug], for instance, https://og-image-issue.vercel.app/something.

https://developers.facebook.com/tools/debug/?q=https%3A%2F%2Fog-image-issue.vercel.app%2Fsomething (results in a 500 internal server error)

Additionally, according to https://developers.facebook.com/docs/sharing/webmasters/crawler/ you can run

curl -v --compressed -H "Range: bytes=0-524288" -H "Connection: close" -A "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)" https://og-image-issue.vercel.app/

to get the proper response,

but running

curl -v --compressed -H "Range: bytes=0-524288" -H "Connection: close" -A "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)" https://og-image-issue.vercel.app/something

results in a 500 internal server error.

However, if you just remove the range header, the dynamic content works:

curl -v --compressed -H "Connection: close" -A "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)" https://og-image-issue.vercel.app/something
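The two curl invocations above can also be reproduced in script form with Node 18+'s built-in `fetch` (a sketch; the URL is the reporter's deployment, and `fetchAsCrawler` is a name chosen here, not an existing API — note curl's `-H "Connection: close"` is omitted because `fetch` manages the connection itself):

```typescript
// Headers that mimic the Facebook crawler request from the curl examples above.
const crawlerHeaders: Record<string, string> = {
  "Range": "bytes=0-524288",
  "User-Agent":
    "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
};

// Fetch a URL the way the crawler would and resolve with the status code.
async function fetchAsCrawler(url: string): Promise<number> {
  const res = await fetch(url, { headers: crawlerHeaders });
  return res.status;
}

// Example (requires network access):
// fetchAsCrawler("https://og-image-issue.vercel.app/something").then(console.log);
```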

Describe the Bug

The FB crawler works perfectly fine when crawling statically generated pages. However, when trying to crawl dynamically generated pages, it gives a 500 internal server error.

This is the expected response for the index page:

[screenshot omitted]

This is the response for anything dynamically rendered:

[screenshot omitted]

Expected Behavior

The crawler should return the same image for both the index page and any slug page, but it does not. When I try this with other meta tag simulators, such as https://en.rakko.tools/tools/9/, both pages work perfectly fine. Because of this, I think it has something to do with the Range header.

Which browser are you using? (if relevant)

Version 108.0.5359.124 (Official Build) (arm64)

How are you deploying your application? (if relevant)

Vercel

saarnav890 commented 1 year ago

Okay, so my initial suspicion that the Range header was the problem was correct. I wrote some middleware to ignore the Range header, and scraping for the dynamic routes works great now.

If anyone else has this same problem, add a middleware.ts file to the root of the project; the code below simply deletes the Range header. If you want to do something more complicated, see https://nextjs.org/docs/advanced-features/middleware for more on middleware.

My code to ignore the range header:

import { NextRequest, NextResponse } from 'next/server';

export default function middleware(request: NextRequest) {
  // Copy the incoming headers and drop Range so the rest of the
  // rendering pipeline never sees it.
  const headers = new Headers(request.headers);
  headers.delete('Range');
  return NextResponse.next({ request: { headers } });
}

Edit: This made all my routes noticeably slower than running no middleware at all, so the workaround for this workaround is a simple if check that only rewrites the headers when the request actually carries a Range header in the first place.

import { NextRequest, NextResponse } from "next/server";

export default function middleware(request: NextRequest) {
  if (request.headers.has("Range")) {
    const headers = new Headers(request.headers);
    headers.delete("Range");
    const responseWithoutRange = NextResponse.next({ request: { headers } });
    return responseWithoutRange;
  }
}

This is a workaround for now, but hopefully it gets officially fixed soon :)
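The header-stripping step above can be illustrated framework-free: the `Headers` class the middleware uses is the standard Web API one (built into Node 18+), so the core logic boils down to the following sketch (`stripRangeHeader` is a name chosen here for illustration, not a Next.js API):

```typescript
// Return a copy of the incoming headers with any Range header removed.
// Header names are case-insensitive per the Fetch spec, so "Range" also
// matches "range". The original Headers object is left untouched.
function stripRangeHeader(incoming: Headers): Headers {
  const headers = new Headers(incoming);
  if (headers.has("Range")) {
    headers.delete("Range");
  }
  return headers;
}

// Usage:
const original = new Headers({ Range: "bytes=0-524288", Accept: "text/html" });
const cleaned = stripRangeHeader(original);
// cleaned.has("Range") → false; original still has its Range header.
```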

GabenGar commented 1 year ago

This is more of a Facebook crawler issue than a Next.js one, especially since the other tool worked just fine. Moreover, if you look at the source of the example, the OG tag is there, which means Next.js is doing its streaming as browsers expect. For this reason I suspect Next.js will not be able to "fix" it without introducing a Facebook-specific check for its crawler, which may or may not bode well, since crawlers generally don't like being fed special versions of pages.

As for the slowness: by disabling the Range header you essentially disable response streaming, the whole point of the app directory structure. Next.js probably falls back to fully rendering the page in its absence, hence the perceived slowdown.

Just for reference: when you talk about the "FB debugger", you are only using it as an example because the actual production setup doesn't show OG previews on Facebook, right?

Culturalist commented 11 months ago

Is there any way to detect that a request was made by a FB crawler and delete the Range header only for those?
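One way to build such a check (a sketch; Facebook documents `facebookexternalhit` and `Facebot` as its crawler user agents, but the list may not be exhaustive) is a small predicate on the User-Agent header:

```typescript
// True if the User-Agent string looks like one of Facebook's documented
// crawler agents. Matching is case-insensitive; a missing header never matches.
function isFacebookCrawler(userAgent: string | null): boolean {
  if (!userAgent) return false;
  const ua = userAgent.toLowerCase();
  return ua.includes("facebookexternalhit") || ua.includes("facebot");
}

// Examples:
// isFacebookCrawler("facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)") → true
// isFacebookCrawler("Mozilla/5.0 ...") → false
```

In middleware this would gate the `headers.delete("Range")` call, so ordinary browser requests keep streaming untouched.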

Culturalist commented 11 months ago

I tried to figure out what the heck is going on, but failed. I am trying to get correct Open Graph previews for inner pages of the website, but the FB scraper gets totally messed-up meta tags, while Telegram, for example, processes the links just fine. Take this link: https://www.culturaweek.fi/fi/tapahtumat/konferenssi See the metadata, then check what FB is getting: https://developers.facebook.com/tools/debug/echo/?q=https%3A%2F%2Fwww.culturaweek.fi%2Ffi%2Ftapahtumat%2Fkonferenssi%2F

mikkeldamm commented 8 months ago

I experienced the same issue when Facebook was crawling the links on my site. I tried removing the "Range" header and it seemed to work, but, as @Culturalist suggested, to avoid removing the header for all requests I added a check for whether the request comes from the Facebook crawler.

I'm not totally sure this check will catch every case of the Facebook crawler, but in my testing it works.

Here is the code I added to my middleware:

import { NextRequest, NextResponse } from "next/server";

export default function middleware(req: NextRequest) {
  const headers = new Headers(req.headers);
  // Only strip Range when the request looks like the Facebook crawler.
  if (
    req.headers.get("User-Agent")?.includes("facebookexternalhit") &&
    req.headers.has("Range")
  ) {
    headers.delete("Range");
  }
  // ...
  return NextResponse.next({
    request: { headers },
  });
}