jshemas / openGraphScraper

Node.js scraper service for Open Graph Info and More!
MIT License

Regression: Fetching `https://facebook.com` yields no image #184

Closed: adarhef closed this issue 1 year ago

adarhef commented 1 year ago

Describe the bug: Regression from 5.2.3. Fetching https://facebook.com on 6.0.1 yields no image; on 5.2.3 it does.

To Reproduce: Fetch the link above with 6.0.1.

Expected behavior: A non-null ogImage array in the result.

jshemas commented 1 year ago

Hello.

Normally I would say this is a proxy/headers issue, since most big sites try to block scrapers. In this case it looks like there is something wrong with the request being made.

If you use the got or node-fetch packages and pass in just the Facebook URL, it sends back a page that looks something like this:

[screenshot: Capture]

But if you use Node's fetch API, await fetch('https://www.facebook.com/'), you get an error page instead:

[screenshot: Capture]

I'm guessing one (or more) of the default options undici uses is causing Facebook to return an error page.

Even something like the following leads to an error page:

  // Browser-like user agent; Facebook still returns the error page with it.
  const userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36';
  const headers = new Headers({
    'user-agent': userAgent,
  });
  const response = await fetch('https://www.facebook.com/', { credentials: 'omit', redirect: 'follow', headers });
  const html = await response.text();
  console.log('html:', html);

jshemas commented 1 year ago

Actually, I had one last idea.

  // Same request, but with the referrer set to the site being requested.
  const response = await fetch('http://www.facebook.com/', { referrer: 'http://www.facebook.com' });
  const html = await response.text();
  console.log('html:', html);

Setting the referrer to the same site you are requesting seems to fix the issue.

So to get this working in OGS, you would do the following:

ogs({ url: 'https://www.facebook.com/', fetchOptions: { referrer: 'https://www.facebook.com' } })

Not sure if I want to do this by default in OGS, but this should unblock your current issue.
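
For anyone who needs the same workaround for more than one URL, here is a minimal sketch of how the options object could be built. The helper name buildOgsOptions is hypothetical and not part of open-graph-scraper; it only assumes the url and fetchOptions.referrer options shown above, and that using the target's own origin as the referrer is what you want.

```javascript
// Hypothetical helper, not part of open-graph-scraper: builds an options
// object whose fetchOptions.referrer is the origin of the target URL.
function buildOgsOptions(url) {
  // The WHATWG URL class derives the origin (scheme + host) from the URL.
  const { origin } = new URL(url);
  return { url, fetchOptions: { referrer: origin } };
}

const options = buildOgsOptions('https://www.facebook.com/');
console.log(options);
```

The resulting object has the same shape as the one passed to ogs() above, so it can be handed to ogs(options) directly.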

adarhef commented 1 year ago

I actually haven't checked the final request that was sent in this case. I had to revert to 5.2.3, and it'll be a while before I can experiment again. Does got send a referer at all? If so, what was it? Maybe it was hardcoded to something. What if I set fetch's referrer to something else, such as the website I'm actually referring from, instead of Facebook? I imagine other websites might dislike the fetch API for similar reasons, but I haven't done extensive testing.

jshemas commented 1 year ago

Hello, this should be fixed in open-graph-scraper@6.1.0.

Short answer: It looks like fetch always sets the sec-fetch-mode header, and there doesn't seem to be a way to remove it. Facebook errors out when this header is set and the referrer/origin header is null, so for now I'm going to default the origin header to the request URL. Users can override this header if needed.
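
The defaulting described here could look roughly like the sketch below. This is an illustration only, not the actual code in open-graph-scraper@6.1.0: the function name withDefaultOrigin is hypothetical, and it assumes user-supplied headers arrive as a plain object.

```javascript
// Hypothetical sketch, not the library's actual implementation: default the
// origin header to the request URL, but let caller-supplied headers win.
function withDefaultOrigin(url, userHeaders = {}) {
  const headers = { origin: url };
  for (const [name, value] of Object.entries(userHeaders)) {
    // Header names are case-insensitive, so normalise them to lowercase
    // to make sure 'Origin' and 'origin' refer to the same entry.
    headers[name.toLowerCase()] = value;
  }
  return headers;
}

console.log(withDefaultOrigin('https://www.facebook.com/'));
console.log(withDefaultOrigin('https://www.facebook.com/', { Origin: 'https://example.com' }));
```

The first call keeps the default origin; the second shows a caller overriding it.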

adarhef commented 1 year ago

Sounds great! Thank you!