jshemas / openGraphScraper

Node.js scraper service for Open Graph Info and More!
MIT License

Page not found #101

Closed: jplonina closed this issue 3 years ago

jplonina commented 3 years ago

Hi. Thank you for your work on this library. I am really happy to use it.

Recently I ran into an issue while trying to get data from https://cookpad.com/ru/recipes/7345377-piechienie-banan-s-tvoroghom.

I got the error:

{
  error: true,
  result: {
    success: false,
    requestUrl: 'https://cookpad.com/ru/recipes/7345377-piechienie-banan-s-tvoroghom',
    error: 'Page not found',
    errorDetails: Error: Page not found
        at setOptionsAndReturnOpenGraphResults (.../node_modules/open-graph-scraper/lib/openGraphScraper.js:174:13)
        at processTicksAndRejections (internal/process/task_queues.js:85:5)
  }
}

I am using OGS 4.4.0, and the options for the request are:

    headers: {
      'user-agent':
        'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)'
    },
    timeout: 10000,
    ogImageFallback: false,
    onlyGetOpenGraphInfo: false

Do you have an idea why this might happen?
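For readers following along, here is a minimal sketch of a call with the options above that reduces the failure object to a readable message. It assumes the promise-style call in OGS 4.x rejects with the `{ error, result }` object shown in the report. `scrapeWithOptions` and `summarizeFailure` are hypothetical names, and `ogs` is passed in as a parameter so the snippet does not depend on how the package is loaded.

```javascript
// Request options copied from the report above.
const requestOptions = {
  headers: {
    'user-agent':
      'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)',
  },
  timeout: 10000,
  ogImageFallback: false,
  onlyGetOpenGraphInfo: false,
};

// Hypothetical helper: turn a failure object into a one-line message.
function summarizeFailure(failure) {
  const result = (failure && failure.result) || {};
  const url = result.requestUrl || 'unknown URL';
  const reason = result.error || 'unknown error';
  return `OGS failed for ${url}: ${reason}`;
}

// Hypothetical wrapper. `ogs` is the function exported by open-graph-scraper
// (assumption: it returns a promise that rejects with { error: true, result }).
async function scrapeWithOptions(ogs, url) {
  try {
    const { result } = await ogs({ url, ...requestOptions });
    return { ok: true, result };
  } catch (failure) {
    return { ok: false, message: summarizeFailure(failure) };
  }
}
```

With the real package this would be called as `scrapeWithOptions(require('open-graph-scraper'), url)`.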

jshemas commented 3 years ago

Hello,

It looks like cookpad.com is blocking that user-agent: it returns a Response code 403 (Forbidden) error.

The request works without that user-agent.

jplonina commented 3 years ago

I removed the headers and got the same result: Response code 403 (Forbidden) for the URL https://cookpad.com/ru/recipes/7345377-piechienie-banan-s-tvoroghom.

I also tried to get data for https://cookpad.com/ru and got the same error.

Interestingly, it worked correctly for https://cookpad.com:

  error: false,
  result: {
    ogUrl: 'https://cookpad.com/',
    ogTitle: 'レシピ検索No.1/料理レシピ載せるなら クックパッド',
    ogType: 'website',
    twitterCard: 'summary',
    twitterSite: '@cookpad_pr',
    twitterUrl: 'https://cookpad.com/',
    twitterTitle: 'レシピ検索No.1/料理レシピ載せるなら クックパッド',
    ogImage: {
      url: 'https://cookpad.com/assets/logos/og_image_1200x630.png',
      width: '1200',
      height: '630',
      type: 'png'
    },
    twitterImage: {
      url: 'https://cookpad.com/assets/logos/twitter_image_560x300.png',
      width: '560',
      height: '300',
      alt: null
    },
    ogDescription: '日本最大の料理レシピサービス。335万品を超えるレシピ、作り方を検索できる。家庭の主婦の作った簡単実用レシピが多い。利用者は5400万人。自分のレシピを公開できる。',
    charset: 'utf8',
    requestUrl: 'https://cookpad.com',
    success: true
  },
...

jshemas commented 3 years ago

On my end, hitting https://cookpad.com/ru/recipes/7345377-piechienie-banan-s-tvoroghom returns:

result: {
  ogSiteName: 'Cookpad',
  ogTitle: 'Печенье банан с творогом - пошаговый рецепт с фото. Автор рецепта Айсель .',
  ogDescription: 'Печенье банан с творогом - пошаговый рецепт с фото. ',
  ogUrl: 'https://cookpad.com/ru/recipes/7345377-piechienie-banan-s-tvoroghom',
  twitterCard: 'summary_large_image',
  twitterTitle: 'Печенье банан с творогом',
  ogImage: {
    url: 'https://img-global.cpcdn.com/recipes/a995f16516df520f/1200x630cq70/photo.jpg',
    width: '1200',
    height: '630',
    type: 'jpg'
  },
  twitterImage: {
    url: 'https://img-global.cpcdn.com/recipes/a995f16516df520f/1200x630cq70/photo.jpg',
    width: null,
    height: null,
    alt: null
  },
  ogLocale: 'ru',
  ogDate: '2019-02-26T06:56:06Z',
  charset: 'utf8',
  requestUrl: 'https://cookpad.com/ru/recipes/7345377-piechienie-banan-s-tvoroghom',
  success: true
}

You might have to use a proxy for the request. You could make the HTTP request yourself on your own app server, through a proxy, and then just pass the resulting HTML into ogs.
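A rough sketch of that approach, assuming your OGS version accepts an `html` option that takes a raw HTML string. `fetchHtml` and `scrapeFromOwnRequest` are hypothetical names. The fetcher below uses Node's built-in `https` module with no proxy configured and does not follow redirects, so in practice it would be swapped for whatever proxy-aware HTTP client your server already uses.

```javascript
const https = require('https');

// Fetch a page body with Node's built-in https module.
// No proxy is set up here and redirects are not followed; any non-200
// response is treated as a failure.
function fetchHtml(url, headers = {}) {
  return new Promise((resolve, reject) => {
    https
      .get(url, { headers }, (res) => {
        if (res.statusCode !== 200) {
          res.resume(); // discard the body so the socket is released
          reject(new Error(`Response code ${res.statusCode} for ${url}`));
          return;
        }
        res.setEncoding('utf8');
        let body = '';
        res.on('data', (chunk) => {
          body += chunk;
        });
        res.on('end', () => resolve(body));
      })
      .on('error', reject);
  });
}

// Hypothetical helper: download the page yourself, then hand only the HTML
// to ogs. `ogs` is the function exported by open-graph-scraper, and `fetcher`
// can be replaced with a proxy-aware implementation.
async function scrapeFromOwnRequest(ogs, url, fetcher = fetchHtml) {
  const html = await fetcher(url);
  return ogs({ html });
}
```

Passing the fetcher as a parameter keeps the proxy setup out of the scraping code, so either half can be changed independently.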

jplonina commented 3 years ago

Thanks a lot for your time, it helped!

armandolio commented 3 years ago

Hi @jshemas can you provide an example of how to implement the proxy solution?

I have the same problem trying to reach https://twitter.com/Austen

Thanks!

jshemas commented 3 years ago

@armandolio can you open a new issue? You can try setting a user-agent, as in this issue: https://github.com/jshemas/openGraphScraper/issues/61
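The linked issue is not reproduced here. As a general illustration only, a user-agent can be set through the same `headers` option used earlier in this thread. `withUserAgent` is a hypothetical helper, and the user-agent string is an arbitrary browser-like value, not one taken from that issue.

```javascript
// Arbitrary browser-like user-agent string, chosen for illustration only.
const EXAMPLE_USER_AGENT =
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36';

// Hypothetical helper: return a copy of the options with a lowercase
// 'user-agent' header set, keeping any other headers the caller supplied.
function withUserAgent(options, userAgent = EXAMPLE_USER_AGENT) {
  return {
    ...options,
    headers: { ...(options.headers || {}), 'user-agent': userAgent },
  };
}
```

The resulting object would then be passed to the library, for example `ogs(withUserAgent({ url: 'https://twitter.com/Austen' }))`. Whether a given site accepts a particular user-agent is up to that site.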