laurengarcia / url-metadata

NPM module: Request a url and scrape the metadata from its HTML using Node.js or the browser.
https://www.npmjs.com/package/url-metadata
MIT License

Error: redirect count exceeded #60

Closed nmassi closed 1 year ago

nmassi commented 1 year ago

There are issues with Twitter URLs: requests fail with "Error: redirect count exceeded". Is this something we can fix somehow?

laurengarcia commented 1 year ago

I tried fiddling around with the fetch API request options but kept getting the same error. I'm assuming it's due to Elon's crusade to stamp out scraping. Also, note that their robots.txt now explicitly disallows crawlers from scraping most tweets. If you find a workaround, feel free to reopen this issue. (There's a minimal fetch sketch for inspecting the redirects after the robots.txt below.)

https://twitter.com/robots.txt

# Google Search Engine Robot
# ==========================
User-agent: Googlebot

Allow: /*?lang=
Allow: /hashtag/*?src=
Allow: /search?q=%23
Allow: /i/api/
Disallow: /search/realtime
Disallow: /search/users
Disallow: /search/*/grid

Disallow: /*?
Disallow: /*/followers
Disallow: /*/following

Disallow: /account/deactivated
Disallow: /settings/deactivated

Disallow: /[_0-9a-zA-Z]+/status/[0-9]+/likes
Disallow: /[_0-9a-zA-Z]+/status/[0-9]+/retweets
Disallow: /[_0-9a-zA-Z]+/likes
Disallow: /[_0-9a-zA-Z]+/media 
Disallow: /[_0-9a-zA-Z]+/photo

# Every bot that might possibly read and respect this file
# ========================================================
User-agent: *
Disallow: /

# WHAT-4882 - Block indexing of links in notification emails. This applies to all bots.
# =====================================================================================
Disallow: /i/u
Noindex: /i/u

# Wait 1 second between successive requests. See ONBOARD-2698 for details.
Crawl-delay: 1

# Independent of user agent. Links in the sitemap are full URLs using https:// and need to match
# the protocol of the sitemap.
Sitemap: https://twitter.com/sitemap.xml
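
For anyone who wants to watch the redirect loop directly, here's a minimal sketch using Node's built-in fetch (Node 18+) in manual redirect mode; the /status/ path and the User-Agent value are placeholders, not a known workaround:

  // Sketch: inspect the redirect behavior directly with Node 18+ global fetch
  // (run inside an async function or an ESM module for the top-level await).
  // The /status/ path and User-Agent value are placeholders, not a fix.
  const tweetUrl = 'https://twitter.com/anyuser/status/0123456789'

  const res = await fetch(tweetUrl, {
    redirect: 'manual', // report the redirect instead of following it
    headers: { 'User-Agent': 'Mozilla/5.0' }
  })

  console.log(res.status)                  // 301/302 on each hop
  console.log(res.headers.get('location')) // where the next hop points
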
laurengarcia commented 1 year ago

Following up: I can get a response from the https://twitter.com/sitemap.xml page listed at the bottom of the x.com robots.txt file using any User-Agent other than Googlebot; using Googlebot returns a 404 (lol).

This behavior very much seems like a deliberate choice by x.com. I think the reason they're using an endless redirect loop on certain paths (e.g. /status/01234...) is to run up cloud bills for competitors as much as possible. Well played 😂

Try it for yourself; this is the only path I've found that yields a good response:

  const url = "https://twitter.com/sitemap.xml" //<-- works w "User-Agent": "ZZZ", no "From"
  try {
    const metadata = await urlMetadata(url, {
      requestHeaders: {
        'User-Agent': 'ZZZ'
      }
    })
    console.log(metadata)
  } catch (err) {
    console.log(err)
    expect(err).toBe(undefined)
  }
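
And for the Googlebot vs. non-Googlebot difference mentioned above, a quick comparison sketch using plain Node 18+ fetch (results may of course change over time):

  // Sketch: compare how the sitemap responds to different User-Agent values.
  // 'ZZZ' is just the arbitrary value from the snippet above.
  const sitemap = 'https://twitter.com/sitemap.xml'

  for (const ua of ['ZZZ', 'Googlebot']) {
    const res = await fetch(sitemap, { headers: { 'User-Agent': ua } })
    console.log(ua, res.status) // non-Googlebot agents get a normal response; Googlebot gets 404
  }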