Closed nmassi closed 1 year ago
Tried fiddling around with the fetch API request options but kept getting the same result. I'm assuming it's due to Elon's crusade to stamp out scraping. Also note that their robots.txt now explicitly disallows crawlers from scraping most tweets. If you find a workaround, feel free to reopen this issue.
https://twitter.com/robots.txt
# Google Search Engine Robot
# ==========================
User-agent: Googlebot
Allow: /*?lang=
Allow: /hashtag/*?src=
Allow: /search?q=%23
Allow: /i/api/
Disallow: /search/realtime
Disallow: /search/users
Disallow: /search/*/grid
Disallow: /*?
Disallow: /*/followers
Disallow: /*/following
Disallow: /account/deactivated
Disallow: /settings/deactivated
Disallow: /[_0-9a-zA-Z]+/status/[0-9]+/likes
Disallow: /[_0-9a-zA-Z]+/status/[0-9]+/retweets
Disallow: /[_0-9a-zA-Z]+/likes
Disallow: /[_0-9a-zA-Z]+/media
Disallow: /[_0-9a-zA-Z]+/photo
# Every bot that might possibly read and respect this file
# ========================================================
User-agent: *
Disallow: /
# WHAT-4882 - Block indexing of links in notification emails. This applies to all bots.
# =====================================================================================
Disallow: /i/u
Noindex: /i/u
# Wait 1 second between successive requests. See ONBOARD-2698 for details.
Crawl-delay: 1
# Independent of user agent. Links in the sitemap are full URLs using https:// and need to match
# the protocol of the sitemap.
Sitemap: https://twitter.com/sitemap.xml
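For context, robots.txt rules like the Googlebot block above are evaluated with `*` as a wildcard, and when both an Allow and a Disallow pattern match a path, the longest matching pattern wins (Google's precedence rule). A rough sketch of that evaluation — the helper names and the tie-breaking are illustrative, not code from this project:

```javascript
// Convert a robots.txt pattern to a regex: escape regex metacharacters,
// except '*', which robots.txt treats as a wildcard.
function patternToRegex(pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&')
    .replace(/\*/g, '.*')
  return new RegExp('^' + escaped)
}

// rules: array of { type: 'allow' | 'disallow', pattern: string }
// Longest matching pattern wins; ties here keep the first match
// (Google's spec prefers Allow on exact ties).
function isAllowed(path, rules) {
  let best = null
  for (const rule of rules) {
    if (patternToRegex(rule.pattern).test(path)) {
      if (best === null || rule.pattern.length > best.pattern.length) {
        best = rule
      }
    }
  }
  return best === null || best.type === 'allow'
}

// A few of the Googlebot rules quoted above:
const googlebotRules = [
  { type: 'allow', pattern: '/*?lang=' },
  { type: 'allow', pattern: '/search?q=%23' },
  { type: 'disallow', pattern: '/search/users' },
  { type: 'disallow', pattern: '/*?' },
]

console.log(isAllowed('/search/users', googlebotRules))    // false
console.log(isAllowed('/search?q=%23foo', googlebotRules)) // true: the Allow pattern is longer than '/*?'
console.log(isAllowed('/home', googlebotRules))            // true: no rule matches
```

Note that the catch-all `User-agent: *` block simply disallows `/`, so under these rules every path is off-limits to any bot other than Googlebot.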
Following up: I can get a response from the https://twitter.com/sitemap.xml page listed at the bottom of the x.com robots.txt file using any User-Agent other than Googlebot, which, if used, returns a 404 (lol).
This behavior very much seems like a deliberate choice by x.com. I suspect the reason they're using an endless redirect loop on certain paths (e.g. /status/01234...) is to run up cloud bills for competitors as much as possible. Well-played 😂
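If you're probing those paths yourself, one way to avoid burning requests on an endless redirect loop is to follow redirects manually with a hop cap. This is a sketch under assumptions (the function name and cap are mine, not from this issue); the fetch implementation is injected so it works with Node's global fetch or any compatible stub:

```javascript
// Follow redirects by hand (redirect: 'manual') and bail after maxHops,
// so a scraper never spends unbounded requests on one looping URL.
async function fetchWithRedirectCap(url, fetchImpl, maxHops = 5) {
  let current = url
  for (let hop = 0; hop < maxHops; hop++) {
    const res = await fetchImpl(current, { redirect: 'manual' })
    // A 3xx status with a Location header means "go again"; anything else is final.
    const location = res.headers.get && res.headers.get('location')
    if (res.status >= 300 && res.status < 400 && location) {
      current = new URL(location, current).toString()
      continue
    }
    return res
  }
  throw new Error(`redirect loop: gave up after ${maxHops} hops for ${url}`)
}
```

With a cap like this, a looping path costs at most `maxHops` requests instead of running until a timeout or budget alarm.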
Try it for yourself; this is the only path I've found that yields a good response:

const urlMetadata = require('url-metadata') // the url-metadata package

const url = 'https://twitter.com/sitemap.xml' // <-- works with "User-Agent": "ZZZ" and no "From" header
try {
  const metadata = await urlMetadata(url, {
    requestHeaders: {
      'User-Agent': 'ZZZ'
    }
  })
  console.log(metadata)
} catch (err) {
  console.log(err)
  expect(err).toBe(undefined) // fail the test if the request errors
}
There are issues with Twitter URLs. Is this something we can fix somehow?