Design and develop iteration one MVP

Imported from @nicholaschiang's original Linear issue TS-12.

From the README:

How it works

High level

Tweetscape uses hive.one to determine who are the most reputable (i.e. the "smartest") people in a specific field (e.g. who are the experts in ETH, BTC, NFTs, or Tesla) on Twitter; hive.one acts as a reputation layer for the internet, determining who you can trust through a weighted graph of who follows who (e.g. a reputable user following another user raises that other user's "attention score" by more than if some random Joe follows them).

Tweetscape then uses Twitter's API and that list of "smartest" people to get links to the articles most abundantly (and most recently) shared by the "smartest" people on Twitter for a given topic (e.g. ETH, BTC, NFTs, or Tesla). It also shows you the conversation around each link; you get to see the best links and what the smartest people are saying about them.

Low level

Tweetscape is a full-stack React application built with Remix and React Router.

Every 24 hours, when a user visits tweetscape.com, we:

Fetch the top influencers from hive.one (using an ETag to de-dupe requests):

GET https://api.hive.one/v1/influencers/top

Fetch the top 50 links that were most abundantly (and most recently) shared by those influencers on Twitter:

TODO: Figure out the best, most performant way to do this. Perhaps I'll set up a webhook or use some type of persistent storage to only query for changes (i.e. new Tweets) shared by our top influencers.

Server-side render that list of links (and their corresponding conversations) to send to the client—you.
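For illustration, the ETag de-duplication in the first step could look roughly like the sketch below. It is not the actual Tweetscape code: `fetchWithEtag`, `Cached`, `ResponseLike`, and `FetchLike` are hypothetical names, the fetch function is injected so the helper stays runtime-agnostic, and nothing is assumed about hive.one's response format or auth headers.

```typescript
// Hypothetical shapes for illustration; hive.one's real response format is not assumed.
type ResponseLike = {
  status: number;
  headers: { get(name: string): string | null };
  json(): Promise<unknown>;
};
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<ResponseLike>;
type Cached<T> = { etag: string | null; body: T };

// Sends `If-None-Match` when we already hold an ETag. On `304 Not Modified`
// the cached body is reused, so nothing is downloaded or parsed again.
async function fetchWithEtag<T>(
  url: string,
  cached: Cached<T> | undefined,
  fetchImpl: FetchLike,
): Promise<Cached<T>> {
  const headers: Record<string, string> = {};
  if (cached && cached.etag) headers['If-None-Match'] = cached.etag;
  const res = await fetchImpl(url, { headers });
  if (res.status === 304 && cached) return cached;
  if (res.status < 200 || res.status >= 300) {
    throw new Error(`Request to ${url} failed with status ${res.status}`);
  }
  return { etag: res.headers.get('etag'), body: (await res.json()) as T };
}
```

In a Worker or in Node 18+, the global `fetch` could be passed as `fetchImpl`, with the previous result stored wherever the rest of the cached data lives.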

The aforementioned fetched data and generated HTML are both cached at the edge with Redis and SWR, respectively. We actually run the application at the edge too with Fly.io. One of our goals with Tweetscape is to save you time—primarily by rescuing you from Twitter's arbitrary wormhole of a feed—but also by optimizing our app to run even faster than Twitter, saving you milliseconds that you can then spend learning about the wisdom age 😎.

From @nicholaschiang on Monday, 2/28/22, 1:09 PM PST:

Iteration two:

Fetch all 12989 influencers from hive.one in batches of 100.
For each influencer, fetch all 3200 tweets from timeline in batches of 100.
From each tweet, construct an articles database ranking the top articles by attention score.
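The batching in iteration two could be sketched as a generic cursor loop. This is a sketch under assumptions: `fetchAll`, `Page`, and `FetchPage` are hypothetical names, and the page fetcher is injected rather than calling Twitter or Hive directly. With Twitter's v2 timeline endpoint, the cursor would correspond to `pagination_token` / `meta.next_token` and the batch size to `max_results`.

```typescript
// Hypothetical page shape: one batch of items plus an optional cursor for the next batch.
type Page<T> = { data: T[]; nextToken?: string };
type FetchPage<T> = (token: string | undefined, limit: number) => Promise<Page<T>>;

// Follows the cursor until it runs out, a page comes back empty,
// or `max` items (e.g. the 3200-tweet timeline cap) have been collected.
async function fetchAll<T>(
  fetchPage: FetchPage<T>,
  max = 3200,
  batchSize = 100,
): Promise<T[]> {
  const items: T[] = [];
  let token: string | undefined;
  do {
    const page = await fetchPage(token, batchSize);
    if (page.data.length === 0) break; // guard against an endless empty-page loop
    items.push(...page.data);
    token = page.nextToken;
  } while (token && items.length < max);
  return items.slice(0, max);
}
```

Running this for every influencer means up to 32 requests per timeline, which is why the subrequest limits mentioned below become the bottleneck.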

Trying to run all of that with Cloudflare Workers probably isn't the best idea. I tried building it, and got something that kinda works, but working within Cloudflare's serverless constraints (e.g. max 50 subrequests, including cache calls, per worker invocation, max 16 recursive calls per worker invocation, etc.) greatly hinders my ability to generate a complete truth dataset that accounts for every tweet (accessible from Twitter's API) from every influencer (from Hive's API). Thus, it's impossible to say if this whole Tweetscape algorithm would even work to surface useful "insider" information and articles. Currently, it kinda works for Tesla, but is quite low quality everywhere else (and even the links in the Tesla topic are greatly skewed towards a few reporting companies that write solely about Tesla).

I think iteration three would look something like:

Construct central source of truth database (in a distributed PostgreSQL database on Fly.io or using Cloudflare Durable Objects).
Update that central source of truth after anyone visits tweetscape.co (or every 24 hours with a CRON trigger).
Fetch all that Twitter and Hive data from a CRON job running on some CI provider (or EC2) that has ample resources and time to fetch as much data as we can, so we can construct the most holistic, complete database of articles possible.

From @nicholaschiang on Sunday, 2/27/22, 12:22 AM PST:

Update:

Deployed to custom domain (tweetscape.co)
Iteration one MVP algorithm:
- Fetches top 100 influencers from hive.one
- Fetches top 100 most relevant tweets:
  - Use the topic (e.g. tesla or ethereum) as a search keyword
  - Only get tweets with links (i.e. has:links)
  - Only get tweets from the top influencers (e.g. from:elonmusk)
    - Add as many influencers as possible until 512 char limit reached
  - TODO: Using Twitter's Search API has many inherent constraints (e.g. 512 char query limit); perhaps I should instead use the Timeline API or webhooks or something similar
- Filters those tweets to only get those with article links (i.e. URLs that aren't twitter.com)
- Ranks links by the sum of their tweets' authors' attention scores
  - Include referenced tweets (e.g. if @elonmusk retweets something @tesla posted, we add both @elonmusk's attention score and @tesla's attention score to the posted link's score sum)
    - Unless the referenced tweet's original author isn't one of the top 100 influencers we fetched in step one from hive.one (and thus we don't know the original author's attention score)
    - TODO: Perhaps fetch the referenced tweet's original author's influencer profile from hive.one if it wasn't included in the top 100 fetched earlier
    - TODO: Perhaps investigate whether it would be possible to traverse the entire reference tree (e.g. if a tweet is a retweet of a quote of a retweet of an original tweet)
- Fetches the HTML for each link (in parallel; with a 5s timeout) and uses Cloudflare's HTMLRewriter to parse the article title and description meta tags
  - TODO: Abort fetch as soon as title and description are found; meta tags are always in the document <head> which is at the top, so there's no need to download the entire thing
  - TODO: Find a way to scrape the title and description of a site protected by Cloudflare (e.g. a number of the dogecoin articles are from change.org which uses Cloudflare DDoS)
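
The query building and ranking steps above can be sketched as two small pure functions. This is a rough sketch, not the deployed code: `buildQuery`, `rankLinks`, and the `Tweet` shape are hypothetical, attention scores are assumed to arrive as a map keyed by author ID, and authors missing from that map (i.e. outside the fetched top 100) are assumed to contribute 0.

```typescript
// Hypothetical tweet shape; the real Twitter payload is richer than this.
type Tweet = { authorId: string; urls: string[]; referencedAuthorIds?: string[] };

// Adds `from:` clauses until the next one would push the query past the limit.
function buildQuery(topic: string, usernames: string[], maxLength = 512): string {
  const prefix = `${topic} has:links (`;
  const parts: string[] = [];
  for (const username of usernames) {
    const candidate = prefix + [...parts, `from:${username}`].join(' OR ') + ')';
    if (candidate.length > maxLength) break;
    parts.push(`from:${username}`);
  }
  return parts.length ? prefix + parts.join(' OR ') + ')' : `${topic} has:links`;
}

// Matches twitter.com and its subdomains, which don't count as article links.
const TWITTER_URL = /^https?:\/\/(?:[^/]+\.)?twitter\.com(?:[/:?#]|$)/i;

// Scores each link by summing the attention scores of every tweet author
// (plus referenced authors) who shared it, then sorts highest first.
function rankLinks(
  tweets: Tweet[],
  scores: Map<string, number>,
): { url: string; score: number }[] {
  const totals = new Map<string, number>();
  for (const tweet of tweets) {
    const authors = [tweet.authorId, ...(tweet.referencedAuthorIds ?? [])];
    const tweetScore = authors.reduce((sum, id) => sum + (scores.get(id) ?? 0), 0);
    for (const url of tweet.urls) {
      if (TWITTER_URL.test(url)) continue;
      totals.set(url, (totals.get(url) ?? 0) + tweetScore);
    }
  }
  return Array.from(totals, ([url, score]) => ({ url, score })).sort(
    (a, b) => b.score - a.score,
  );
}
```

For example, `buildQuery('tesla', ['elonmusk', 'tesla'])` yields `tesla has:links (from:elonmusk OR from:tesla)`. Because the limit cuts the influencer list short, only a fraction of the top 100 fits into a single search query, which is part of why the Timeline API looks attractive.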

The aforementioned algorithm works OK for tesla, but isn't as effective for other topics.

From @nicholaschiang on Friday, 2/25/22, 11:27 PM PST:

See: https://tweetscape.nicholaschiang.workers.dev/tesla for WIP

From @nicholaschiang on Friday, 2/25/22, 11:27 PM PST:

Working fine right now; I'll need access to our custom domain name to further optimize performance (which is currently quite slow as I'm fetching data from both Hive and Twitter on each page load).

To optimize performance on Remix + Cloudflare:

Set up custom domain name (possibly also pay for their Pro plan as image optimization is nice).
Store Twitter, Hive, and article website HTML in Cloudflare cache every 24 hrs.
TODO: Figure out a way to revalidate that cache (i.e. refetch everything directly) after the initial render has been sent to the client (i.e. recreate incremental static generation but on Remix and Cloudflare).
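
One possible shape for that revalidation TODO is a stale-while-revalidate wrapper: serve whatever is cached right away and refresh it in the background. The sketch below is an assumption-heavy illustration, not a Remix or Cloudflare API: `createSwrCache` is a hypothetical helper, an in-memory variable stands in for the Cloudflare cache, and the `background` callback stands in for something like a Worker's `waitUntil`.

```typescript
type Entry<T> = { value: T; storedAt: number };

// Returns a getter that serves the cached value immediately and, once it is
// older than `maxAgeMs`, refreshes it in the background for the next caller.
function createSwrCache<T>(
  refetch: () => Promise<T>,
  maxAgeMs: number,
  now: () => number = Date.now,
) {
  let entry: Entry<T> | undefined;
  let pending: Promise<T> | undefined;

  // Starts at most one refetch at a time; concurrent callers share it.
  function revalidate(): Promise<T> {
    if (pending) return pending;
    pending = refetch().then(
      (value) => {
        entry = { value, storedAt: now() };
        pending = undefined;
        return value;
      },
      (error) => {
        pending = undefined;
        throw error;
      },
    );
    return pending;
  }

  return async function get(
    background: (work: Promise<unknown>) => void,
  ): Promise<T> {
    const current = entry;
    if (!current) return revalidate(); // cold cache: the first caller has to wait
    if (now() - current.storedAt > maxAgeMs) {
      // Stale: keep serving the old value; a failed refresh is swallowed
      // so the stale value stays available.
      background(revalidate().catch(() => undefined));
    }
    return current.value;
  };
}
```

With a 24 hr `maxAgeMs`, only the very first visitor waits on Hive and Twitter; later visitors get the cached render, and the first one after expiry triggers the refresh without waiting for it.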

To optimize performance on Next.js + Vercel:

Use incremental static generation to fetch Twitter, Hive, and article website HTML.
To do so, I'll have to replace all Cloudflare-specific APIs (e.g. HTMLRewriter and cfs) with Node.js modules (e.g. metascraper and got) that can do the same things.
Use Next.js' built-in image optimization.

Now that I've developed a basic working version, I'm actually thinking that this would be a perfect application for ISR. However, I'm also intrigued by Remix's blog post which demonstrates that hosting Remix entirely on the edge (e.g. on Fly.io or Cloudflare) is actually faster than Next.js ISR. That said, as the time it takes to fetch page data and generate dynamic content increases, using ISR begins to make more and more practical sense. The attraction of Remix is based on the assumption that whatever APIs you're consuming in your loader function will respond in milliseconds, not seconds.
