rooteco / tweetscape

the supercharged twitter feed
https://prototype.tweetscape.co
GNU Affero General Public License v3.0
18 stars 2 forks source link

Investigate why tweets change #367

Closed nicholaschiang closed 2 years ago

nicholaschiang commented 2 years ago

Imported from @nicholaschiang's original Linear issue TS-38.

The local development server seems to change tweets data on page reload… why is that? Is this some limitation to PostgreSQL queries that I'm unaware of? Is my articles view distinct clause not ordered well?

nicholaschiang commented 2 years ago

Perhaps this has something to do with an invalid assumption I think I made when designing the articles query: Each tweet may have more than one link and thus perhaps the distinct() filter is messing up.

nicholaschiang commented 2 years ago

I have to debug my current articles view PostgreSQL query to:

  1. Include all of each tweets' URLs in the view, and;
  2. Debug this issue (that may have arisen from that bad assumption that each tweet only has one URL);

Before I'll be able to fix TS-17 fully (i.e. and format all of the included links).

nicholaschiang commented 2 years ago
           /* 
           This query omits (Tesla, 2758) but when you filter for author_name = 'Tesla' or a specific link_id = 2758 it is included. 
           Perhaps this has something to do with Beekeper Studio limitations? Or PostgreSQL inaccuracies with select distinct()?
           */
          select distinct on (tweets.author_id, urls.link_id)
            urls.link_id as link_id,
            tweets.*
          from urls
            inner join (
              select 
                influencers.name as author_name,
                influencers.username as author_username,
                tweets.*,
                scores.cluster_id as cluster_id,
                scores.insider_score as insider_score,
                scores.attention_score as attention_score,
                to_json(influencers.*) as author,
                to_json(scores.*) as score,
                json_agg(mentions.*) as mentions,
                json_agg(tags.*) as tags
              from tweets
                inner join influencers on influencers.id = tweets.author_id
                inner join scores on scores.influencer_id = influencers.id
                left outer join mentions on mentions.tweet_id = tweets.id
                left outer join tags on tags.tweet_id = tweets.id
              group by tweets.id, scores.id, influencers.id
            ) as tweets on tweets.id = urls.tweet_id
nicholaschiang commented 2 years ago

Aha! I found one bug in my PostgreSQL logic: I'm assuming that there's only a single score for each influencer, but that's not the case (e.g. both Elon Musk and Tesla have two scores each).

nicholaschiang commented 2 years ago

Here's the problem: because certain influencers (e.g. Elon Musk and Tesla) have scores for multiple clusters.id, whenever the select distinct happens to use the row that has a different clusters.slug than the one that we're querying for (e.g. if the @elonmusk row uses his attention_score for DogeCoin but we're querying for articles about Tesla, all of Elon's tweets will be omitted from the results).

The fix: select distinct on (urls.link_id, tweets.author_id, tweets.cluster_id so that I get all of the rows that have unique links, authors, and clusters. Then, the remaining of the query filters for a specific cluster so no duplicate tweets (from the same author for the same link but with a different cluster's score) should show up.