Closed lemon24 closed 2 years ago
I'm curious why the API key isn't a requirement? On NewsBlur we have individual users integrate Twitter over oauth and use their quotas to do the work, backing off in the case of 429s.
My understanding of this project is that it would be the backend for a news reader that is multi-user, in which case, asking a user to oauth with twitter when subscribing to a twitter.com url would be the desired path.
Here's my twitter fetcher and I'm still noodling on how to display threads more than a single reply deep.
https://github.com/samuelclay/NewsBlur/blob/master/utils/twitter_fetcher.py
Hey, thanks for reaching out! :)
Sorry the reply got so long, feel free to ignore the second half (I'm mostly thinking out loud).
this project [...] would be the backend for a news reader that is multi-user
Currently, reader is single-user. That said, it should be possible for it to be used in a multi-user context (details below).
I'm curious why the API key isn't a requirement?
To rephrase: As a user, I'd prefer not having to set up a Twitter account (I just want to follow a few public accounts, like I would a blog).
I plan to implement this initially as an experimental plugin (something like sqlite_releases), and then work up from that. If possible, I'd prefer to figure out the auth stuff later.
we have individual users integrate Twitter over oauth and use their quotas to do the work
Yup, seems like the right thing to do. I assume scraping Twitter aggressively leads to throttling (best case), and is against their terms of service, especially if doing it as a business.
Compared to what reader can do now, handling secrets on behalf of many users seems like a big undertaking; I'd prefer not storing those in the main database in plain text. (OTOH, fully encrypted storage might take care of that.)
still noodling on how to display threads more than a single reply deep
Haven't thought much about this.
The initial plan was to just look at the accounts's tweets (ignoring anyone else), and assemble threads into a single entry/article.
For nested conversations, at first I'd go with a tree-style thing, eventually making replies collapsible/collapsed (in a `<details>` element).
Still not sure how to map this to the reader data model. E.g. what happens if new tweets are added to a thread after the user marked the thread as read? If thread ~= article, they should be ignored, but that doesn't seem like the right thing to do.
Some background on reader development.
My main use case is a single-user web app.
I don't have a lot of time to work on reader, and at the moment it's mainly a "scratch my own itch" kind of thing. I'm keeping it small deliberately, so I don't lose motivation to work on it.
I haven't ventured into multi-user because I'd either need to use it in that way day-to-day or do a lot of research to get it right (I expect complexity / the number of use cases would be 3-10x of what they are now).
Some thoughts on multi-user in reader (mostly so I don't forget them).
The way I think about it, multi-user would be mostly transparent to Reader, and would be handled by the underlying Storage (DAO-like thing) – you'd have a storage that adds "where user_id == 123" to any query, and the web app using a Reader instance wouldn't have to care; plugins would work unchanged.
The cheapest (code-wise) way to get this is to simply have one SQLite database per user.
With "real" multi-user, you'd likely want to separate feeds data from user data, e.g. so a feed is only fetched/stored once, even if multiple users have it. (Depending on scale, a lot of storage/search changes would be needed to keep the thing efficient.)
(Regardless of what you do, you'd still have to have additional APIs for account-related stuff.)
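For what it's worth, the "one SQLite database per user" option can be sketched in plain sqlite3 (the `get_user_db` helper and schema below are illustrative only, not reader's API; in practice you'd pass a per-user path to `make_reader`):

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical sketch: one SQLite database per user, so the rest of the
# app never needs to add a "where user_id == ..." filter to anything.
def get_user_db(base_dir, user_id):
    path = Path(base_dir) / f"{user_id}.sqlite"
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS feeds (url TEXT PRIMARY KEY)")
    return conn

base_dir = tempfile.mkdtemp()
db = get_user_db(base_dir, "user-123")
db.execute("INSERT OR IGNORE INTO feeds VALUES ('https://example.com/feed')")
db.commit()
print([row[0] for row in db.execute("SELECT url FROM feeds")])
# ['https://example.com/feed']
```

Each user's data is fully isolated by construction, at the cost of fetching/storing shared feeds once per user.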
E.g. what happens if new tweets are added to a thread after the user marked the thread as read
Unless they're explicitly muting a thread, I would expect new tweets to continue to come in as new stories.
That seems acceptable, but there are two usability issues that I think should be addressed:
One way of doing this is to have the dedupe plugin handle it: instead of deduping on title+similarity, for Twitter feeds it would do it on a "dedupe string" that's set to the same unique id for all tweets in a thread (likely the id of the first tweet). E.g.:
- id: 1, dedupe id: 1, text: "text of first tweet"
- id: 2, dedupe id: 1, text: "text of first tweet, text of second tweet"
- id: 3, dedupe id: 1, text: "text of first tweet, text of second tweet, text of third tweet"
Once id:3 is posted, it would be the only entry for that thread (id:1 and id:2 having been deleted by the dedupe plugin).
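A minimal sketch of that dedupe-by-thread behavior, assuming each tweet entry carries the id of the first tweet in its thread as a dedupe id (names are illustrative, not the actual dedupe plugin's API):

```python
from collections import OrderedDict

# Keep only the newest entry per dedupe id; older versions of the same
# thread would be deleted by the (hypothetical) dedupe logic.
def dedupe_by_thread(entries):
    latest = OrderedDict()
    for entry in entries:  # entries assumed in ascending id order
        latest[entry["dedupe_id"]] = entry
    return list(latest.values())

entries = [
    {"id": 1, "dedupe_id": 1, "text": "first"},
    {"id": 2, "dedupe_id": 1, "text": "first second"},
    {"id": 3, "dedupe_id": 1, "text": "first second third"},
]
print([e["id"] for e in dedupe_by_thread(entries)])  # [3]
```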
... but, to achieve "new tweets come in as new stories", the dedupe plugin should not mark id:3 as read if id:2 was. (For normal articles, this is desirable.)
To have this, the feed would need to tell the dedupe plugin two things:
Moving the problem around a bit, we could just update id:1 in-place, and mark it as unread when new tweets come in.
Currently reader doesn't/can't do that: a read entry whose `<updated>` changes remains read (because for actual feeds, the update is a relatively minor change most of the time).
To have this, the retriever/parser would need to tell Reader:
Neither of the above solves (3).
I guess the dedupe plugin / parser could set some entry metadata saying "the user read this far". (The parser setting metadata is another thing not possible at the moment.)
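The "the user read this far" metadata idea could be as simple as remembering the highest tweet id seen when the entry was read, and flipping the entry back to unread only when newer tweets arrive (a sketch with made-up names, not reader's API):

```python
# Hypothetical "read this far" metadata: entry["read_up_to"] is the id of
# the last tweet the user had seen when marking the entry as read.
def apply_new_tweets(entry, new_tweet_ids):
    if any(tid > entry.get("read_up_to", 0) for tid in new_tweet_ids):
        entry["read"] = False  # something genuinely new arrived
    return entry

print(apply_new_tweets({"read": True, "read_up_to": 2}, [3])["read"])  # False
print(apply_new_tweets({"read": True, "read_up_to": 5}, [4])["read"])  # True
```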
Unless they're explicitly muting a thread, [...] new stories.
Reading this again, it seems to imply two things:
Currently (2.9), reader can emulate mute:
- the `mark_as_read` plugin can do this for "future" entries
- a `muted` feed tag with a default `feed_tags=['-muted']` filter, for feeds (i.e. all entries in the feed)
- the `mark_as_read` plugin can also mark as read all incoming entries (with a `.*` pattern)

... but there's no way of expressing "group of entries"; at a minimum, the "dedupe id" / "group id" thing described in my previous comment would be required.
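The `feed_tags=['-muted']` semantics (a leading `-` excludes feeds with that tag) can be sketched in plain Python; `filter_feeds` below is a made-up stand-in, not reader's actual filtering code:

```python
# Sketch: tags prefixed with '-' exclude feeds that have the tag,
# unprefixed tags are required. (Illustrative; reader's real filter
# supports more than shown here.)
def filter_feeds(feeds, feed_tags):
    excluded = {t[1:] for t in feed_tags if t.startswith("-")}
    required = {t for t in feed_tags if not t.startswith("-")}
    return [
        f for f in feeds
        if required <= f["tags"] and not (excluded & f["tags"])
    ]

feeds = [
    {"url": "a", "tags": {"muted"}},
    {"url": "b", "tags": set()},
]
print([f["url"] for f in filter_feeds(feeds, ["-muted"])])  # ['b']
```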
What I'm trying to achieve here is to find a balance:
(Again, this last part is mostly me talking to myself :)
I did some more thinking / investigation.
First, regarding API keys: Asking the user to provide one is the way to go, @samuelclay was right :)
Turns out snscrape is using a baked-in bearer token underneath, and that obviously leads to throttling (and my "no API key" idea hinged on snscrape working without one). Most (other) Twitter scraping libraries I looked at seem to be unmaintained, which probably indicates that scraping is too hard.
Also, getting a bearer token for essential access is pretty straightforward, so I don't consider that an issue anymore.
Second, I experimented a bit with Tweepy and the v2 API, and I have a clearer idea of what I want threads to look like.
As mentioned before, a thread should be (or at least look like) a single article. Tweets would be shown in a list, with replies collapsed by default:
- first tweet (2 replies, click to expand)
- second tweet (2 replies, click to collapse)
  - first reply
    - reply to reply
  - second reply
- third tweet
When a new tweet appears in a thread, it gets added to the existing article, and the article becomes unread.
It's also possible for a subscription to leave replies out entirely (it's easier to retrieve tweets for this, so we should likely do this one first); the example above would look like:
- first tweet
- second tweet
- third tweet
Some implementation notes (v2 API).
Tweets in the same thread can be grouped by `tweet.conversation_id`, which is the id of the first tweet in the thread; replies (regardless of author) share that conversation_id too.
Tweets in the same thread can be arranged in a tree by `tweet.referenced_tweets[type=replied_to].id` (from child to parent); the tree for the example above (note "second tweet" is a reply to "first tweet"):
- first tweet
  - second tweet
    - first reply
      - reply to reply
    - second reply
    - third tweet
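Building that tree from the `replied_to` references can be sketched in plain Python (the tweet dicts below are simplified stand-ins for v2 API objects, and "third tweet" is assumed to reply to "second tweet", as in an author-chained thread):

```python
# Arrange a thread's tweets into a tree via child -> parent references,
# then render it as an indented list (two spaces per level).
def build_tree(tweets):
    children = {t["id"]: [] for t in tweets}
    roots = []
    for t in tweets:
        parent = t.get("replied_to")
        if parent in children:
            children[parent].append(t)
        else:
            roots.append(t)

    def render(t, depth=0):
        lines = ["  " * depth + t["text"]]
        for child in children[t["id"]]:
            lines.extend(render(child, depth + 1))
        return lines

    return [line for root in roots for line in render(root)]

tweets = [
    {"id": 1, "text": "first tweet", "replied_to": None},
    {"id": 2, "text": "second tweet", "replied_to": 1},
    {"id": 3, "text": "first reply", "replied_to": 2},
    {"id": 4, "text": "reply to reply", "replied_to": 3},
    {"id": 5, "text": "second reply", "replied_to": 2},
    {"id": 6, "text": "third tweet", "replied_to": 2},
]
print("\n".join(build_tree(tweets)))
```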
Getting only the user's feed should be doable with (pseudocode):
tweets = tweepy.get_users_tweets(user, since_id=last_tweet, exclude='replies')
for conversation_id, group in group_by_conversation(tweets):
    append_tweets(conversation_id, group)
    set_last_tweet(conversation_id, max(tweets))
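`group_by_conversation` in the pseudocode is a hypothetical helper; a plain-Python version might look like:

```python
from itertools import groupby

# Bucket tweets by conversation_id (sort first, since groupby only
# groups adjacent items). Tweet dicts are simplified stand-ins.
def group_by_conversation(tweets):
    keyed = sorted(tweets, key=lambda t: t["conversation_id"])
    for conversation_id, group in groupby(
        keyed, key=lambda t: t["conversation_id"]
    ):
        yield conversation_id, list(group)

tweets = [
    {"id": 11, "conversation_id": 10},
    {"id": 21, "conversation_id": 20},
    {"id": 12, "conversation_id": 10},
]
print({cid: [t["id"] for t in ts] for cid, ts in group_by_conversation(tweets)})
# {10: [11, 12], 20: [21]}
```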
For replies, the only sane way of doing it I could find is to do one additional search per conversation (you can only use one `conversation_id:...` per query):
tweets = tweepy.get_users_tweets(user, since_id=last_tweet, exclude='replies')
for conversation_id, group in group_by_conversation(tweets):
    append_tweets(conversation_id, group)

for conversation_id in get_conversations(newer_than='30 days'):
    # will only return tweets from the last week (essential access)
    conversation_tweets = tweepy.search_recent_tweets(
        f"conversation_id:{conversation_id} is:reply",
        since_id=last_tweet,
    )
    append_tweets(conversation_id, conversation_tweets)
    tweets.extend(conversation_tweets)
    set_last_tweet(conversation_id, max(tweets))
How this maps to the reader data model, I'm not entirely sure.
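The `set_last_tweet` / `since_id` bookkeeping in the pseudocode could be as simple as tracking the highest seen tweet id per conversation (hypothetical helper; in practice this would live in feed/entry metadata):

```python
# Hypothetical bookkeeping: remember the newest tweet id per conversation,
# so the next update can pass it as since_id and fetch only newer tweets.
last_tweet_by_conversation = {}

def set_last_tweet(conversation_id, tweet_id):
    current = last_tweet_by_conversation.get(conversation_id, 0)
    last_tweet_by_conversation[conversation_id] = max(current, tweet_id)

set_last_tweet(10, 11)
set_last_tweet(10, 12)
set_last_tweet(20, 21)
print(last_tweet_by_conversation)  # {10: 12, 20: 21}
```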
Prototype showing how a Twitter reader plugin could work: https://gist.github.com/lemon24/b7c4039ee6657ebb2347b5e338a0dca7
So, after thinking about this a bit: we don't really need to change the updater like I did in the prototype (that was an artifact of monkeypatching my way in); instead, we can move the merging of old/new entry data earlier in the update pipeline:
self._storage.get_feeds_for_update \
| retriever.process_feeds_for_update \ # add extra data
| self._updater.process_old_feed \
| xargs -n1 -P $workers self._parser.retrieve \
| self._parser.parse \
| self._get_entries_for_update \
| parser.process_entry_pairs \ # merge entries
| self._updater.make_update_intents \
| self._update_feed
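The "merge entries" step (`process_entry_pairs`) could, as a sketch, combine the tweets already stored on the old entry with the newly retrieved ones (illustrative dicts and names, not the actual reader internals):

```python
# For each (new, old) entry pair, merge the old entry's stored tweets with
# the newly retrieved ones; the merged entry replaces both.
def process_entry_pairs(pairs):
    for new, old in pairs:
        tweets = dict(old["tweets"]) if old else {}
        tweets.update(new["tweets"])  # newer versions win on id collision
        yield {"id": new["id"], "tweets": tweets}

old = {"id": "conv-1", "tweets": {1: "first", 2: "second"}}
new = {"id": "conv-1", "tweets": {3: "third"}}
print(next(process_entry_pairs([(new, old)])))
# {'id': 'conv-1', 'tweets': {1: 'first', 2: 'second', 3: 'third'}}
```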
reader 2.13, now available on PyPI, includes experimental Twitter support; docs.
This was waaay more complicated than I expected initially; it's taken at least 50 hours of work so far.
There's still a bunch of stuff to be done, but most things are there. What remains will be done ... later. :)
@samuelclay, if you're still interested in this topic, consider giving the plugin a try. Most of the tweet retrieving logic and Twitter JSON -> HTML stuff should be reusable outside reader, and it might even be possible to pull them into a separate package.
Some notes:
#222 has the same issue; let's converge there.