Closed lemon24 closed 2 years ago
I'm curious why the API key isn't a requirement? On NewsBlur we have individual users integrate Twitter over oauth and use their quotas to do the work, backing off in the case of 429s.
My understanding of this project is that it would be the backend for a news reader that is multi-user, in which case, asking a user to oauth with twitter when subscribing to a twitter.com url would be the desired path.
Here's my twitter fetcher and I'm still noodling on how to display threads more than a single reply deep.
https://github.com/samuelclay/NewsBlur/blob/master/utils/twitter_fetcher.py
Hey, thanks for reaching out! :)
Sorry the reply got so long, feel free to ignore the second half (I'm mostly thinking out loud).
this project [...] would be the backend for a news reader that is multi-user
Currently, reader is single-user. That said, it should be possible for it to be used in a multi-user context (details below).
I'm curious why the API key isn't a requirement?
To rephrase: As a user, I'd prefer not having to set up a Twitter account (I just want to follow a few public accounts, like I would a blog).
I plan to implement this initially as an experimental plugin (something like sqlite_releases), and then work up from that. If possible, I'd prefer to figure out the auth stuff later.
we have individual users integrate Twitter over oauth and use their quotas to do the work
Yup, seems like the right thing to do. I assume scraping Twitter aggressively leads to throttling (best case), and is against their terms of service, especially if doing it as a business.
Compared to what reader can do now, handling secrets on behalf of many users seems like a big undertaking; I'd prefer not storing those in the main database in plain text. (OTOH, fully encrypted storage might take care of that.)
still noodling on how to display threads more than a single reply deep
Haven't thought much about this.
The initial plan was to just look at the accounts's tweets (ignoring anyone else), and assemble threads into a single entry/article.
For nested conversations, at first I'd go with a tree-style thing, eventually making replies collapsible/collapsed (in a `<details>` element).
Still not sure how to map this to the reader data model. E.g. what happens if new tweets are added to a thread after the user marked the thread as read? If thread ~= article, they should be ignored, but that doesn't seem like the right thing to do.
Some background on reader development.
My main use case is a single-user web app.
I don't have a lot of time to work on reader, and at the moment it's mainly a "scratch my own itch" kind of thing. I'm keeping it small deliberately, so I don't lose motivation to work on it.
I haven't ventured into multi-user because I'd either need to use it in that way day-to-day or do a lot of research to get it right (I expect complexity / the number of use cases would be 3-10x of what they are now).
Some thoughts on multi-user in reader (mostly so I don't forget them).
The way I think about it, multi-user would be mostly transparent to Reader, and would be handled by the underlying Storage (DAO-like thing) – you'd have a storage that adds "where user_id == 123" to any query, and the web app using a Reader instance wouldn't have to care; plugins would work unchanged.
The cheapest (code-wise) way to get this is to simply have one SQLite database per user.
With "real" multi-user, you'd likely want to separate feeds data from user data, e.g. so a feed is only fetched/stored once, even if multiple users have it. (Depending on scale, a lot of storage/search changes would be needed to keep the thing efficient.)
(Regardless of what you do, you'd still have to have additional APIs for account-related stuff.)
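For what it's worth, the "one SQLite database per user" option can be sketched in plain sqlite3 (the `get_user_db` helper and schema below are illustrative only, not reader's API; in practice you'd pass a per-user path to `make_reader`):

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical sketch: one SQLite database per user, so the rest of the
# app never needs to add a "where user_id == ..." filter to anything.
def get_user_db(base_dir, user_id):
    path = Path(base_dir) / f"{user_id}.sqlite"
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS feeds (url TEXT PRIMARY KEY)")
    return conn

base_dir = tempfile.mkdtemp()
db = get_user_db(base_dir, "user-123")
db.execute("INSERT OR IGNORE INTO feeds VALUES ('https://example.com/feed')")
db.commit()
print([row[0] for row in db.execute("SELECT url FROM feeds")])
# ['https://example.com/feed']
```

Each user's data is fully isolated by construction, at the cost of fetching/storing shared feeds once per user.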
E.g. what happens if new tweets are added to a thread after the user marked the thread as read
Unless they're explicitly muting a thread, I would expect new tweets to continue to come in as new stories.
That seems acceptable, but there are two usability issues that I think should be addressed:
One way of doing this is to have the dedupe plugin handle it: instead of deduping on title+similarity, for Twitter feeds it would do it on a "dedupe string" that's set to the same unique id for all tweets in a thread (likely the id of the first tweet). E.g.:
- id: 1, dedupe id: 1, text: "text of first tweet"
- id: 2, dedupe id: 1, text: "text of first tweet, text of second tweet"
- id: 3, dedupe id: 1, text: "text of first tweet, text of second tweet, text of third tweet"
Once id:3 is posted, it would be the only entry for that thread (id:1 and id:2 having been deleted by the dedupe plugin).
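A minimal sketch of that dedupe-by-thread behavior, assuming each tweet entry carries the id of the first tweet in its thread as a dedupe id (names are illustrative, not the actual dedupe plugin's API):

```python
from collections import OrderedDict

# Keep only the newest entry per dedupe id; older versions of the same
# thread would be deleted by the (hypothetical) dedupe logic.
def dedupe_by_thread(entries):
    latest = OrderedDict()
    for entry in entries:  # entries assumed in ascending id order
        latest[entry["dedupe_id"]] = entry
    return list(latest.values())

entries = [
    {"id": 1, "dedupe_id": 1, "text": "first"},
    {"id": 2, "dedupe_id": 1, "text": "first second"},
    {"id": 3, "dedupe_id": 1, "text": "first second third"},
]
print([e["id"] for e in dedupe_by_thread(entries)])  # [3]
```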
... but, to achieve "new tweets come in as new stories", the dedupe plugin should not mark id:3 as read if id:2 was. (For normal articles, this is desirable.)
To have this, the feed would need to tell the dedupe plugin two things:
Moving the problem around a bit, we could just update id:1 in-place, and mark it as unread when new tweets come in.
Currently reader doesn't/can't do that: a read entry whose `<updated>` changes remains read (because for actual feeds, the update is a relatively minor change most of the time).
To have this, the retriever/parser would need to tell Reader:
Neither of the above solves (3).
I guess the dedupe plugin / parser could set some entry metadata saying "the user read this far". (The parser setting metadata is another thing not possible at the moment.)
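The "the user read this far" metadata idea could be as simple as remembering the highest tweet id seen when the entry was read, and flipping the entry back to unread only when newer tweets arrive (a sketch with made-up names, not reader's API):

```python
# Hypothetical "read this far" metadata: entry["read_up_to"] is the id of
# the last tweet the user had seen when marking the entry as read.
def apply_new_tweets(entry, new_tweet_ids):
    if any(tid > entry.get("read_up_to", 0) for tid in new_tweet_ids):
        entry["read"] = False  # something genuinely new arrived
    return entry

print(apply_new_tweets({"read": True, "read_up_to": 2}, [3])["read"])  # False
print(apply_new_tweets({"read": True, "read_up_to": 5}, [4])["read"])  # True
```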
Unless they're explicitly muting a thread, [...] new stories.
Reading this again, it seems to imply two things:
Currently (2.9), reader can emulate mute:
- the `mark_as_read` plugin can do this for "future" entries
- a `muted` feed tag with a default `feed_tags=['-muted']` filter, for feeds (i.e. all entries in the feed)
- the `mark_as_read` plugin can also mark as read all incoming entries (with a `.*` pattern)

... but there's no way of expressing "group of entries"; at a minimum, the "dedupe id" / "group id" thing described in my previous comment would be required.
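The `feed_tags=['-muted']` semantics (a leading `-` excludes feeds with that tag) can be sketched in plain Python; `filter_feeds` below is a made-up stand-in, not reader's actual filtering code:

```python
# Sketch: tags prefixed with '-' exclude feeds that have the tag,
# unprefixed tags are required. (Illustrative; reader's real filter
# supports more than shown here.)
def filter_feeds(feeds, feed_tags):
    excluded = {t[1:] for t in feed_tags if t.startswith("-")}
    required = {t for t in feed_tags if not t.startswith("-")}
    return [
        f for f in feeds
        if required <= f["tags"] and not (excluded & f["tags"])
    ]

feeds = [
    {"url": "a", "tags": {"muted"}},
    {"url": "b", "tags": set()},
]
print([f["url"] for f in filter_feeds(feeds, ["-muted"])])  # ['b']
```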
What I'm trying to achieve here is to find a balance:
(Again, this last part is mostly me talking to myself :)
I did some more thinking / investigation.
First, regarding API keys: Asking the user to provide one is the way to go, @samuelclay was right :)
Turns out snscrape is using a baked-in bearer token underneath, and that obviously leads to throttling (and my "no API key" idea hinged on snscrape working without one). Most (other) Twitter scraping libraries I looked at seem to be unmaintained, which probably indicates that scraping is too hard.
Also, getting a bearer token for essential access is pretty straightforward, so I don't consider that an issue anymore.
Second, I experimented a bit with Tweepy and the v2 API, and I have a clearer idea of what I want threads to look like.
As mentioned before, a thread should be (or at least look like) a single article. Tweets would be shown in a list, with replies collapsed by default:
- first tweet (2 replies, click to expand)
- second tweet (2 replies, click to collapse)
  - first reply
    - reply to reply
  - second reply
- third tweet
When a new tweet appears in a thread, it gets added to the existing article, and the article becomes unread.
It's also possible for a subscription to leave replies out entirely (it's easier to retrieve tweets for this, so we should likely do this one first); the example above would look like:
- first tweet
- second tweet
- third tweet
Some implementation notes (v2 API).
Tweets in the same thread can be grouped by `tweet.conversation_id`, which is the id of the first tweet in the thread; replies (regardless of author) share that conversation_id too.
Tweets in the same thread can be arranged in a tree by `tweet.referenced_tweets[type=replied_to].id` (from child to parent); the tree for the example above (note "second tweet" is a reply to "first tweet"):
- first tweet
  - second tweet
    - first reply
      - reply to reply
    - second reply
    - third tweet
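Building that tree from the `replied_to` references can be sketched in plain Python (the tweet dicts below are simplified stand-ins for v2 API objects, and "third tweet" is assumed to reply to "second tweet", as in an author-chained thread):

```python
# Arrange a thread's tweets into a tree via child -> parent references,
# then render it as an indented list (two spaces per level).
def build_tree(tweets):
    children = {t["id"]: [] for t in tweets}
    roots = []
    for t in tweets:
        parent = t.get("replied_to")
        if parent in children:
            children[parent].append(t)
        else:
            roots.append(t)

    def render(t, depth=0):
        lines = ["  " * depth + t["text"]]
        for child in children[t["id"]]:
            lines.extend(render(child, depth + 1))
        return lines

    return [line for root in roots for line in render(root)]

tweets = [
    {"id": 1, "text": "first tweet", "replied_to": None},
    {"id": 2, "text": "second tweet", "replied_to": 1},
    {"id": 3, "text": "first reply", "replied_to": 2},
    {"id": 4, "text": "reply to reply", "replied_to": 3},
    {"id": 5, "text": "second reply", "replied_to": 2},
    {"id": 6, "text": "third tweet", "replied_to": 2},
]
print("\n".join(build_tree(tweets)))
```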
Getting only the user's feed should be doable with (pseudocode):
tweets = tweepy.get_users_tweets(user, since_id=last_tweet, exclude='replies')
for conversation_id, group in group_by_conversation(tweets):
    append_tweets(conversation_id, group)
    set_last_tweet(conversation_id, max(tweets))
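`group_by_conversation` in the pseudocode is a hypothetical helper; a plain-Python version might look like:

```python
from itertools import groupby

# Bucket tweets by conversation_id (sort first, since groupby only
# groups adjacent items). Tweet dicts are simplified stand-ins.
def group_by_conversation(tweets):
    keyed = sorted(tweets, key=lambda t: t["conversation_id"])
    for conversation_id, group in groupby(
        keyed, key=lambda t: t["conversation_id"]
    ):
        yield conversation_id, list(group)

tweets = [
    {"id": 11, "conversation_id": 10},
    {"id": 21, "conversation_id": 20},
    {"id": 12, "conversation_id": 10},
]
print({cid: [t["id"] for t in ts] for cid, ts in group_by_conversation(tweets)})
# {10: [11, 12], 20: [21]}
```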
For replies, the only sane way of doing it I could find is to do one additional search per conversation (you can only use one `conversation_id:...` per query):
tweets = tweepy.get_users_tweets(user, since_id=last_tweet, exclude='replies')
for conversation_id, group in group_by_conversation(tweets):
    append_tweets(conversation_id, group)

for conversation_id in get_conversations(newer_than='30 days'):
    # will only return tweets from the last week (essential access)
    conversation_tweets = tweepy.search_recent_tweets(
        f"conversation_id:{conversation_id} is:reply",
        since_id=last_tweet,
    )
    append_tweets(conversation_id, conversation_tweets)
    tweets.extend(conversation_tweets)
    set_last_tweet(conversation_id, max(tweets))
How this maps to the reader data model, I'm not entirely sure.
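The `set_last_tweet` / `since_id` bookkeeping in the pseudocode could be as simple as tracking the highest seen tweet id per conversation (hypothetical helper; in practice this would live in feed/entry metadata):

```python
# Hypothetical bookkeeping: remember the newest tweet id per conversation,
# so the next update can pass it as since_id and fetch only newer tweets.
last_tweet_by_conversation = {}

def set_last_tweet(conversation_id, tweet_id):
    current = last_tweet_by_conversation.get(conversation_id, 0)
    last_tweet_by_conversation[conversation_id] = max(current, tweet_id)

set_last_tweet(10, 11)
set_last_tweet(10, 12)
set_last_tweet(20, 21)
print(last_tweet_by_conversation)  # {10: 12, 20: 21}
```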
Prototype showing how a Twitter reader plugin could work: https://gist.github.com/lemon24/b7c4039ee6657ebb2347b5e338a0dca7
So, after thinking about this a bit: we don't really need to change the updater like I did in the prototype (that was an artifact of monkeypatching my way in); instead, we can move the merging of old/new entry data earlier in the update pipeline:
self._storage.get_feeds_for_update \
| retriever.process_feeds_for_update \ # add extra data
| self._updater.process_old_feed \
| xargs -n1 -P $workers self._parser.retrieve \
| self._parser.parse \
| self._get_entries_for_update \
| parser.process_entry_pairs \ # merge entries
| self._updater.make_update_intents \
| self._update_feed
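The "merge entries" step (`process_entry_pairs`) could, as a sketch, combine the tweets already stored on the old entry with the newly retrieved ones (illustrative dicts and names, not the actual reader internals):

```python
# For each (new, old) entry pair, merge the old entry's stored tweets with
# the newly retrieved ones; the merged entry replaces both.
def process_entry_pairs(pairs):
    for new, old in pairs:
        tweets = dict(old["tweets"]) if old else {}
        tweets.update(new["tweets"])  # newer versions win on id collision
        yield {"id": new["id"], "tweets": tweets}

old = {"id": "conv-1", "tweets": {1: "first", 2: "second"}}
new = {"id": "conv-1", "tweets": {3: "third"}}
print(next(process_entry_pairs([(new, old)])))
# {'id': 'conv-1', 'tweets': {1: 'first', 2: 'second', 3: 'third'}}
```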
reader 2.13, now available on PyPI, includes experimental Twitter support; docs.
This was waaay more complicated than I expected initially; it's taken at least 50 hours of work so far.
There's still a bunch of stuff to be done, but most things are there. What remains will be done ... later. :)
@samuelclay, if you're still interested in this topic, consider giving the plugin a try. Most of the tweet retrieving logic and Twitter JSON -> HTML stuff should be reusable outside reader, and it might even be possible to pull them into a separate package.
Some notes:
#222 has the same issue; let's converge there.