In light of changes to Twitter's API coming Feb 9

robertoszek commented 1 year ago

I guess adding scraping capabilities to the bot has become a priority.

Using RSS feeds as a source will hopefully continue to work after February 9th (if you can find a working Nitter instance, RSSHub or some other third-party site that's still able to generate an RSS feed).
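For readers who want to try the RSS route, here is a minimal sketch of pulling posts out of an RSS 2.0 feed using only the Python standard library. The `FeedItem` and `parse_rss` names and the sample feed (including the `nitter.example` host) are made up for illustration; this is not the bot's actual RSS code, and real feeds from Nitter or RSSHub may carry extra fields.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class FeedItem:
    title: str
    link: str
    published: str


def parse_rss(xml_text: str) -> list[FeedItem]:
    """Extract the <item> entries of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter("item"):
        items.append(
            FeedItem(
                # findtext returns None when the child element is missing
                title=(node.findtext("title") or "").strip(),
                link=(node.findtext("link") or "").strip(),
                published=(node.findtext("pubDate") or "").strip(),
            )
        )
    return items


# Hypothetical sample feed; a real one would be downloaded over HTTP.
SAMPLE = """<rss version="2.0">
  <channel>
    <title>example / @example</title>
    <item>
      <title>First post</title>
      <link>https://nitter.example/example/status/1</link>
      <pubDate>Sat, 04 Feb 2023 21:17:59 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>https://nitter.example/example/status/2</link>
      <pubDate>Sat, 04 Feb 2023 21:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

for item in parse_rss(SAMPLE):
    print(item.published, item.title, item.link)
```

Keeping the parser separate from the download step means the same function works regardless of which third-party site produced the feed.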

edel79 commented 1 year ago

Hello, I've been using your script for a few days and I agree with your statement. I was wondering about support for the Twint Python library (https://github.com/twintproject/twint), which is capable of scraping Twitter content. It could be a good starting point.

tomakun commented 1 year ago

Saw that earlier, it sucks...

Just to confirm: if you get paid access to the Twitter API, you can theoretically still use it as is, right @robertoszek? Provided you use a valid Twitter token, of course.

robertoszek commented 1 year ago

> Just to confirm: if you get paid access to the Twitter API, you can theoretically still use it as is, right @robertoszek? Provided you use a valid Twitter token, of course.

Potentially, yes. Assuming they don't change the baseline API endpoint behavior or add additional steps to authenticate with a paid token, the bot would theoretically continue to work.

The thing is, nobody really knows how it's going to change or be implemented. We'll have to wait until the 9th and see, once the dust settles, what our options are going forward.

edel79 commented 1 year ago

As a potential replacement, this scraper also seems good, and quite lightweight: https://github.com/JustAnotherArchivist/snscrape It's working great as of today.

robertoszek commented 1 year ago

> As a potential replacement, this scraper also seems good, and quite lightweight: https://github.com/JustAnotherArchivist/snscrape It's working great as of today.

It seems to use the unofficial GraphQL endpoint for scraping data: https://github.com/JustAnotherArchivist/snscrape/blob/23ebdd2a3ce6c3e93012e2b5bc7c2b02c749aaf2/snscrape/modules/twitter.py#L1704

In addition to https://api.twitter.com/2/search/adaptive.json: https://github.com/JustAnotherArchivist/snscrape/blob/23ebdd2a3ce6c3e93012e2b5bc7c2b02c749aaf2/snscrape/modules/twitter.py#L1549

We already use https://api.twitter.com/2/search/adaptive.json with guest tokens on the bot currently: https://github.com/robertoszek/pleroma-bot/blob/9a64891385d8321a84c37f3fba1fba6bd7b785ee/pleroma_bot/_twitter.py#L565

However, the adaptive.json endpoint was severely limited recently (to only top results for non-logged-in users, removing any option to scrape by latest).

I'll look into how feasible it would be to use the GraphQL endpoint for our own scraping too.

edel79 commented 1 year ago

Using snscrape, I just ran a request to get the last 100 tweets for a specific Twitter user (@transportsidf), and it worked well. I don't know what the limits are, but if we can get at least 100 tweets at a time, that seems enough for a bot, I think. However, using pleroma-bot in guest mode gives me this error (same Twitter account):

Gathering tweets... 0 ✖ 2023-02-04 21:17:59,995 - pleroma_bot - ERROR - Unable to retrieve tweets. Is the account protected? If so, you need to provide the following OAuth 1.0a fields in the user config:

consumer_key
consumer_secret
access_token_key
access_token_secret (cli.py:645)

If I use my API token, it works fine. I don't know whether I'm doing something wrong or whether it's a limitation/change in how guest mode works.

nemobis commented 1 year ago

> I guess adding scraping capabilities to the bot has become a priority.

As a bridge solution, maybe pleroma-bot could scrape a Nitter instance? I'd be happy to set up a Nitter instance for my own pleroma-bot to scrape.

Then there's https://github.com/zedeus/nitter/issues/389

dawnerd commented 1 year ago

Looks like it's finally here https://tapbots.social/@paul/110109551743991074

dawnerd commented 1 year ago

We just saw our access revoked overnight :/

gigantuar commented 1 year ago

Same here, it finally stopped working yesterday. I’ll need to start experimenting with using RSS via Nitter.

Edit: https://github.com/mahrtayyab/tweety looks like a great alternative to use instead of polling RSS.

edel79 commented 1 year ago

My API key was switched back to the free plan, so I can't extract tweets anymore either. As I previously mentioned, snscrape is still working for retrieving tweets.

dawnerd commented 1 year ago

I switched to using rsshub; I tried nitter, but that was very buggy. I think adopting the full GraphQL endpoints would be the best path forward.

edel79 commented 1 year ago

This one is very simple and is working too: https://gitlab.com/jeancf/twoot It uses random nitter instances to extract tweets.

edel79 commented 1 year ago

@robertoszek any chance of future development to handle the end of the free API using one of the above solutions?

dawnerd commented 1 year ago

rsshub isn't perfect either; HTML ends up being embedded.
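One possible workaround, sketched under the assumption that the feed's item text arrives as an HTML fragment: reduce it to plain text before posting. The `html_to_text` helper below is hypothetical (it is not part of pleroma-bot or rsshub) and uses only the standard library's `html.parser`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content, turning <br> and </p> into line breaks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment: str) -> str:
    """Return the visible text of an HTML fragment, dropping all tags."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


sample = '<p>Line one<br>Line two &amp; more</p><p><a href="https://example.com">link</a></p>'
print(html_to_text(sample))
```

This drops link targets along with the tags; a real integration would probably want to keep the `href` values, which is left out here to keep the sketch short.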

Vardor commented 1 year ago

I'm also having problems with the Twitter API. My bots are no longer working and I can't make them work with an RSS source. I've found a Python scraper for nitter called pnyter and I'm starting to explore it to see what I can do. I've created a Matrix channel in case anyone wants to join and exchange ideas: #pletomabot:matrix.org https://matrix.to/#/!DmKYBjBcZXoeKlRmMU:matrix.org?via=matrix.org

edel79 commented 1 year ago

Hello @AltGrCarlos, the main problem here is that the creator of this bot is not currently active enough to make the necessary fixes. I would say 75% of the code is still working, and this bot does more than a simple scraper: it also updates the user profile, which is great, and posts tweets to Mastodon. So the part needing a fix is the scraping from Twitter. Everything else can be kept as is. If you could create a fork with an updated and functional scraper, that would be great.

PS: I don't know about Matrix; in terms of live chat, Discord is probably more widely used.

Vardor commented 1 year ago

> Hello @AltGrCarlos, the main problem here is that the creator of this bot is not currently active enough to make the necessary fixes. I would say 75% of the code is still working, and this bot does more than a simple scraper: it also updates the user profile, which is great, and posts tweets to Mastodon. So the part needing a fix is the scraping from Twitter. Everything else can be kept as is. If you could create a fork with an updated and functional scraper, that would be great.
>
> PS: I don't know about Matrix; in terms of live chat, Discord is probably more widely used.

Hi. I'm not a very good programmer, but I'm trying to understand the code before making any modifications. I'm also trying to develop my own nitter scraper in order to get the specific information I need from Twitter.

edel79 commented 1 year ago

While waiting for a fix to make pleroma-bot work again, I have set up Twoot (previously mentioned) as a replacement. It's working fine without an API key.

us3r1d commented 1 year ago

After last week's API changes breaking nitter, I'm now using https://github.com/12joan/twitter-client to generate RSS for stork.

Just so you know stork is still working and still useful. :-)

It'd be nice if I could find some way to get profile updates happening while still getting the tweets from RSS; I'll post here again if I figure out a way to do that.

robertoszek commented 1 year ago

Hey, sorry for being a lot less active.

I've been moving across countries during the last 6 months and between all the logistics and bureaucracy involved (getting a visa, a work permit, finding an apartment, packing, etc.) in addition to keeping a day job, it basically left little to no time to do anything else.

I'm glad this project was still somewhat useful for some of you during that time with the scraping functionality implementation still pending. My intention is to get back to it and try to make it work in the current state of affairs. Thank you all for sharing the different projects you've found success with, I'll take a look at their approach and see what works and doesn't at the moment.

robertoszek commented 1 year ago

Got profile info and pinned tweet gathering working. https://github.com/robertoszek/pleroma-bot/commit/c96943e6e6cdcd4725d118589686d5282197265e The user timeline scraping seems a lot more involved, requiring "guest accounts".

These guest accounts seem to be restricted by IP, so only a limited amount can be created from the same host/IP.

I'm thinking about adding a flag so they can be created easily on demand:

$ pleroma-bot --create-guest-account

being dumped to guest_accounts.json, for example.

And if you have access to a list of proxies that could be used to generate more accounts at the same time, perhaps passing them as a text file (by a flag or on the config file):

$ pleroma-bot --create-guest-account --proxies-file my_proxies.txt

And of course the bot would also need to try generating additional guest accounts in the middle of a run if it gets rate limited. I need to think about it a bit more but there's definitely some progress being made.
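To make the idea concrete, here is a rough sketch of how such a pool could behave: reuse stored guest accounts, mark one as rate limited when that happens, and create a new one on demand, persisting everything to `guest_accounts.json`. Every name here (`GuestAccount`, `GuestAccountPool`, `fake_create`) is hypothetical, the account factory is a stand-in that does no network calls, and proxy handling is left out entirely; nothing below reflects code that exists in the bot.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable


@dataclass
class GuestAccount:
    token: str
    rate_limited: bool = False


class GuestAccountPool:
    """Hypothetical pool that persists guest accounts to a JSON file."""

    def __init__(self, path: Path, create_account: Callable[[], GuestAccount]):
        self.path = path
        self.create_account = create_account
        self.accounts = []
        if path.exists():
            self.accounts = [GuestAccount(**a) for a in json.loads(path.read_text())]

    def save(self) -> None:
        self.path.write_text(json.dumps([asdict(a) for a in self.accounts], indent=2))

    def acquire(self) -> GuestAccount:
        """Return a usable account, creating a new one if all are rate limited."""
        for account in self.accounts:
            if not account.rate_limited:
                return account
        account = self.create_account()
        self.accounts.append(account)
        self.save()
        return account

    def mark_rate_limited(self, account: GuestAccount) -> None:
        account.rate_limited = True
        self.save()


# Usage with a fake account factory (assumption: real creation would call Twitter).
_counter = iter(range(1, 100))


def fake_create() -> GuestAccount:
    return GuestAccount(token=f"guest-{next(_counter)}")


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "guest_accounts.json"
    pool = GuestAccountPool(store, fake_create)
    first = pool.acquire()          # nothing stored yet, so one is created
    pool.mark_rate_limited(first)   # simulate hitting the rate limit mid-run
    second = pool.acquire()         # a fresh account is generated on demand
    reloaded = GuestAccountPool(store, fake_create)
    reloaded_tokens = [a.token for a in reloaded.accounts]
```

Injecting the creation function keeps the rotation logic testable without touching the network, and a proxy-aware factory could be swapped in later.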

dawnerd commented 1 year ago

I have ~50 accounts in my config and run pleroma-bot every 15 minutes against a nitter RSS feed right now as a workaround. The guest accounts last for 30 days and I've ended up needing ~6k guest accounts to keep it running the whole time without erroring out. I use geonode for proxies FYI.
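As a back-of-the-envelope check on those numbers, the sketch below works out how many timeline fetches each guest account would end up serving. It assumes that every run fetches each of the 50 timelines exactly once and that the ~6k guest accounts cover one 30-day lifetime; the comment does not state either, so treat the result as an illustration only.

```python
MINUTES_PER_DAY = 24 * 60

accounts = 50            # Twitter accounts in the config
interval_minutes = 15    # how often the bot runs
lifetime_days = 30       # how long a guest account lasts
guest_accounts = 6000    # guest accounts needed (approximate)

runs_per_day = MINUTES_PER_DAY // interval_minutes   # 96
runs_per_lifetime = runs_per_day * lifetime_days     # 2880
fetches = runs_per_lifetime * accounts               # 144000 (assumes one fetch per account per run)
fetches_per_guest = fetches / guest_accounts         # 24.0

print(f"{fetches} fetches over {lifetime_days} days, "
      f"about {fetches_per_guest:.0f} per guest account")
```

Under those assumptions each guest account handles only a couple dozen fetches before it is exhausted, which gives a rough sense of why so many are needed.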

edel79 commented 1 year ago

As long as you use a working Nitter instance as the source, you don't have to deal with guest accounts: they are only used to scrape Twitter. Your bot is reading already scraped content, the content provided by Nitter. Well, when using the Nitter RSS feed, at least.

dawnerd commented 1 year ago

With these changes a lot of nitter instances have either turned off rss or asked people not to scrape them. I run my own so I'm not eating up guest tokens from someone else. Just keep that in mind. Generating guest tokens is extremely cheap on geonode too.

robertoszek / pleroma-bot

In light of changes to Twitter's API coming Feb 9 #120