robertoszek opened this issue 1 year ago
Hello, I've been using your script for a few days and I agree with your statement. I was wondering about support for the Twint Python library (https://github.com/twintproject/twint), which is capable of scraping Twitter content. It could be a good starting point for adding this support.
Saw that earlier, it sucks...
Just to confirm: if you get paid access to the Twitter API, you theoretically can still use it as is, right @robertoszek? Provided you use a valid Twitter token, of course.
Potentially, yes. Assuming they don't change the baseline API endpoints behavior or add additional steps to authenticate with a paid token, the bot would theoretically continue to work.
The thing is, nobody really knows how it's going to change or be implemented. We'll have to wait until the 9th and see, once the dust settles, what our options are going forward.
As a potential replacement, this scraper seems good too, and quite lightweight: https://github.com/JustAnotherArchivist/snscrape It's working great as of today.
It seems to use the unofficial GraphQL endpoint for scraping data:
https://github.com/JustAnotherArchivist/snscrape/blob/23ebdd2a3ce6c3e93012e2b5bc7c2b02c749aaf2/snscrape/modules/twitter.py#L1704
In addition to https://api.twitter.com/2/search/adaptive.json:
https://github.com/JustAnotherArchivist/snscrape/blob/23ebdd2a3ce6c3e93012e2b5bc7c2b02c749aaf2/snscrape/modules/twitter.py#L1549
We already use https://api.twitter.com/2/search/adaptive.json with guest tokens on the bot currently:
https://github.com/robertoszek/pleroma-bot/blob/9a64891385d8321a84c37f3fba1fba6bd7b785ee/pleroma_bot/_twitter.py#L565
However, the adaptive.json endpoint was severely limited recently (to only top results for non-logged-in users, removing any option to scrape by latest). I'll look into how feasible it would be to use the GraphQL endpoint for our own scraping too.
Using snscrape, I just did a request to get the last 100 tweets for a specific Twitter user (@transportsidf), and it worked well. I don't know what the limits are, but if we can get at least 100 tweets at a time, that seems enough for a bot, I think. However, using pleroma-bot in guest mode gives me this error (same Twitter account):
Gathering tweets... 0 ✖ 2023-02-04 21:17:59,995 - pleroma_bot - ERROR - Unable to retrieve tweets. Is the account protected? If so, you need to provide the following OAuth 1.0a fields in the user config:
If I use my API token instead, it works fine. I don't know if I'm doing something wrong or if it's a limitation/change in how guest mode works.
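For anyone wanting to reproduce this, here's a minimal sketch of pulling the latest N tweets with snscrape. The `TwitterUserScraper` usage follows the library's documented module interface, but it relies on unofficial endpoints that may break at any time; the `take_latest` helper is just illustration:

```python
import itertools

def take_latest(items, n):
    """Return at most n items from a (possibly unbounded) tweet iterator."""
    return list(itertools.islice(items, n))

def fetch_latest_tweets(username, n=100):
    """Hypothetical snscrape usage; requires `pip install snscrape`.
    The import is kept local so the helper above works without it."""
    import snscrape.modules.twitter as sntwitter
    scraper = sntwitter.TwitterUserScraper(username)
    return take_latest(scraper.get_items(), n)
```

Capping the iterator with `islice` matters because `get_items()` keeps paging backwards through the timeline until it runs out of results.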
I guess adding scraping capabilities to the bot has become a priority.
As a bridge solution, maybe pleroma-bot could scrape a Nitter instance? I'd be happy to set up a Nitter instance for my own pleroma-bot to scrape.
Then there's https://github.com/zedeus/nitter/issues/389
Looks like it's finally here https://tapbots.social/@paul/110109551743991074
We just saw our access revoked overnight :/
Same here, it finally stopped working yesterday. I’ll need to start experimenting with using RSS via Nitter.
Edit: https://github.com/mahrtayyab/tweety looks like a great alternative to use instead of polling RSS.
My API key switched back to the free plan, so I can't extract tweets anymore either. As I previously mentioned, snscrape is also still working to retrieve tweets.
I switched to using RSSHub; I tried Nitter, but that was very buggy. I think adopting the full GraphQL endpoints would be the best path forward.
This one, very simple, is working too: https://gitlab.com/jeancf/twoot It uses random Nitter instances to extract tweets.
@robertoszek any chance of future development to handle the end of the free API using one of the above solutions?
rsshub isn't perfect either; HTML ends up being embedded in the feed entries.
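If the feed entries contain embedded HTML, one stopgap is to strip the tags before posting. A minimal sketch using only the standard library (the exact feed content is an assumption; rsshub makes no guarantees about it):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, discarding tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(fragment):
    """Reduce an RSS description containing embedded HTML to plain text."""
    extractor = _TextExtractor()
    extractor.feed(fragment)
    return "".join(extractor.parts).strip()
```

This keeps the visible text intact while dropping markup, which is usually good enough for reposting to the Fediverse; entities like `&amp;` are decoded by `HTMLParser` automatically.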
I'm also having problems with the Twitter API. My bots are no longer working and I can't make them work with an RSS source. I've found a Python scraper for Nitter called pnyter and I'm starting to explore it to see what I can do. I've created a Matrix channel in case anyone wants to join and exchange ideas: #pletomabot:matrix.org https://matrix.to/#/!DmKYBjBcZXoeKlRmMU:matrix.org?via=matrix.org
Hello @AltGrCarlos, the main problem here is that the creator of this bot is not active at the moment to make the necessary fixes. I would say 75% of the code is still working, and this bot does more than a simple scraper: it also updates the user profile, which is great, and posts tweets to Mastodon. So the part needing a fix is the scraping-from-Twitter part; everything else can be kept as is. If you could create a fork with an updated and functional scraper, that would be great.
PS: I don't know about Matrix; in terms of live chat, Discord is more widely used.
Hi. I'm not a really good programmer, but I'm trying to understand the code before making any modifications. I'm also trying to develop my own Nitter scraper in order to get the specific information I need from Twitter.
While waiting for a fix to make pleroma-bot work again, I have set up Twoot (previously mentioned) as a replacement. It's working fine without an API key.
After last week's API changes broke Nitter, I'm now using https://github.com/12joan/twitter-client to generate RSS for stork.
Just so you know, stork is still working and still useful. :-)
It'd be nice if I could find some way to get profile updates happening while still getting the tweets from RSS; I'll post here again if I figure out a way to do that.
Hey, sorry for being a lot less active.
I've been moving across countries during the last 6 months and between all the logistics and bureaucracy involved (getting a visa, a work permit, finding an apartment, packing, etc.) in addition to keeping a day job, it basically left little to no time to do anything else.
I'm glad this project was still somewhat useful for some of you during that time with the scraping functionality implementation still pending. My intention is to get back to it and try to make it work in the current state of affairs. Thank you all for sharing the different projects you've found success with, I'll take a look at their approach and see what works and doesn't at the moment.
Got profile info and pinned tweet gathering working. https://github.com/robertoszek/pleroma-bot/commit/c96943e6e6cdcd4725d118589686d5282197265e The user timeline scraping seems a lot more involved, requiring "guest accounts".
These guest accounts seem to be restricted by IP, so only a limited number can be created from the same host/IP.
I'm thinking about adding a flag so they can be created easily on demand:
$ pleroma-bot --create-guest-account
with the results being dumped to guest_accounts.json, for example.
And if you have access to a list of proxies that could be used to generate more accounts at the same time, perhaps passing them as a text file (via a flag or in the config file):
$ pleroma-bot --create-guest-account --proxies-file my_proxies.txt
And of course, the bot would also need to try generating additional guest accounts in the middle of a run if it gets rate-limited. I need to think about it a bit more, but there's definitely some progress being made.
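As a sketch of what the proxies-file handling could look like (the file format, the helper names, and the guest_accounts.json layout are all assumptions at this stage, not the bot's actual implementation):

```python
import itertools
import json

def parse_proxies(text):
    """Parse a my_proxies.txt-style file: one proxy URL per line;
    blank lines and lines starting with '#' are ignored."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]

def proxy_cycle(proxies):
    """Round-robin over the proxy list so each new guest-account
    creation request can go out through a different address."""
    return itertools.cycle(proxies)

def save_guest_accounts(accounts, path="guest_accounts.json"):
    """Dump the generated guest accounts so later runs can reuse them
    instead of creating fresh ones every time (hypothetical layout)."""
    with open(path, "w") as f:
        json.dump(accounts, f, indent=2)
```

Cycling through proxies spreads account creation across IPs, which sidesteps the per-IP restriction mentioned above; the rate-limit-triggered regeneration would then just call `next()` on the cycle before each new attempt.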
I have ~50 accounts in my config and run pleroma-bot every 15 minutes against a Nitter RSS feed right now as a workaround. The guest accounts last for 30 days, and I've ended up needing ~6k guest accounts to keep it running the whole time without erroring out. I use geonode for proxies, FYI.
As long as you use a working Nitter instance as the source, you don't have to deal with guest accounts: they are used to scrape Twitter, while your bot is reading already-scraped content provided by Nitter. Well, when using the Nitter RSS feed, at least.
With these changes, a lot of Nitter instances have either turned off RSS or asked people not to scrape them. I run my own so I'm not eating up guest tokens from someone else; just keep that in mind. Generating guest tokens is extremely cheap on geonode, too.
Using RSS feeds as a source will hopefully continue to work after February 9th (if you can find a working Nitter instance, RSSHub, or some other third-party site that's still able to generate an RSS feed).
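For anyone going the RSS route, extracting items from a Nitter-style RSS 2.0 feed needs nothing beyond the standard library. A minimal sketch (Nitter serves feeds at /<user>/rss, but any RSS 2.0 feed parses the same way; the field selection here is just an example):

```python
import xml.etree.ElementTree as ET

def parse_rss_items(xml_text):
    """Extract title/link/pubDate dicts from an RSS 2.0 feed string,
    such as the one a Nitter instance serves for a user timeline."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", ""),
            "link": item.findtext("link", ""),
            "pubDate": item.findtext("pubDate", ""),
        })
    return items
```

A real-world wrapper would fetch the feed over HTTP and de-duplicate against already-posted links, but the parsing itself is this simple, which is part of why RSS is an attractive fallback while the API situation stays unstable.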