iSarabjitDhiman / TweeterPy

TweeterPy is a Python library to extract data from Twitter. The TweeterPy API lets you scrape data from a user's profile, such as the username, user ID, bio, followers/following lists, profile media, tweets, etc.
MIT License

Couldn't find guest token #61

Open zxj0302 opened 2 months ago

zxj0302 commented 2 months ago

Screenshot from 2024-05-20 06-41-47 I generate a new session with 'twitter.generate_session()' after every 50 calls of get_retweet() to avoid the 'rate limit exceeded' error. However, the error in the screenshot above still occurs sometimes, e.g. after 200-300 calls of get_retweet(). It stops me from scraping a large number of tweets at a time. I was wondering how I can fix this? Thanks a lot!

P.S.: I used a residential proxy with a different IP for each request.
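The rotation described here could be sketched roughly as follows. This is a minimal sketch, not part of TweeterPy: `fetch` and `make_session` are stand-ins for the `twitter.get_retweet` and `twitter.generate_session` calls mentioned in this thread.

```python
# Sketch of the rotation described above: refresh the guest session
# every `batch_size` calls so the guest token (and with it the
# rate-limit window) stays fresh. `fetch` stands in for a TweeterPy
# call such as twitter.get_retweet; `make_session` stands in for
# twitter.generate_session.
def scrape_with_rotation(fetch, items, make_session, batch_size=50):
    results = []
    for i, item in enumerate(items):
        if i % batch_size == 0:
            make_session()  # new session => fresh guest token
        results.append(fetch(item))
    return results
```

With the names from this thread the wiring might look like `scrape_with_rotation(twitter.get_retweet, tweet_ids, twitter.generate_session)` (hypothetical; adjust to the actual signatures).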

iSarabjitDhiman commented 2 months ago

Hey,

You have to instantiate a new Twitter session if the rate limit is exceeded. Of course you can wait for the rate limits to be renewed, but if you re-instantiate the session, it will bypass the limits. Take a look at this one: https://github.com/iSarabjitDhiman/TweeterPy/issues/20#issuecomment-1712035023

Let me know if it helps.

zxj0302 commented 2 months ago

Thank you! I re-instantiate frequently with twitter = TweeterPy(), but I still get the error in the screenshot sometimes. However, the frequency is acceptable for me.

iSarabjitDhiman commented 1 month ago

Hmm strange. I will take a look at the code shortly and see if there is any bug.

Oh, btw, I think datacenter proxies work too. Not sure, but you can try.

iSarabjitDhiman commented 1 month ago

Hey @zxj0302 Hopefully it's fixed in this commit f2e2535

I forgot to remove _DEFAULT_SESSION from the config.py module; it was such a silly mistake. But never mind, it's fixed now. Please let me know once you have tested it.

Thanks

Edit: Make sure to update the package before testing.

zxj0302 commented 1 month ago

> Hey @zxj0302 Hopefully it's fixed in this commit f2e2535
>
> I forgot to remove _DEFAULT_SESSION from the config.py module; it was such a silly mistake. But never mind, it's fixed now. Please let me know once you have tested it.
>
> Thanks
>
> Edit: Make sure to update the package before testing.

Good morning! Thank you for your work! I tested it with my code and still got a similar error: Screenshot from 2024-05-25 07-18-16

BTW, I have some questions that confuse me a lot:

  1. Is it possible to run it in parallel, e.g. using multiprocessing/futures and creating many TweeterPy() instances?
  2. When I use it to scrape tweets, it sometimes returns a structure with no 'result' even though the tweet is accessible and does have content. Sometimes I get len = 0 when I use the following code; if I print the 'tweet', it also contains no 'result' under 'tweetResult'. However, if I re-run the code, it may return the expected result for the same 'tweet_id'. This happens more frequently when I scrape in bulk, especially after scraping tens/hundreds of tweets. I was wondering why this happens: did something go wrong with my residential proxy, or did I hit an anti-crawler mechanism?

     tweet = scraper.get_tweet(tweet_id)
     len(tweet['data']['tweetResult'])

  3. I am using a dynamic residential proxy, and I was wondering whether I should set the proxy to change IP for each request, or use the same proxy IP for each session (= an instance of TweeterPy()?).

Thank you!!!❤️❤️❤️

iSarabjitDhiman commented 1 month ago

Hey,

  1. Yes, you can run it in parallel, but you need to create a new instance each time the rate limit is exceeded; you can keep using the same instance until then. You might need to keep an eye on the rate limits.
  2. Yes, again it's the rate limits. I understand that you are using a residential (rotating IP) proxy, but unique IPs don't bypass the rate limits. It's the guest token that resets the rate limits, so when you create a new instance, the rate limits get reset.
  3. Well, static IPs are better but more expensive than dynamic ones. If you don't mind getting your accounts banned, you can continue with dynamic ones; even datacenter proxies work. If you are using logged-in sessions, static proxies are better, but if you are using guest sessions for scraping, dynamic ones do the job.
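Point 1 could be sketched like this. It is an illustrative pattern, not TweeterPy's API: `make_client` is injected so the sketch stays library-independent, but in practice it would be the `TweeterPy` class, with each worker building its own instance (and therefore its own guest token).

```python
from concurrent.futures import ThreadPoolExecutor

# One client instance per worker task: a fresh instance means a fresh
# guest token, which is what actually resets the rate-limit window.
# `make_client` stands in for the TweeterPy class (or a factory that
# also configures a proxy); each client is assumed to expose get_tweet.
def scrape_parallel(make_client, tweet_ids, workers=4):
    def worker(tweet_id):
        client = make_client()
        return client.get_tweet(tweet_id)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of tweet_ids
        return list(pool.map(worker, tweet_ids))
```

A less wasteful variant would reuse one client per worker thread and only re-create it when the rate limit is hit, as the answer above suggests.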

Oh wait, I just checked the first picture you attached; it says SSL error. Btw, where did you get the proxies from?

I will fix it all shortly, please keep me posted.

Thanks.

zxj0302 commented 1 month ago

Thank you for your reply! For Q2, do you mean I hit the rate limit? Actually I re-instantiate TweeterPy before reaching the rate limit (50 requests in 15 mins). That's why I am confused that I get an unexpected return (no 'result' in 'tweetResult') occasionally, or frequently when scraping in bulk. For the proxy, I am using a dynamic residential proxy from Smartproxy or Roxlabs. It is cheap compared with others :smile: Thank you again! SALUTE! :saluting_face:

iSarabjitDhiman commented 1 month ago

Hey, can you test now? And please attach a screenshot of the error.

zxj0302 commented 1 month ago

@iSarabjitDhiman Yes, testing now; no SSL error or 'couldn't find guest token' error has occurred so far. However, I still got this error:

> Thank you for your reply! For Q2, do you mean I hit the rate limit? Actually I re-instantiate TweeterPy before reaching the rate limit (50 requests in 15 mins). That's why I am confused that I get an unexpected return (no 'result' in 'tweetResult') occasionally, or frequently when scraping in bulk. For the proxy, I am using a dynamic residential proxy from Smartproxy or Roxlabs. It is cheap compared with others 😄 Thank you again! SALUTE! 🫡

iSarabjitDhiman commented 1 month ago

Could u share the screenshot? BTW, I added a debug message in there https://github.com/iSarabjitDhiman/TweeterPy/blob/d6fd64ce509104aeb53e0d9ccd4c20b340a08022/tweeterpy/tweeterpy.py#L167

Please check your log file.

zxj0302 commented 1 month ago

Do you mean a screenshot of the log file, or something else?

iSarabjitDhiman commented 1 month ago

Yes, a screenshot of the log file should be fine; make sure to blur any sensitive data.

zxj0302 commented 1 month ago

tweeterpy.log The log file is attached.

iSarabjitDhiman commented 1 month ago

Hmmm I don't see the guest token error in there. Looks clean to me. Is this the correct log file?

zxj0302 commented 1 month ago

image If the last number in each row is the number of bytes received, then 52 bytes contains nothing, only {'data': {'tweetResult': {}}} I think.

iSarabjitDhiman commented 1 month ago

So the guest token error is gone?

zxj0302 commented 1 month ago

> Hmmm I don't see the guest token error in there. Looks clean to me. Is this the correct log file?

image Do you want the log that shows the 'guest token error', or the current log to check whether it is fixed? The file I sent is the current log, and I haven't had that error until now.

iSarabjitDhiman commented 1 month ago

The guest token error logs. I fixed the UnboundLocalError in #63, but I am looking for the "Couldn't find guest token" error you got initially.

zxj0302 commented 1 month ago

The log was overwritten 😅 I will send the log to you if I get the same error again.

zxj0302 commented 1 month ago

image This is what I scraped. I set 'tweet_type' to 'deleted' if there is only {'data': {'tweetResult': {}}} and no 'result' under 'tweetResult'. However, most of them are not actually deleted and do have content. I wonder why this happens.

I checked the log and no rate limit was reached, because I re-instantiate TweeterPy() every 40 requests and the limit is 50.

iSarabjitDhiman commented 1 month ago

Oh I see. I just tested it on my side. The reason it returns an empty dataset is that there is no tweet for the given tweet ID. Take this tweet ID for instance: 1327268169774374913. If I use the get_tweet() method, it returns {'data': {'tweetResult': {}}}.

But if I first log in with the login() method and then try to fetch the same tweet ID, it gives us some extra useful information.

In my case it throws this error:

Exception: _Missing: No status found with that ID.

It seems Twitter doesn't give guest users any meaningful error messages, but if you are logged in, you get those messages/warnings/errors.

Let me know if there is still any confusion.

zxj0302 commented 1 month ago

Yes, I know that can happen, but only a small number of them really cannot be accessed or are deleted. For example, row 1968: I can get the tweet without logging in, as the picture shows. image However, I still got an empty dataset. I checked many IDs manually and lots of them have content and are accessible, but they still come back empty. However, if I run get_retweet() once more for the same ID, I get the expected non-empty return. So confusing.
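Since a re-run often succeeds, one possible workaround is to retry before marking a tweet as deleted. This is a sketch, not part of TweeterPy: `fetch` stands in for `scraper.get_tweet`.

```python
# Retry on the transient empty payload described above: a guest
# request can come back as {'data': {'tweetResult': {}}} with no
# 'result' key even though the tweet exists, so fetch again before
# treating the tweet as deleted. `fetch` stands in for
# scraper.get_tweet.
def get_tweet_with_retry(fetch, tweet_id, attempts=3):
    payload = {}
    for _ in range(attempts):
        payload = fetch(tweet_id)
        if payload.get("data", {}).get("tweetResult", {}).get("result"):
            break  # got a real tweet body; stop retrying
    return payload
```

A variant could also re-instantiate the client between attempts, matching the session-rotation advice earlier in this thread.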

zxj0302 commented 1 month ago

Oh, the 'couldn't find guest token' error occurred again. Details are at the end of the attached log file. tweeterpy.log

iSarabjitDhiman commented 1 month ago

You haven't updated the package; that is why you are still getting the guest token error. About the missing tweet data, please log in and try to fetch that particular tweet you attached above and you will get the error message. Most probably the error is: Exception: _Missing: No status found with that ID. Let me know how it goes.

zxj0302 commented 1 month ago

Hi!

  1. I updated to version 1.1.4.

image

And I still found errors like this in the log file:

image

The log file attached:

tweeterpy.zip

You can search for '[ERROR]' at the end of the log.

  2. I think the 'Exception: _Missing: No status found with that ID.' error is only for tweets that really cannot be accessed. In my code, I got this: image Take ID '1337575121809272836' for example: I can view this tweet in my browser: image I can also scrape it with this simple code without logging in: image However, when I scrape tweets in bulk, I always get empty data; as you can see, many tweets get marked as 'deleted'. I think my code has no bug in this part. That's why I am so confused and don't know whether I hit an anti-crawler mechanism even though I used a proxy. test_code.zip

iSarabjitDhiman commented 1 month ago

I believe the SSL error is due to the proxies. Do you get this error without proxies? I will look into the "empty response" part soon.

zxj0302 commented 1 month ago

Thank you for your reply! Haven't you slept, bro? It is 5:41 am now :timer_clock: (maybe 3:41 am in your time zone). It is hard to say whether the SSL error occurs without a proxy; I cannot scrape in bulk without one, so it may not be easy to reproduce. Yes, the empty response part is more important. After scraping 1000 tweets, almost all tweets come back empty; for the first 1000, the empty responses only happen occasionally.

iSarabjitDhiman commented 1 month ago

Hey man,

Please test tweeterpy-1.1.5-py3-none-any.zip; this should solve your problem. Extract the zip file, make sure the .whl file is in your current working directory, and install it:

```shell
pip install tweeterpy-1.1.5-py3-none-any.whl
```

```python
# After you install, don't use the config file. Just pass the proxy
# directly into the constructor while creating an instance.
from tweeterpy import TweeterPy

twitter = TweeterPy(proxy={"http": proxy_here, "https": proxy_here})
```

Oh yeah, and don't worry, I sleep late at night. I am a full-time freelancer, so I am kind of used to it.

Let me know how it goes.

zxj0302 commented 1 month ago

Hi bro, thank you! :heart: Tested. It is amazing: the empty rate is much lower, although some non-empty tweets still get an empty result sometimes. It is magic!! I am wondering how you found the reason and what you did to improve it. The log file and the scraped tweets (so you can check correctness) are attached. version1.1.5-1.zip

iSarabjitDhiman commented 1 month ago

> Hey @zxj0302 Hopefully it's fixed in this commit f2e2535
>
> I forgot to remove _DEFAULT_SESSION from the config.py module; it was such a silly mistake. But never mind, it's fixed now. Please let me know once you have tested it.
>
> Thanks
>
> Edit: Make sure to update the package before testing.

Hey @zxj0302 So here is what happened. Remember that I was initially using a default session for some reason. In fact, I had barely touched the config.py module since I started this project. I built this tool from a "one user at a time" perspective and put some global settings in the config.py module. But now people are using it for different purposes, and obviously with multi-threading, so the global settings get overwritten every time a new instance is created. In your case, all the instances were using the most recently initialized instance's settings, i.e. its proxy. Well, now I have work to do to make the tool work with concurrency (multi-threading).
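The bug described here can be illustrated with a toy version of the shared-settings problem. This is hypothetical code, not TweeterPy's actual config.py:

```python
# Hypothetical illustration of the bug described above: settings kept
# at class/module level are shared, so every new instance silently
# overwrites the proxy that earlier instances were relying on.
class _Config:
    PROXY = None  # global, shared by every instance


class Client:
    def __init__(self, proxy=None):
        _Config.PROXY = proxy  # clobbers the setting for everyone

    def current_proxy(self):
        return _Config.PROXY


a = Client(proxy="proxy-a")
b = Client(proxy="proxy-b")
# 'a' now effectively uses proxy-b as well
```

Passing the proxy into the constructor and storing it on the instance (as the 1.1.5 build above does) avoids this, because each instance then carries its own setting.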

Thanks for the update btw.

I will release a new build soon.

zxj0302 commented 1 month ago

Thank you for your reply and great work! Hoping to see a new version with no empty results for non-empty tweets, and a parallel version. SALUTE! :saluting_face: