iSarabjitDhiman / TweeterPy

TweeterPy is a python library to extract data from Twitter. TweeterPy API lets you scrape data from a user's profile like username, userid, bio, followers/followings list, profile media, tweets, etc.
MIT License
145 stars, 20 forks

[Enhance] Data class for each type #32

Closed fpmirabile closed 1 year ago

fpmirabile commented 1 year ago

It'd be a good idea to have library-native data classes, for example User or Tweet.

That would help users avoid drowning in the heavy payloads the GraphQL endpoints return.

For example I've been working on these two:

from dataclasses import dataclass

@dataclass
class TwitterTweet:
    rest_id: str
    user_id: str
    full_text: str
    created_at: str
    retweet_count: int
    favorite_count: int
    reply_count: int
    lang: str
    in_reply_to_status_id_str: str
    hashtags: list[str]
    user_mentions: list[str]
    urls: list[str]
    sentiment: str = ''

    @classmethod
    def from_payload(cls, payload):
        legacy_data = payload.get('legacy', {})
        entities_data = legacy_data.get('entities', {})
        return cls(rest_id=payload.get('rest_id', ''),
                   user_id=legacy_data.get('user_id_str', ''),
                   full_text=legacy_data.get('full_text', ''),
                   created_at=legacy_data.get('created_at', ''),
                   retweet_count=legacy_data.get('retweet_count', 0),
                   favorite_count=legacy_data.get('favorite_count', 0),
                   reply_count=legacy_data.get('reply_count', 0),
                   lang=legacy_data.get('lang', ''),
                   in_reply_to_status_id_str=legacy_data.get(
                       'in_reply_to_status_id_str', ''),
                   hashtags=[
                       tag['text']
                       for tag in entities_data.get('hashtags', [])
                   ],
                   user_mentions=[
                       mention['id_str']
                       for mention in entities_data.get('user_mentions', [])
                   ],
                   urls=[
                       url['expanded_url']
                       for url in entities_data.get('urls', [])
                   ])

and

from dataclasses import dataclass

@dataclass
class TwitterUser:
    id: str
    name: str
    screen_name: str
    statuses_count: int
    followers_count: int
    friends_count: int
    favourites_count: int
    listed_count: int
    default_profile: bool
    default_profile_image: bool
    location: str
    description: str
    description_has_url: bool
    description_url: str
    followers_to_following_ratio: float
    verified_type: str
    verified: bool
    is_blue_verified: bool
    has_graduated_access: bool
    can_dm: bool
    media_count: int
    has_custom_timelines: bool
    has_verification_info: bool
    possibly_sensitive: bool

    @classmethod
    def from_payload(cls, payload):
        legacy_data = payload.get('legacy', {})
        urls = [
            url['expanded_url'] for url in legacy_data.get('entities', {}).get(
                'description', {}).get('urls', [])
        ]
        followers_count = legacy_data.get('followers_count', 0)
        friends_count = legacy_data.get('friends_count', 0)

        return cls(
            id=payload.get('rest_id', ''),
            name=legacy_data.get('name', ''),
            screen_name=legacy_data.get('screen_name', ''),
            statuses_count=legacy_data.get('statuses_count', 0),
            followers_count=followers_count,
            friends_count=friends_count,
            favourites_count=legacy_data.get('favourites_count', 0),
            listed_count=legacy_data.get('listed_count', 0),
            default_profile=legacy_data.get('default_profile', False),
            default_profile_image=legacy_data.get('default_profile_image',
                                                  False),
            location=legacy_data.get('location', ''),
            description=legacy_data.get('description', ''),
            description_has_url=bool(urls),
            description_url=','.join(urls) if urls else '',
            followers_to_following_ratio=(followers_count / friends_count
                                          if friends_count else 0.0),
            verified_type=legacy_data.get('verified_type', ''),
            verified=legacy_data.get('verified', False),
            is_blue_verified=payload.get('is_blue_verified', False),
            has_graduated_access=payload.get('has_graduated_access', False),
            can_dm=legacy_data.get('can_dm', False),
            media_count=legacy_data.get('media_count', 0),
            has_custom_timelines=legacy_data.get('has_custom_timelines',
                                                 False),
            # Coerce to bool so the value matches the field's declared type.
            has_verification_info=bool(payload.get('verification_info')),
            possibly_sensitive=legacy_data.get('possibly_sensitive', False),
        )

iSarabjitDhiman commented 1 year ago

Hey @fpmirabile, it looks great. I will add these data classes soon. Here are a few things I might add to this:

And then we will use the data classes to get the data as you suggested.

Feel free to share your thoughts.

fpmirabile commented 1 year ago

Makes sense. I just copy/pasted mine as a suggestion, since they helped me clean up a lot of stuff. Thanks @iSarabjitDhiman

iSarabjitDhiman commented 1 year ago

Yeah, I understand the pain of digging data out of those nested datasets. Your solution is great and time-saving. I just need to add some way to accept keys as a list, then I will integrate it with the data classes. I will implement it soon.

Thanks for the idea. ✌️

codilau commented 1 year ago

I agree, it's helpful to use objects for the tweets and users.

iSarabjitDhiman commented 1 year ago

Hey @fpmirabile, @codilau

So I have been working on these dataclasses. I need a suggestion:

I am planning to call them User and Tweet instead of TwitterUser and TwitterTweet.

How would you like it to work?

from tweeterpy import TweeterPy
from tweeterpy.util import User, Tweet

twitter = TweeterPy()
user_data = twitter.get_user_data("elonmusk")
elon_musk = User(user_data)
# or we can do it in a single step
user = User(twitter.get_user_data("elonmusk"))

Or the second way, as @fpmirabile did, with the from_payload method:

from tweeterpy import TweeterPy
from tweeterpy.util import User, Tweet

twitter = TweeterPy()
user_data = twitter.get_user_data("elonmusk")
elon_musk = User.from_payload(user_data)

So should it be the first way, User(dataset), or the second way, User.from_payload(dataset)?

The same goes for Tweet: Tweet.from_payload(dataset) or Tweet(dataset)?

Let me know your thoughts. I am open to suggestions.

Thanks for the idea, @fpmirabile. Otherwise I wouldn't even have bothered to implement dataclasses. I thought users would navigate through the whole dataset themselves with the find_nested_key function, because some might need certain datapoints and others might not. But you are right, it's useful to provide some sort of pre-built template that returns at least the basic datapoints everyone is interested in.
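
To illustrate the kind of nested lookup discussed here, including the idea of accepting keys as a list, below is a minimal sketch. The helper name find_keys and its return shape are hypothetical; this is not TweeterPy's find_nested_key, whose signature is not shown in this thread. The sample payload only mimics the rest_id / legacy layout used in the snippets above.

```python
from typing import Any, Dict, Iterable, List, Union


def find_keys(data: Any, keys: Union[str, Iterable[str]]) -> Dict[str, List[Any]]:
    """Hypothetical helper: collect every value stored under the given key(s), at any depth."""
    wanted = {keys} if isinstance(keys, str) else set(keys)
    found: Dict[str, List[Any]] = {key: [] for key in wanted}

    def walk(node: Any) -> None:
        # Descend into dicts and lists; any other value is a leaf.
        if isinstance(node, dict):
            for key, value in node.items():
                if key in wanted:
                    found[key].append(value)
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(data)
    return found


# Made-up payload for illustration only.
sample = {
    "rest_id": "1",
    "legacy": {
        "screen_name": "demo",
        "entities": {"urls": [{"expanded_url": "https://example.com"}]},
    },
}
result = find_keys(sample, ["screen_name", "expanded_url"])
```

A dataclass can then be filled from the collected values instead of chaining .get() calls by hand.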

fpmirabile commented 1 year ago

You're welcome, @iSarabjitDhiman. It's the least I can do, since you are saving me on my final engineering project! It's the same for me; you could keep both (with the constructor calling the inner method). I think it is more about how you feel the library should work than about how we interact with it.

Summarizing, I'm ok with both!
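
One way to keep both call styles, sketched under assumptions: the field set is trimmed to three entries, the _parse helper is a hypothetical name, and the payload layout (rest_id at the top level, everything else under legacy) is taken from the snippets earlier in this thread. It is not the implementation that ended up in the library.

```python
from dataclasses import InitVar, dataclass
from typing import Any, Dict, Optional


@dataclass
class User:
    # The raw payload is an init-only argument: it is parsed, not stored as a field.
    payload: InitVar[Optional[Dict[str, Any]]] = None
    id: str = ""
    screen_name: str = ""
    followers_count: int = 0

    def __post_init__(self, payload: Optional[Dict[str, Any]]) -> None:
        # User(payload) ends up here and delegates to the shared parser.
        if payload is not None:
            for name, value in self._parse(payload).items():
                setattr(self, name, value)

    @staticmethod
    def _parse(payload: Dict[str, Any]) -> Dict[str, Any]:
        # Hypothetical helper that flattens the nested payload.
        legacy = payload.get("legacy", {})
        return {
            "id": payload.get("rest_id", ""),
            "screen_name": legacy.get("screen_name", ""),
            "followers_count": legacy.get("followers_count", 0),
        }

    @classmethod
    def from_payload(cls, payload: Dict[str, Any]) -> "User":
        # User.from_payload(payload) is an explicit alias for User(payload).
        return cls(payload)


# Made-up payload for illustration only.
sample = {"rest_id": "1", "legacy": {"screen_name": "demo", "followers_count": 5}}
first_way = User(sample)
second_way = User.from_payload(sample)
```

Both spellings go through the same parsing code, so they cannot drift apart, and User() with no arguments still yields an object with default values.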

codilau commented 1 year ago

@iSarabjitDhiman, it's the same for me, it's useful however you choose to implement it.

iSarabjitDhiman commented 1 year ago

Hey @fpmirabile @codilau, I just added those two dataclasses in the most recent commit, 03ce45d. Feel free to test them out and let me know if any changes are needed.

Check the docs here. They're in the util.py module, btw.

Edit : Assuming everything is working as intended, I am closing this issue now.

python502 commented 1 year ago

I found an error. How can I get past it? The line description_urls: list[dict] = field(default_factory=list) raises TypeError: 'type' object is not subscriptable

iSarabjitDhiman commented 1 year ago

Hey @python502, could you please give me some details? Which type of tweets are you passing? Tweets from a profile? An individual tweet? Media tweets?

Just tell me which method you used to fetch the tweet data:

get_user_tweets, get_tweet, or some other?

Let me know, thanks.

Edit: Already fixed in https://github.com/iSarabjitDhiman/TweeterPy/commit/b97c8c32f28c995eeb0fcc7b97e1610aace77ecb

Duplicate: #34
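
For anyone hitting the same TypeError: on Python versions before 3.9 the built-in list and dict types cannot be subscripted, so an annotation such as list[dict] fails as soon as the class body is evaluated. A minimal sketch of a portable spelling using typing.List and typing.Dict follows; the class name UserSketch is hypothetical, and this is not necessarily how the linked commit resolved it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UserSketch:
    # typing.List / typing.Dict are subscriptable on older interpreters too.
    description_urls: List[Dict] = field(default_factory=list)


first = UserSketch()
second = UserSketch()
first.description_urls.append({"expanded_url": "https://example.com"})
```

Because default_factory builds a fresh list per instance, appending to one object leaves the other untouched.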