taspinar / twitterscraper

Scrape Twitter for Tweets
MIT License

Add tweet attributes about links, reply, media #231

Closed nukopy closed 4 years ago

nukopy commented 4 years ago

Thank you for your great scraping library!! I'm a heavy user of twitterscraper! I always use a modified version of your library in my local environment, so I would like to offer its useful functions.

Modification

I made the following three changes:

  1. add new tweet attributes about links, media and replies
  2. delete some RT attributes because they don't work
  3. refactor for readability

New tweet attributes

Some previously unused information about links, media and replies is now extracted from each tweet and added to the attributes of Tweet (in twitterscraper/tweet.py).
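As a rough illustration of this kind of extraction, here is a minimal sketch. It is not the code from this PR: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies, and the markup it looks for (the `twitter-hashtag` class and the `data-expanded-url` attribute) is an assumption about the legacy Twitter HTML. The helper names `TweetExtrasParser` and `extract_extras` are hypothetical.

```python
from html.parser import HTMLParser


class TweetExtrasParser(HTMLParser):
    """Collect links, hashtags and image URLs from one tweet's HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.hashtags = []
        self.img_urls = []
        self._in_hashtag = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "twitter-hashtag" in classes:
            # Assumed markup: hashtag anchors carry the "twitter-hashtag" class.
            self._in_hashtag = True
            self.hashtags.append("")
        elif tag == "a" and attrs.get("data-expanded-url"):
            # Assumed markup: the full link target is in "data-expanded-url".
            self.links.append(attrs["data-expanded-url"])
        elif tag == "img" and attrs.get("src"):
            self.img_urls.append(attrs["src"])

    def handle_data(self, data):
        # Text inside a hashtag anchor may be split across child tags.
        if self._in_hashtag:
            self.hashtags[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_hashtag = False


def extract_extras(tweet_html):
    """Return the extra attributes as a plain dict (hypothetical helper)."""
    parser = TweetExtrasParser()
    parser.feed(tweet_html)
    parser.close()
    return {
        "links": parser.links,
        "hashtags": [tag.lstrip("#") for tag in parser.hashtags],
        "img_urls": parser.img_urls,
        "is_media": bool(parser.img_urls),
    }


# Made-up fragment for illustration only.
sample = (
    '<div class="tweet"><p>Read '
    '<a data-expanded-url="https://example.com/post" href="https://t.co/x">'
    "example.com/post</a> "
    '<a class="twitter-hashtag" href="/hashtag/python"><s>#</s><b>python</b></a>'
    '</p><img src="https://example.com/a.jpg"></div>'
)
extras = extract_extras(sample)
```

In the library itself, the equivalent lookups would be done on the BeautifulSoup object for the tweet inside Tweet.from_soup().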

New attributes are about:

Delete some RT attributes

I deleted some RT attributes:

These were extracted from tweet_div in Tweet.from_soup(). However, the tweet_div HTML attributes "data-retweet-id" and "data-retweeter", and the a tag with class "pretty-link js-user-profile-link", did not work well. Therefore, these attributes should be deleted.

Refactoring for readability

This part is a small change, but an important one.

The part that passes arguments to the Tweet constructor (around __init__ of Tweet) was hard to read and hard to modify. Therefore, I grouped these arguments by content type and labelled each group with a comment.
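The idea of grouping constructor arguments by content type can be sketched as follows. This is only an illustration, not the code in the diff: the `Tweet` dataclass is a simplified stand-in, `from_raw` is a hypothetical substitute for Tweet.from_soup(), and the field names other than `img_urls` are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tweet:
    # user
    username: str
    # tweet basics
    tweet_id: str
    text: str
    # links and media
    links: List[str] = field(default_factory=list)
    img_urls: List[str] = field(default_factory=list)
    # reply
    is_reply: bool = False

    @classmethod
    def from_raw(cls, raw):
        """Build a Tweet from a dict of already extracted values."""
        return cls(
            # --- user ---
            username=raw["username"],
            # --- tweet basics ---
            tweet_id=raw["tweet_id"],
            text=raw["text"],
            # --- links and media ---
            links=raw.get("links", []),
            img_urls=raw.get("img_urls", []),
            # --- reply ---
            is_reply=raw.get("is_reply", False),
        )


tweet = Tweet.from_raw({"username": "alice", "tweet_id": "1", "text": "hello"})
```

With the groups labelled, adding a new attribute means touching one clearly marked block in the field list and one in the constructor call.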

End heading

I think my modifications are very useful, at least in my local environment, so they should be useful for twitterscraper too!!

I really appreciate your effort. I hope I can be of some help to you.

taspinar commented 4 years ago

@nukopy Thank you for this useful addition! I think it partly overlaps with some features which are added by https://github.com/taspinar/twitterscraper/pull/210 .

I plan to merge that PR since it was submitted earlier. So can you remove code related to "Tweet.img_urls" and "Tweet.is_media" attributes (if you also agree it is the same)?

Plus also here, don't you want to write the retrieved information to the output file?

nukopy commented 4 years ago

> I think it partly overlaps with some features which are added by #210 . I plan to merge that PR since it was submitted earlier. So can you remove code related to "Tweet.img_urls" and "Tweet.is_media" attributes (if you also agree it is the same)?

I've just looked at PR #210. It's almost the same as mine, but it differs in whether a separate BeautifulSoup object (bs4 object) holding only the images is defined (soup_imgs at line 77 in my diff).

I added it because it would make debugging easier, though it increases memory usage. What do you think about that?

> Plus also here, don't you want to write the retrieved information to the output file?

Sorry, I forgot that! I'll add it soon.

I'd like to ask you for confirmation: do you agree with my refactoring (the added comments)? I'm not confident about it because I have little experience with refactoring. Sorry for cluttering the git diff.

twollnik commented 4 years ago

This PR is awesome! As a user of twitterscraper I prefer this PR to #210 because it adds more features (e.g. the information about replies). Some of the added features (like the hashtags list) I am already calculating myself. So if this PR is merged I would have to do less processing myself, which is always good.

@nukopy If you want I can review your code.

nukopy commented 4 years ago

@taspinar I now write the retrieved information to the output CSV file, and I made a few other small changes. The changes are simple:

Please check them.
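For readers who want to see what writing such attributes to CSV can look like, here is a minimal sketch using only the standard library. It is an assumption about the general approach, not the code in this PR: `write_tweets_csv` is a hypothetical helper, tweets are represented as plain dicts, and encoding list-valued attributes as JSON is one possible choice among several.

```python
import csv
import io
import json


def write_tweets_csv(tweets, fileobj, fieldnames):
    """Write tweet dicts to fileobj as CSV, one row per tweet."""
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    for tweet in tweets:
        row = {}
        for name in fieldnames:
            value = tweet.get(name, "")
            # A list does not fit naturally in one CSV cell, so store it as JSON.
            if isinstance(value, (list, tuple)):
                value = json.dumps(list(value))
            row[name] = value
        writer.writerow(row)


# Made-up data for illustration only.
tweets = [
    {"tweet_id": "1", "text": "hello", "img_urls": ["https://example.com/a.jpg"]},
    {"tweet_id": "2", "text": "no media", "img_urls": []},
]
buffer = io.StringIO()
write_tweets_csv(tweets, buffer, ["tweet_id", "text", "img_urls"])

# Reading the file back: list cells are decoded with json.loads.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
first_img_urls = json.loads(rows[0]["img_urls"])
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=""`.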

@twollnik I'm glad to hear your opinion. I would appreciate it if you could review my code!

taspinar commented 4 years ago

This PR keeps getting better still :) Good comments by @twollnik

taspinar commented 4 years ago

I think I will merge this PR for now and release a new version, 1.4.0, of twitterscraper. Good work, guys!