taspinar / twitterscraper

Scrape Twitter for Tweets
MIT License
2.4k stars 581 forks

Adds is_retweet and retweeter related information #186

Closed kanihal closed 5 years ago

kanihal commented 5 years ago

Additional commits

taspinar commented 5 years ago

@kanihal This seems like a very useful addition, but I am a little surprised because I have never seen the 'data-retweet-id' attribute before. Can you give an example of it in practice (on the Twitter website)?

kanihal commented 5 years ago

Consider the Twitter page of the Stanford NLP group - https://twitter.com/stanfordnlp There they have retweeted a tweet with id=1139508286418386944 from Victor Zhong (hllo_wrld). Link - https://twitter.com/hllo_wrld/status/1139508286418386944

On the stanfordnlp page, if you search for hllo_wrld and inspect the retweeted element (currently the 2nd tweet from the top of their timeline), you can see the data-retweet-id and data-retweeter attributes on the div class="tweet ..." element:

<div class="tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable dismissible-content original-tweet js-original-tweet tweet-has-context has-cards has-content MemexAdded" data-tweet-id="1139508286418386944" data-item-id="1139508286418386944" data-permalink-path="/hllo_wrld/status/1139508286418386944" data-conversation-id="1139508286418386944" data-tweet-nonce="1139508286418386944-f3556e9a-126c-4200-b398-3271a5c367f4" data-tweet-stat-initialized="true" data-retweet-id="1140033153106509825" data-retweeter="stanfordnlp" data-screen-name="hllo_wrld" data-name="Victor Zhong" data-user-id="257287707" data-you-follow="false" data-follows-you="false" data-you-block="false" data-tagged="hllo_wrld LukeZettlemoyer uwnlp" data-reply-to-users-json="[{&quot;id_str&quot;:&quot;257287707&quot;,&quot;screen_name&quot;:&quot;hllo_wrld&quot;,&quot;name&quot;:&quot;Victor Zhong&quot;,&quot;emojified_name&quot;:{&quot;text&quot;:&quot;Victor Zhong&quot;,&quot;emojified_text_as_html&quot;:&quot;Victor Zhong&quot;}},{&quot;id_str&quot;:&quot;118263124&quot;,&quot;screen_name&quot;:&quot;stanfordnlp&quot;,&quot;name&quot;:&quot;Stanford NLP Group&quot;,&quot;emojified_name&quot;:{&quot;text&quot;:&quot;Stanford NLP Group&quot;,&quot;emojified_text_as_html&quot;:&quot;Stanford NLP Group&quot;}},{&quot;id_str&quot;:&quot;3741979273&quot;,&quot;screen_name&quot;:&quot;LukeZettlemoyer&quot;,&quot;name&quot;:&quot;Luke Zettlemoyer&quot;,&quot;emojified_name&quot;:{&quot;text&quot;:&quot;Luke Zettlemoyer&quot;,&quot;emojified_text_as_html&quot;:&quot;Luke Zettlemoyer&quot;}},{&quot;id_str&quot;:&quot;3716745856&quot;,&quot;screen_name&quot;:&quot;uwnlp&quot;,&quot;name&quot;:&quot;UW NLP&quot;,&quot;emojified_name&quot;:{&quot;text&quot;:&quot;UW NLP&quot;,&quot;emojified_text_as_html&quot;:&quot;UW NLP&quot;}}]" data-disclosure-type="" data-has-cards="true">
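To make the attributes in the element above concrete, here is a minimal sketch of how they could be read. It uses only the Python standard library (`html.parser`) and is not the code from this PR. The class name `RetweetAttrParser`, the helper `extract_retweet_info`, and the output keys are hypothetical names chosen for illustration; the sample HTML is a shortened copy of the element above.

```python
from html.parser import HTMLParser


class RetweetAttrParser(HTMLParser):
    """Collects retweet-related data-* attributes from tweet <div> elements.

    Hypothetical helper for illustration; not part of twitterscraper.
    """

    def __init__(self):
        super().__init__()
        self.tweets = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Only look at divs that carry a tweet id.
        if tag != "div" or "data-tweet-id" not in attr_map:
            return
        retweet_id = attr_map.get("data-retweet-id")
        self.tweets.append({
            "tweet_id": attr_map["data-tweet-id"],
            "screen_name": attr_map.get("data-screen-name"),
            # Assumption: a tweet is a retweet exactly when data-retweet-id is present.
            "is_retweet": retweet_id is not None,
            "retweet_id": retweet_id,
            "retweeter": attr_map.get("data-retweeter"),
        })


def extract_retweet_info(html):
    """Return one dict per tweet <div> found in the given HTML string."""
    parser = RetweetAttrParser()
    parser.feed(html)
    parser.close()
    return parser.tweets


# Shortened version of the element shown above.
sample = (
    '<div class="tweet js-stream-tweet" '
    'data-tweet-id="1139508286418386944" '
    'data-retweet-id="1140033153106509825" '
    'data-retweeter="stanfordnlp" '
    'data-screen-name="hllo_wrld"></div>'
)
info = extract_retweet_info(sample)[0]
print(info["is_retweet"], info["retweeter"], info["screen_name"])
```

For a tweet div without `data-retweet-id`, the sketch reports `is_retweet` as `False` and leaves `retweet_id` and `retweeter` as `None`.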

Here

Use cases:

taspinar commented 5 years ago

@kanihal Thank you for the information. I think this will be a very useful addition to twitterscraper.

The reason I have not merged it yet is that it seems to work only in combination with the --user argument, i.e. when you are scraping tweets from a user profile page. When you search for tweets in the regular way, the additional retweeter information is not provided, so the output will contain mostly or only empty values for these additional fields.

So I am thinking it would be better if these additional retweet values were only provided in combination with the --user argument. But is it better to merge this PR for now and make the changes in a new PR, or to incorporate these changes in this PR? What do you think?

kanihal commented 5 years ago

Yes, it makes sense to process retweet-related information only with the --user option. I can send an additional PR that does this.
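The behaviour agreed on above could look roughly like the sketch below. It is a sketch under stated assumptions, not the follow-up PR. Only the `--user` flag comes from the discussion; `build_parser`, `to_record`, `RETWEET_FIELDS`, and the field names are hypothetical and do not reflect twitterscraper's actual internals.

```python
import argparse

# Hypothetical names for the retweet-related output fields.
RETWEET_FIELDS = ("is_retweet", "retweet_id", "retweeter")


def build_parser():
    """Minimal stand-in CLI with a boolean --user flag (illustrative only)."""
    parser = argparse.ArgumentParser(description="retweet-field gating sketch")
    parser.add_argument("query", help="search query or user name")
    parser.add_argument("--user", action="store_true",
                        help="scrape a user profile page instead of searching")
    return parser


def to_record(tweet, user_mode):
    """Copy the tweet dict, dropping retweet fields unless in user mode."""
    record = dict(tweet)
    if not user_mode:
        # Regular search results do not carry retweeter data,
        # so omit the fields rather than emit empty values.
        for field in RETWEET_FIELDS:
            record.pop(field, None)
    return record


tweet = {
    "tweet_id": "1139508286418386944",
    "screen_name": "hllo_wrld",
    "is_retweet": True,
    "retweet_id": "1140033153106509825",
    "retweeter": "stanfordnlp",
}

user_args = build_parser().parse_args(["stanfordnlp", "--user"])
search_args = build_parser().parse_args(["nlp"])

user_record = to_record(tweet, user_args.user)
search_record = to_record(tweet, search_args.user)
print(sorted(user_record))
print(sorted(search_record))
```

Dropping the fields entirely in search mode, instead of writing empty values, keeps the regular output unchanged for existing users.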