the-convocation / twitter-scraper

A port of n0madic/twitter-scraper to Node.js.
https://the-convocation.github.io/twitter-scraper/
MIT License

Missing tweets with `getTweetsByUserId` or `getTweets` #73

Closed: Faareoh closed this issue 6 months ago

Faareoh commented 6 months ago

Hi!

I've been using twitter-scraper for several months and I'm currently working on the refactor of a project that uses it.

I hadn't realized this before, but it seems that accounts with the company checkmark don't return all the tweets that are normally visible on the profile.

I'm mainly scraping accounts belonging to the company Ankama, and one of its accounts has an enterprise checkmark: DOFUSfr

When I visit the profile directly on the website, logged in with the same account I use to scrape, I see several tweets, the latest 5 of which are: 1760349049813606561, 1760308395012165969, 1760329684535951500, 1760316001764159772, 1760265658149879889

When I scrape the latest 5 tweets, whether I authenticate via login or via cookies, I get the following tweets: 1760349049813606561, 1760316001764159772, 1760265658149879889, 1760251324468244641, 1760222882540425580

Tweets 1760308395012165969 and 1760329684535951500 are missing from the scraper's results. Note that the most recent tweets always seem to be present, which is why I never noticed the problem until today, when I started refactoring the project.
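To make the comparison reproducible, here is a minimal sketch that diffs the two ID lists quoted above. The helper name `missingIds` and the variable names are hypothetical, not part of twitter-scraper. IDs are kept as strings because they are larger than `Number.MAX_SAFE_INTEGER` and would lose precision as numbers.

```typescript
// Tweet IDs are larger than Number.MAX_SAFE_INTEGER, so they are kept as strings.
const visibleOnProfile: string[] = [
  '1760349049813606561',
  '1760308395012165969',
  '1760329684535951500',
  '1760316001764159772',
  '1760265658149879889',
];

const returnedByScraper: string[] = [
  '1760349049813606561',
  '1760316001764159772',
  '1760265658149879889',
  '1760251324468244641',
  '1760222882540425580',
];

// Hypothetical helper: IDs present in `expected` but absent from `actual`,
// in the order they appear in `expected`.
function missingIds(expected: string[], actual: string[]): string[] {
  const seen = new Set(actual);
  return expected.filter((id) => !seen.has(id));
}

const missing = missingIds(visibleOnProfile, returnedByScraper);
console.log(missing); // [ '1760308395012165969', '1760329684535951500' ]
```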

Here's the very simple code I've put in place to reproduce this issue:

import { Scraper } from '@the-convocation/twitter-scraper';

// COOKIES holds the session cookies of the scraping account; it is defined elsewhere in the project.
export default class TwitterScrapper extends Scraper {
  // Authenticate by restoring the saved session cookies.
  async init(): Promise<void> {
    try {
      await this.setCookies(COOKIES);
    } catch (error) {
      console.error('Error while connecting to Twitter', error);
    }
  }

  // Print the IDs of the latest 5 tweets returned for the user.
  async fetch(): Promise<void> {
    if (!(await this.isLoggedIn())) {
      throw new Error('Not logged in');
    }

    // getTweetsByUserId returns an async generator, so it can be consumed with for await.
    for await (const tweet of this.getTweetsByUserId('72272795', 5)) {
      console.log(tweet.id);
    }
  }
}

const scrapper = new TwitterScrapper();
scrapper.init().then(() => scrapper.fetch());

Faareoh commented 6 months ago

Hi,

After more tests, it turns out this is not linked to the company account: the scraper doesn't seem to return tweets that belong to a thread.

If I take this tweet as an example, I get it if I fetch it before the same account replies to it, but once a reply is posted, neither the original tweet nor its replies are returned by the scraper.

karashiiro commented 6 months ago

Dropping this here for reference (mostly for myself): going to the account page fires off a request to https://twitter.com/i/api/graphql/LJwZwXzqk7wHyXPa3SQt4Q/UserTweets?variables=%7B%22userId%22%3A%2272272795%22%2C%22count%22%3A20%2C%22includePromotedContent%22%3Atrue%2C%22withQuickPromoteEligibilityTweetFields%22%3Atrue%2C%22withVoice%22%3Atrue%2C%22withV2Timeline%22%3Atrue%7D&features=%7B%22responsive_web_graphql_exclude_directive_enabled%22%3Atrue%2C%22verified_phone_label_enabled%22%3Afalse%2C%22creator_subscriptions_tweet_preview_api_enabled%22%3Atrue%2C%22responsive_web_graphql_timeline_navigation_enabled%22%3Atrue%2C%22responsive_web_graphql_skip_user_profile_image_extensions_enabled%22%3Afalse%2C%22c9s_tweet_anatomy_moderator_badge_enabled%22%3Atrue%2C%22tweetypie_unmention_optimization_enabled%22%3Atrue%2C%22responsive_web_edit_tweet_api_enabled%22%3Atrue%2C%22graphql_is_translatable_rweb_tweet_is_translatable_enabled%22%3Atrue%2C%22view_counts_everywhere_api_enabled%22%3Atrue%2C%22longform_notetweets_consumption_enabled%22%3Atrue%2C%22responsive_web_twitter_article_tweet_consumption_enabled%22%3Atrue%2C%22tweet_awards_web_tipping_enabled%22%3Afalse%2C%22freedom_of_speech_not_reach_fetch_enabled%22%3Atrue%2C%22standardized_nudges_misinfo%22%3Atrue%2C%22tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled%22%3Atrue%2C%22rweb_video_timestamps_enabled%22%3Atrue%2C%22longform_notetweets_rich_text_read_enabled%22%3Atrue%2C%22longform_notetweets_inline_media_enabled%22%3Atrue%2C%22responsive_web_enhance_cards_enabled%22%3Afalse%7D, but even that exact query doesn't return the missing tweets when swapped into this line.
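For readability, the `variables` parameter of that request can be decoded with a few lines of TypeScript. This is only a sketch: `decodeJsonParam` is a hypothetical helper, and the `features` parameter is left out of the URL below for brevity (it can be decoded the same way).

```typescript
// The request URL from the comment above, with the `features` parameter omitted for brevity.
const requestUrl =
  'https://twitter.com/i/api/graphql/LJwZwXzqk7wHyXPa3SQt4Q/UserTweets' +
  '?variables=%7B%22userId%22%3A%2272272795%22%2C%22count%22%3A20' +
  '%2C%22includePromotedContent%22%3Atrue' +
  '%2C%22withQuickPromoteEligibilityTweetFields%22%3Atrue' +
  '%2C%22withVoice%22%3Atrue%2C%22withV2Timeline%22%3Atrue%7D';

// Hypothetical helper: read a query parameter and parse it as JSON.
// URLSearchParams already percent-decodes the value.
function decodeJsonParam(url: string, name: string): Record<string, unknown> | null {
  const raw = new URL(url).searchParams.get(name);
  return raw === null ? null : (JSON.parse(raw) as Record<string, unknown>);
}

const variables = decodeJsonParam(requestUrl, 'variables');
console.log(variables);
```

Decoded, the request asks for `count: 20` tweets of `userId: "72272795"` with `withV2Timeline: true`, among other flags.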

karashiiro commented 6 months ago

Looks like the missing tweets have a different structure in the API response: tweet 1760308395012165969 is included within a TimelineTimelineModule resource with the entry ID profile-conversation-1761788173412728843, rather than the expected tweet-1760308395012165969. Seems to be getting filtered out here.
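To illustrate the idea, here is a rough sketch of an entry filter that accepts both entry ID prefixes. It is not the change shipped in the library: the `TimelineEntry` shape is a hypothetical simplification of the real response (which nests tweets much deeper), and `collectTweetIds` and the sample `cursor-bottom-example` entry are made up for the example.

```typescript
// Hypothetical, heavily simplified shape of a timeline entry.
interface TimelineEntry {
  entryId: string;
  // Assumed flattening of the tweet IDs nested inside the entry.
  tweetIds: string[];
}

const TWEET_PREFIX = 'tweet-';
const CONVERSATION_PREFIX = 'profile-conversation-';

// Collect tweet IDs from standalone tweet entries and from conversation
// modules (threads); every other entry type is skipped.
function collectTweetIds(entries: TimelineEntry[]): string[] {
  const ids: string[] = [];
  for (const entry of entries) {
    const isTweet = entry.entryId.startsWith(TWEET_PREFIX);
    const isConversation = entry.entryId.startsWith(CONVERSATION_PREFIX);
    if (isTweet || isConversation) {
      ids.push(...entry.tweetIds);
    }
  }
  return ids;
}

const sampleEntries: TimelineEntry[] = [
  { entryId: 'tweet-1760349049813606561', tweetIds: ['1760349049813606561'] },
  { entryId: 'profile-conversation-1761788173412728843', tweetIds: ['1760308395012165969'] },
  { entryId: 'cursor-bottom-example', tweetIds: [] }, // illustrative non-tweet entry
];

const collected = collectTweetIds(sampleEntries);
console.log(collected); // [ '1760349049813606561', '1760308395012165969' ]
```

A filter that only matched the `tweet-` prefix would drop the second entry, which matches the symptom reported above.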

karashiiro commented 6 months ago

Should be fixed in v0.9.3.