twintproject / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
MIT License
15.78k stars 2.72k forks

Huge Feature possibility -- Ability to use more precise "Since" and "Until" equivalents #486

Open pushshift opened 5 years ago

pushshift commented 5 years ago

As we all know, the Twitter search feature only accepts a date (with no time of day) for since and until, which is a huge pain in the ass when you need to resume from a specific point. However, you can pass max_id and min_id to Twitter search. Here is an example: https://twitter.com/search?q=truck%20max_id%3A1145726304651747328&src=typed_query&f=live

You're probably thinking, "That's great, but those aren't datetimes." Well, the datetime of any tweet made with the snowflake implementation is baked into its ID! So you can translate a datetime object to a Twitter ID and use that ID as a boundary marker to simulate much more precise since and until flags.

Here's the code to convert a Twitter ID to millisecond epoch:

(tweet_id >> 22) + 1288834974657 -- This gives the millisecond epoch of when the tweet was created.

Now here's the magical one:

(millisecond_epoch - 1288834974657) << 22 = tweet id
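As a minimal sketch, the two formulas above can be wrapped in a pair of Python helpers. The names `TWITTER_EPOCH_MS`, `tweet_id_to_ms`, and `ms_to_tweet_id` are made up for illustration and are not part of twint. Note that the left shift fills the low 22 bits with zeros, so the result is the smallest possible ID for that millisecond rather than the ID of a real tweet.

```python
# Offset used in the formulas above (milliseconds since the Unix epoch).
TWITTER_EPOCH_MS = 1288834974657


def tweet_id_to_ms(tweet_id: int) -> int:
    """Return the millisecond epoch encoded in a snowflake tweet ID."""
    # Dropping the low 22 bits leaves the timestamp part of the ID.
    return (tweet_id >> 22) + TWITTER_EPOCH_MS


def ms_to_tweet_id(millisecond_epoch: int) -> int:
    """Return the smallest snowflake ID that maps to the given millisecond."""
    return (millisecond_epoch - TWITTER_EPOCH_MS) << 22
```

Because the low 22 bits are discarded on the way back, any ID created within the same millisecond converts to the same timestamp.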

So let's say we want to get Tweets that have the term "magic" in them from February 3, 2015 at 9:37 am Eastern Standard Time. First, we need to convert that date to millisecond epoch. In epoch seconds, that is 1422974220 for the start of the minute and 1422974280 for the end of the minute (60 seconds later). We multiply them by 1,000 and use the formula above to get the min_id and max_id boundaries:

min_id = (1422974220000 - 1288834974657) << 22 = 562620773299126272
max_id = (1422974280000 - 1288834974657) << 22 = 562621024957366272
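The same arithmetic can be reproduced from a timezone-aware datetime. This is only a sketch: `datetime_to_tweet_id` and the fixed-offset `EST` zone are illustrative names, and the function assumes the caller always passes an aware datetime.

```python
from datetime import datetime, timedelta, timezone

TWITTER_EPOCH_MS = 1288834974657
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
EST = timezone(timedelta(hours=-5))  # fixed UTC-5 offset, no DST handling


def datetime_to_tweet_id(dt: datetime) -> int:
    """Convert an aware datetime to the lowest tweet ID for that millisecond."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    ms = (dt - UNIX_EPOCH) // timedelta(milliseconds=1)
    return (ms - TWITTER_EPOCH_MS) << 22


start = datetime(2015, 2, 3, 9, 37, tzinfo=EST)
min_id = datetime_to_tweet_id(start)
max_id = datetime_to_tweet_id(start + timedelta(minutes=1))
print(min_id, max_id)
```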

Now let's test this on Twitter:

https://twitter.com/search?q=the%20max_id%3A562621024957366272&src=typed_query&f=live

Twitter seems to have problems when both min_id and max_id are passed at once, but max_id on its own does indeed show tweets with "magic" in them starting exactly at 2015-02-03 9:37 am Eastern time.

This should open the door to a lot of really cool possibilities including more exact timeline targeting for search and resume capabilities since we can resume at a specific time.

pushshift commented 5 years ago

Here is the equivalent command in twint:

twint -s "magic max_id:562621024957366272"

What I suggest is that we make a flag for this that will automatically convert YYYY-MM-DD HH:MM:SS to the correct max_id. This will allow people to target very specific parts of the timeline down to the second.

So something like:

twint -s "magic" --precise_until 2015-02-03 14:37:00

We could even just replace the current until and since with this more precise method. I believe Twitter will allow min_id or max_id but not both -- but that shouldn't really be an issue. This will be a HUGE help to get around a lot of problems with since and until being so inaccurate.
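A rough sketch of what such a flag might do internally: parse the timestamp, convert it to an ID, and append a max_id operator to the search string. The helper name `with_precise_until` is hypothetical, the flag itself is only a proposal, and treating the input as UTC is an assumption made here for simplicity.

```python
from datetime import datetime, timedelta, timezone

TWITTER_EPOCH_MS = 1288834974657
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def with_precise_until(query: str, until: str) -> str:
    """Append a max_id operator derived from 'YYYY-MM-DD HH:MM:SS' (assumed UTC)."""
    dt = datetime.strptime(until, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    ms = (dt - UNIX_EPOCH) // timedelta(milliseconds=1)
    max_id = (ms - TWITTER_EPOCH_MS) << 22
    return f"{query} max_id:{max_id}"


# 14:38:00 UTC is the end of the 9:37 am Eastern minute used earlier.
print(with_precise_until("magic", "2015-02-03 14:38:00"))
```

The resulting string could then be passed to `twint -s` in the same way as the hand-built query.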

pielco11 commented 5 years ago

That would be an amazing feature!

Confirmed that using max_id and min_id together doesn't work, but we actually need max_id only in the first request, and can then just place min_id in the further requests until the "limit" is reached
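Read literally, that could look something like the sketch below, where only one of the two operators is ever sent per request. `build_query` and its parameters are hypothetical and do not correspond to existing twint code. The loop that stops at the limit is left out.

```python
def build_query(term: str, first_request: bool, min_id: int, max_id: int) -> str:
    """Pick a single ID operator per request, never both at once."""
    if first_request:
        # The first request pins the upper boundary of the timeline.
        return f"{term} max_id:{max_id}"
    # Later requests carry only the lower boundary.
    return f"{term} min_id:{min_id}"


print(build_query("magic", True, 562620773299126272, 562621024957366272))
print(build_query("magic", False, 562620773299126272, 562621024957366272))
```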

llunn commented 4 years ago

@pielco11

> Confirmed that using max_id and min_id together doesn't work, but we actually need max_id only in the first request, and can then just place min_id in the further requests until the "limit" is reached

Just to clarify, are these statements correct?

  1. On the first request, the value of init passed to url.Search will always be -1

  2. If the value of init is -1 and config.Since is defined then on the first request max_id needs to be set.

    If the value of init is not -1 and config.Until is set then min_id needs to be set based on config.Until until Limit is reached or min_id is in the feed.

    If the value of init is not -1 and config.Until is not set, then neither min_id nor max_id is required in subsequent requests (i.e. max_position as defined by init is controlling the feed at this point, until Limit tweets have been returned or no more data is encountered).

  3. If the value of init is -1 and config.Until is defined then on the first request min_id needs to be set based on config.Until.

    On subsequent requests, min_id should be set based on config.Until.

    Requests should continue up until min_id is in the feed or Limit has been reached.