twintproject / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
MIT License

--since not working properly #580

Closed. dmuth closed this issue 4 years ago

dmuth commented 4 years ago

This is a followup to my Dockerized version of Twint that I mentioned in #579.

Unfortunately, this won't be a run-of-the-mill environment because Docker, but I'll get into that further down.

Command Ran

$ bash <(curl -s https://raw.githubusercontent.com/dmuth/twint-splunk/master/twint) -u dmuth --since 2009-08-01 --until 2009-09-01 --json -o tweets.json | pv -l >/dev/null
$ tail -n1 tweets.json | jq -r .id,.date
2455812675
2009-07-03

Description of Issue

The oldest tweet fetched is from July 3, 2009. Expectation was that no tweet would be older than August 1st, 2009.
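The `tail -n1` check above only looks at the last line of the file. A minimal sketch of checking every line is below. It assumes the output is one JSON object per line with `id` and `date` fields (as the `jq` call above suggests) and that `--until` is inclusive, which is not confirmed here. The function name `dates_outside_range` is made up for illustration, and the sample lines reuse id/date pairs quoted in this thread.

```python
import json
from datetime import date


def dates_outside_range(lines, since, until):
    """Return (id, date) pairs for tweets dated outside [since, until].

    Assumes each non-empty line is a JSON object with "id" and a
    "YYYY-MM-DD" "date" field; treating `until` as inclusive is an assumption.
    """
    outside = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        tweet_date = date.fromisoformat(tweet["date"])
        if tweet_date < since or tweet_date > until:
            outside.append((tweet["id"], tweet["date"]))
    return outside


# Sample lines standing in for the contents of tweets.json.
sample = [
    '{"id": 3676643901, "date": "2009-08-31"}',
    '{"id": 3063824418, "date": "2009-08-01"}',
    '{"id": 2455812675, "date": "2009-07-03"}',
]
print(dates_outside_range(sample, date(2009, 8, 1), date(2009, 9, 1)))
```

To run it against a real file, pass the open file object instead of `sample`, since iterating a text file yields its lines.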

Environment Details

Docker on OS/X, as described in #579. The command above should work without issue on any machine with Docker installed.

Let me know if you need anything else, thank you!

-- Doug

pielco11 commented 4 years ago

I fixed what seems to be the error. Basically, I was splitting the whole interval into smaller pieces to prevent Twitter from blocking our requests, since it sees that we are (or could be) querying a large date range.

In the past, Twitter sometimes stopped well before reaching the Since date, and we suspected that it was blocking our "too large" request. So the workaround was to ask Twitter for smaller date ranges.

So I removed this option, since it creates more trouble than it solves. The easy solution to the described issue is just to run Twint with smaller date ranges, which turns out to have the same effect.
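As a rough illustration of running Twint over smaller ranges, here is a sketch that splits an interval into windows and runs one search per window. The helper names `split_range` and `scrape_in_windows`, the 10-day window size, and the per-window output file names are all made up for this example. The `twint.Config` attributes used (`Username`, `Since`, `Until`, `Store_json`, `Output`) and `twint.run.Search` are as I understand them from the Twint README, so treat them as assumptions to check against your installed version. Whether adjacent windows duplicate tweets at their shared boundary depends on how Twint treats the Until date, which is not established here.

```python
from datetime import date, timedelta


def split_range(since, until, days=7):
    """Yield (start, end) date pairs covering since..until in chunks of at most `days`."""
    step = timedelta(days=days)
    start = since
    while start < until:
        end = min(start + step, until)
        yield start, end
        start = end


def scrape_in_windows(username, since, until, days=7):
    """Run one Twint search per window (hypothetical helper; requires twint)."""
    import twint  # imported lazily so split_range works without twint installed

    for start, end in split_range(since, until, days):
        c = twint.Config()
        c.Username = username
        c.Since = start.isoformat()
        c.Until = end.isoformat()
        c.Store_json = True
        # One file per window, to avoid assuming how output is merged across runs.
        c.Output = f"tweets_{start.isoformat()}.json"
        twint.run.Search(c)


windows = list(split_range(date(2009, 8, 1), date(2009, 9, 1), days=10))
for start, end in windows:
    print(start, end)
```

Calling `scrape_in_windows("dmuth", date(2009, 8, 1), date(2009, 9, 1), days=10)` would then issue four smaller searches instead of one month-long one.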

Now, before going any further, I kindly ask you to upgrade twint with pip3 install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint and let me know your results. In my tests everything went as expected and the issue seems to be resolved.

dmuth commented 4 years ago

Hi,

Totally understandable--it's not like Twitter has any obligation to make this easy for us. :-) I confirmed that this works:

$ tail -n1 tweets.json | jq -r .id,.date
3063824418
2009-08-01

$ head -n1 tweets.json | jq -r .id,.date
3676643901
2009-08-31

However, while running that command, I saw this pop up 6 times:

CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)

Is that anything we should worry about? If not, I'll close this out, but I think adjusting the severity would be a good idea. :-)

Thanks,

-- Doug

pielco11 commented 4 years ago

I do believe that those errors are strongly related to #567

If you go into url.py and modify the base_urls (in the first lines of code), replacing https with http, and then go into get.py and edit the Response function, replacing ssl=True with ssl=False, you should get fewer error messages. At least that's what I get.

Feel free to provide any feedback/suggestion

PS: I'd rather discuss these errors in that issue, just to keep everything in the right place.

dmuth commented 4 years ago

Hmm, that's really strange, since those errors don't seem to relate to SSL at all. As it stands, turning off SSL is not a great idea so I won't. :-)
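For what it's worth, the tail of that log line matches the message Python's `json` module produces when asked to parse an empty or non-JSON string. That suggests, but does not confirm, that a response body could not be decoded. The sketch below only demonstrates the standard-library behaviour; the function name `describe_parse_failure` and the sample bodies are made up, and nothing here shows what Twitter actually returned.

```python
import json


def describe_parse_failure(body):
    """Return the JSON decode error message for `body`, or None if it parses."""
    try:
        json.loads(body)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


# An empty body and an HTML body both fail at the very first character.
print(describe_parse_failure(""))
print(describe_parse_failure("<html>not json</html>"))
print(describe_parse_failure('{"ok": true}'))
```

The first two calls print `Expecting value: line 1 column 1 (char 0)` and the third prints `None`.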

Since the reported bug has been fixed, I'm gonna close this out. I may open another bug in the future about that CRITICAL, depending on how often it comes up and whether I see it affecting the data retrieved in any way.

-- Doug