--since not working properly

dmuth commented 4 years ago

This is a followup to my Dockerized version of Twint that I mentioned in #579.

Unfortunately, this won't be a run-of-the-mill environment because Docker, but I'll get into that further down.

[3.7] Python version is 3.6;
- Python is 3.7, because that's what ships in the latest version of Docker's Alpine image. If this is a serious impediment to investigation, I can try building a 3.6 image, but it may take some work
[DOCKER] Updated Twint with pip3 install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint;
- The bad news is that I had to clone your repo so I could remove Pandas, as described in #579. Sorry about that. The good news is that the Docker image is essentially a clean build. So that rules out the "ancient version of Twint" possibility.
[X] I have searched the issues and there are no duplicates of this issue/question/request.
- I searched through the first two pages of results from a search for is:issue since and nothing popped out at me.

Command Ran

$ bash <(curl -s https://raw.githubusercontent.com/dmuth/twint-splunk/master/twint) -u dmuth --since 2009-08-01 --until 2009-09-01 --json -o tweets.json | pv -l >/dev/null
$ tail -n1 tweets.json | jq -r .id,.date
2455812675
2009-07-03

Description of Issue

The oldest tweet fetched is from July 3, 2009. Expectation was that no tweet would be older than August 1st, 2009.
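One way to check this expectation against the whole output file, rather than only its last line, is a short Python script. This is a minimal sketch: it assumes the output has one JSON object per line with `id` and `date` fields (as the jq command above suggests), and it treats both bounds as inclusive, which may not match how Twint interprets --until. The function name `out_of_range` is hypothetical.

```python
import json
from datetime import date


def out_of_range(lines, since, until):
    """Return (id, date) pairs for tweets dated outside [since, until].

    `lines` is any iterable of JSON strings, one tweet per line.
    Treating both bounds as inclusive is an assumption of this sketch.
    """
    bad = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        tweet_date = date.fromisoformat(tweet["date"])
        if not (since <= tweet_date <= until):
            bad.append((tweet["id"], tweet["date"]))
    return bad


# Two records built from the ids and dates quoted in this thread.
sample = [
    '{"id": 3063824418, "date": "2009-08-01"}',
    '{"id": 2455812675, "date": "2009-07-03"}',
]
print(out_of_range(sample, date(2009, 8, 1), date(2009, 9, 1)))
```

To run it against the real output, pass an open file handle for tweets.json in place of `sample`.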

Environment Details

Docker on OS/X, as described in #579. The command above should work without issue on any machine with Docker installed.

Let me know if you need anything else, thank you!

-- Doug

pielco11 commented 4 years ago

I fixed what seems to be the error. Basically, I was splitting the whole interval into smaller pieces to prevent Twitter from blocking our requests, since it can see that we are (or could be) querying a large date range.

In the past, Twitter sometimes stopped well before reaching the Since date, and we suspected that it was blocking our "too large" request. So the workaround was to query Twitter over smaller date ranges.

So I removed this option, since it creates more trouble than it solves. The easy solution to the blocking problem is just to run Twint over smaller date ranges, which has the same effect.
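Running Twint over smaller date ranges can be scripted by splitting the overall interval first. The following is a minimal sketch, not Twint's removed internal logic: the helper name `split_range` and the 10-day window size are assumptions chosen for illustration, and it only prints the --since/--until pairs rather than invoking Twint.

```python
from datetime import date, timedelta


def split_range(since, until, days=7):
    """Yield (start, end) date pairs that cover since..until in order.

    Each window spans at most `days` days; the last one may be shorter.
    Adjacent windows share a boundary date.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    step = timedelta(days=days)
    start = since
    while start < until:
        end = min(start + step, until)
        yield start, end
        start = end


# Print one --since/--until pair per window for August 2009.
for start, end in split_range(date(2009, 8, 1), date(2009, 9, 1), days=10):
    print(f"--since {start} --until {end}")
```

Because adjacent windows share a boundary date, a tweet dated exactly on a boundary could be fetched twice, depending on how Twint treats --until. Deduplicating the combined output by tweet id would guard against that.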

Now, before going any further, could you please upgrade Twint with pip3 install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint and let me know your results? In my tests everything went as expected and the issue seems to be resolved.

dmuth commented 4 years ago

Hi,

Totally understandable--it's not like Twitter has any obligation to make this easy for us. :-) I confirmed that this works:

$ tail -n1 tweets.json | jq -r .id,.date
3063824418
2009-08-01

$ head -n1 tweets.json | jq -r .id,.date
3676643901
2009-08-31

However, while running that command, I saw this pop up 6 times:

CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)

Is that anything we should worry about? If not, I'll close this out, but I think adjusting the severity would be a good idea. :-)
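For context on the message text: "Expecting value: line 1 column 1 (char 0)" is exactly what Python's standard json module reports when asked to parse an empty string, so one plausible reading is that a response body was empty or not JSON. How Twint arrives at this log line internally is not confirmed here. The sketch below only reproduces the message, and the function name `parse_failure_message` is hypothetical.

```python
import json


def parse_failure_message(body):
    """Return the json module's error message for `body`, or None if it parses."""
    try:
        json.loads(body)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


# An empty body reproduces the text seen in the CRITICAL log line.
print(parse_failure_message(""))
# A well-formed body parses cleanly, so there is no message.
print(parse_failure_message('{"items_html": ""}'))
```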

Thanks,

-- Doug

pielco11 commented 4 years ago

I do believe that those errors are strongly related to #567

If you go into url.py and modify the base URLs (in the first lines of code), replacing https with http, and then go into get.py and edit the Response function, replacing ssl=True with ssl=False, you should get fewer error messages. At least that's what I get.

Feel free to provide any feedback/suggestion

PS: I'd rather discuss those errors in the issue dedicated to them, just to keep everything in the right place.

dmuth commented 4 years ago

Hmm, that's really strange, since those errors don't seem to relate to SSL at all. As it stands, turning off SSL is not a great idea so I won't. :-)

Since the bug reported has been fixed, I'm gonna close this out. I may open another bug in the future about that CRITICAL, depending how often it comes up and if I see it affecting the data retrieved in any way.

-- Doug
