twintproject / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
MIT License
15.7k stars 2.71k forks

[ERROR] CRITICAL:twint.run:Twint:Feed:noDataExpecting ~ Inconsistent results [High Severity] #604

Open ghost opened 4 years ago

ghost commented 4 years ago

Python 3.6, twint 2.1.7 updated from master. Have searched issues without finding anything. Running on Ubuntu 18.04, anaconda, jupyter notebook.

Commands run:

import twint
import pandas as pd

c = twint.Config()
c.Pandas=True
c.Search = "#nfl"
c.Hide_output=True
c.Since = '2019-12-01'
c.Until = '2019-12-02'

twint.run.Search(c)
df = twint.storage.panda.Tweets_df

Hi, thanks for writing this package, it's very useful. I'm clearly not using it right though. I ran the commands above as a test, using "#nfl" as a query because it's innocuous and guaranteed to have a lot of results over the course of one day, but I am getting inconsistent results.

First, when I run it I get a lot of these warnings (which, as I saw from another issue, are probably related to HTTP/HTTPS):

CRITICAL:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)

That's fine though; the script still runs. The problem is that the results are inconsistent. I ran it last night and got back 6,832 tweets, then ran it again this morning as part of testing some other code and got 4,710 tweets. When I saw that, I ran it again and got 0 tweets.

I have a couple of questions if that's okay. Is twint caching the results of queries somewhere, and if so, how do I clear the cache? Is this inconsistent behaviour expected (is it a Twitter search page thing?), and if so, does it make sense to run the same search multiple times and concatenate the results? Finally, is there a suggested best practice for searching date ranges? (i.e., if you want all the tweets for a hashtag for the past 3 months, is it better to do one big search or to break the search into daily or weekly time ranges?)

Again, thanks for this package. Great work.

pielco11 commented 4 years ago

Twint does not cache results, queries or anything else; every single piece of data comes from Twitter. It makes sense to run multiple searches when it makes sense for your use case; for example, you could monitor specific hashtags and see which users delete more tweets than others. On the other hand, you will most probably get duplicates if you don't filter the data with since and until (or other parameters).
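As a sketch of how repeated runs could be concatenated without double-counting, one could deduplicate on the tweet id. This assumes the Pandas output used earlier in the thread, where twint.storage.panda.Tweets_df exposes an "id" column; the sample frames and ids below are invented for illustration.

```python
import pandas as pd

# Two hypothetical runs of the same search; in a real run each frame
# would be a copy of twint.storage.panda.Tweets_df after one search.
run1 = pd.DataFrame({"id": [1, 2, 3], "tweet": ["a", "b", "c"]})
run2 = pd.DataFrame({"id": [2, 3, 4], "tweet": ["b", "c", "d"]})

# Concatenate the runs and keep each tweet exactly once, keyed on its id.
combined = (pd.concat([run1, run2], ignore_index=True)
            .drop_duplicates(subset="id")
            .reset_index(drop=True))
print(combined["id"].tolist())  # → [1, 2, 3, 4]
```

drop_duplicates keeps the first occurrence of each id, so the order of the earliest run wins.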

If the hashtag is really popular, I'd split the search into monthly, if not weekly, ranges.
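That splitting advice can be sketched with a small helper that generates day-long windows (the helper name is illustrative, and each window would then be passed to its own twint search):

```python
from datetime import date, timedelta

def daily_ranges(start, end):
    """Yield (since, until) day-long windows covering [start, end)."""
    day, stop = date.fromisoformat(start), date.fromisoformat(end)
    while day < stop:
        yield str(day), str(day + timedelta(days=1))
        day += timedelta(days=1)

# Each window would feed one twint search, e.g.:
#   c.Since, c.Until = since, until
#   twint.run.Search(c)
for since, until in daily_ranges("2019-12-01", "2019-12-04"):
    print(since, until)
```

Widening the step to weeks or months is just a matter of changing the timedelta.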

ghost commented 4 years ago

Thanks. So, is it normal, though, that if I run the exact same search for the same query 12 hours apart, I get very different numbers of tweets back? When I ran it yesterday I got ~6,800 results, and when I ran it again this morning I only got ~4,700. (When I ran it a third time it returned 0 tweets??)

It's not really that big a deal; I'm just hoping to understand the expected behaviour so I can explain the data to analysts. If I'm only getting a subset, or a random subset, of the data, that's fine, but I need to understand that before they ask me about it.

Thanks

pielco11 commented 4 years ago

If you run the same script over time, i.e. with the same date-time ranges and other parameters, the dataset should be "constant", since it does not depend on the time at which you run the script.

So if we have a variation, and what can vary is either the time interval or the dataset itself, I guess that (in this case) it's the dataset that's changing, most probably due to deleted tweets, accounts that go from open to closed, and accounts that get deleted/suspended.

Let's say that tweets with id = 1,2,3 are sent in the time interval (A,B). If we run the script at C with C > B, and the tweets don't get deleted/shadowed or the like, we'll always get tweets with id = 1,2,3. The interval is closed, and nothing can go in or out by itself. The only way a tweet can leave that interval is by being deleted/shadowed and so on.

ghost commented 4 years ago

Thank you. I guess my strange results were a glitch. I mean, it's unlikely that 2000 tweets got deleted for that hashtag overnight. I'll keep experimenting.

Much appreciated.

pielco11 commented 4 years ago

I can guarantee you that what Twint returns is what Twitter gives (and this can be proven: just run with --debug or config.Debug = True, and in the twint-request_urls.log file you'll see every request made; you can replicate each request with any software of your choice).

You can run the script with your own handle as target; if you always get the same tweets, everything is fine, otherwise it needs to be investigated.

Best of luck!

ghost commented 4 years ago

Thanks, I'm sure it's something I'm doing. I will keep at it.

ghost commented 4 years ago

Sorry to come back to this, but just wanted you to be aware of the results of my testing. I ran this exact code multiple times in a row, and each time it returned very different results. I don't know if there is something wrong with my code (please tell me if so!) or if there is something else going on.

(running with python 3.6 and fully updated twint)

c = twint.Config()
c.Hide_output=True
c.Pandas_clean=True
c.Pandas=True
c.Search="#nfl"
c.Since='2019-12-01'
c.Until='2019-12-02'

twint.run.Search(c)

I just ran that exact code 4 times in a row. It returned this many tweets:

Run #1: 1,909 tweets
Run #2: 280 tweets
Run #3: 13,207 tweets
Run #4: 3,015 tweets

I'm really not sure what to do with that at this point. Am I doing something wrong? On the three runs with the lower values, it also returned this:

CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!

Any advice would be welcome.

Edit: For what it's worth, it's definitely something to do with the feed being disrupted. That error is being thrown in the Feed method of Twint, and it seems to happen most of the time. I re-ran that script in a loop a bunch of times, and 13,207 seems to be the actual correct number of tweets, but it doesn't come back very often.

pielco11 commented 4 years ago

I've tried your query and I can confirm that the results are not consistent. That's really strange and needs to be investigated.

I'll keep you updated

ghost commented 4 years ago

Thanks!

pielco11 commented 4 years ago

So it seems that having HTTPS or not does not always have an effect. My findings so far are these (I've run the script 3 times):

==> nfl1.csv <==
1201274724778565632,1201274724778565632,1575241187000,2019-12-01,23:59:47,CET,1180200593962369024,yotesglendale,GlendaleCardinals,,Brandin Cooks leaves Budda Baker in the dust💨 #NFL #NFLSunday #LARams #LAvsAZ #redsea pic.twitter.com/QowdS89R0S,[],[],[],0,0,0,"['#nfl', '#nflsunday', '#larams', '#lavsaz', '#redsea']",[],https://twitter.com/YotesGlendale/status/1201274724778565632,False,,1,,,,,,,"[{'user_id': '1180200593962369024', 'username': 'YotesGlendale'}]",,,,

==> nfl2.csv <==
1201274724778565632,1201274724778565632,1575241187000,2019-12-01,23:59:47,CET,1180200593962369024,yotesglendale,GlendaleCardinals,,Brandin Cooks leaves Budda Baker in the dust💨 #NFL #NFLSunday #LARams #LAvsAZ #redsea pic.twitter.com/QowdS89R0S,[],[],[],0,0,0,"['#nfl', '#nflsunday', '#larams', '#lavsaz', '#redsea']",[],https://twitter.com/YotesGlendale/status/1201274724778565632,False,,1,,,,,,,"[{'user_id': '1180200593962369024', 'username': 'YotesGlendale'}]",,,,

==> nfl3.csv <==
1201274724778565632,1201274724778565632,1575241187000,2019-12-01,23:59:47,CET,1180200593962369024,yotesglendale,GlendaleCardinals,,Brandin Cooks leaves Budda Baker in the dust💨 #NFL #NFLSunday #LARams #LAvsAZ #redsea pic.twitter.com/QowdS89R0S,[],[],[],0,0,0,"['#nfl', '#nflsunday', '#larams', '#lavsaz', '#redsea']",[],https://twitter.com/YotesGlendale/status/1201274724778565632,False,,1,,,,,,,"[{'user_id': '1180200593962369024', 'username': 'YotesGlendale'}]",,,,

So, as we can see, Twint always starts at the same point, which is good.

Now we have to see where it stops:

==> nfl1.csv <==
1201248296645398528,1201248296645398528,1575234886000,2019-12-01,22:14:46,CET,178163508,randi_heatlifer,Randi Hilsercop,,The jets are so bad that the bengals just beat them 😂😂 #NYJvsCIN #NFL  pic.twitter.com/5OEHqX1I45,[],[],[],0,0,0,"['#nyjvscin', '#nfl']",[],https://twitter.com/Randi_heatlifer/status/1201248296645398528,False,,1,,,,,,,"[{'user_id': '178163508', 'username': 'Randi_heatlifer'}]",,,,

==> nfl2.csv <==
1200912390356783104,1200912390356783104,1575154800000,2019-12-01,00:00:00,CET,4059670933,blowoutbuzz,BlowoutBuzz,,YourDozen: Get your NFL Week 13 picks in to win 2017-19 sets >>  http://bit.ly/2Ok7pdQ  #collect @PaniniAmerica #TheHobby #NFL #predictions #picks pic.twitter.com/ZQFTduKvrh,['paniniamerica'],['http://bit.ly/2Ok7pdQ'],['https://pbs.twimg.com/media/EKkj3X2WsAALfcr.jpg'],0,0,0,"['#collect', '#thehobby', '#nfl', '#predictions', '#picks']",[],https://twitter.com/BlowoutBuzz/status/1200912390356783104,False,,0,,,,,,,"[{'user_id': '4059670933', 'username': 'BlowoutBuzz'}, {'user_id': '44128979', 'username': 'PaniniAmerica'}]",,,,

==> nfl3.csv <==
1201245479746646022,1201245479746646022,1575234215000,2019-12-01,22:03:35,CET,2691199254,seahawksreddit,/r/Seahawks,, https://ift.tt/2DyJ51V  Ravens Win! #Seahawks #NFL #GoHawks,[],['https://ift.tt/2DyJ51V'],[],0,1,6,"['#seahawks', '#nfl', '#gohawks']",[],https://twitter.com/SeahawksReddit/status/1201245479746646022,False,,0,,,,,,,"[{'user_id': '2691199254', 'username': 'SeahawksReddit'}]",,,,

If we take a look at the twint-last-request.log file (when Twint exits with an error):

      <div id="main_content">
            <div class="system">
      <div class="blue">
        <table class="content">
          <tr>
            <td>
              <div class="title">Sorry, that page doesn't exist</div>
              <div class="subtitle"><a href="/">Back to home</a></div>
            </td>
          </tr>
        </table>
      </div

If we take a look at the latest scraped tweet in nfl2.csv, we see that its time is 00:00:00 (relative to my TZ), which is good: it says that we reached the end of "my day".

A note about the time zone. If we run the same search at two different local times, we'll most probably get different results, since my start (end) of the day is different from yours. That said, our aim is not to make sure that one person gets another's results; our aim is to get the same results each time we ask for them, comparing each run's results individually. (FYI, I got 25,282 tweets.)

Reasons why the issue might be related to HTTP(s) switch:

Reasons why the issue might not be related to that switch:

What happens when Twint gets those error messages:

1) Twint changes the UserAgent: https://github.com/twintproject/twint/blob/3a4f778233257dd902f6557a38998bfcc3a046bc/twint/run.py#L89-L94
2) Twint re-runs the same request (since self.feed and self.init don't change, the request's params are still the same)

Sometimes, luckily, the error messages are not printed, even when running the same query. In that case, the only thing Twint does differently between runs is the UserAgent.

So maybe Twitter plays differently based on the UserAgent specified.

Updates soon.

pielco11 commented 4 years ago

It seems that using Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36 as the UserAgent (as suggested by @o7n in #587) allows us to get almost all the expected results.

So I suggest editing lines 158 and 159 in url.py to specify that user agent, or another suggested one. Then please run it and let me know if you get consistent results @jomorrcode

ghost commented 4 years ago

Hmmm. My installed url.py only has 146 lines (I installed twint 2.1.9 with pip3 install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint as per the instructions).

I looked through it and couldn't find any reference to a user agent. I found it in several places in run.py, but I didn't want to just start randomly changing the code without knowing what I was doing.

pielco11 commented 4 years ago

Sorry, I meant get.py, my bad: https://github.com/twintproject/twint/blob/3a4f778233257dd902f6557a38998bfcc3a046bc/twint/get.py#L155-L161

Here is what you'll end up with (see attached screenshot).

ghost commented 4 years ago

Sorry, I finally had a chance to try this. Yes, I made that change with that user agent, ran the code 4 times, and got back 13,011 tweets each time.

ghost commented 4 years ago

Just to update: running with that user agent usually seems to bring back a fairly complete list of tweets, but it does randomly fail, returning a smaller subset. I am playing with a workaround that looks at the last tweet returned to see if its timestamp is close to 00:00:00 and, if not, redoes the query. I'm not sure if there's a more effective way to detect that the scrape finished early.
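The timestamp check described above might be sketched as follows. The function name and the 30-minute cutoff are illustrative choices, and the heuristic only makes sense for high-volume queries where tweets near midnight are all but guaranteed:

```python
from datetime import time

def looks_complete(last_tweet_time, cutoff_minutes=30):
    """Heuristic: treat a day's scrape as complete if the last collected
    tweet falls within cutoff_minutes of the start of the day.
    last_tweet_time is an HH:MM:SS string as found in the CSV output."""
    t = time.fromisoformat(last_tweet_time)
    return t.hour * 60 + t.minute < cutoff_minutes

print(looks_complete("00:00:05"))  # → True (reached the day boundary)
print(looks_complete("22:14:46"))  # → False (likely stopped early)
```

When the check fails, the caller would re-run the search (or, as suggested later in the thread, resume it via a session file).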

o7n commented 4 years ago

It would help a lot if there were some way to know whether any errors occurred during the search or profile request. Right now the only indication of something going badly is a message on stderr, if any.

ghost commented 4 years ago

I thought I'd just add that after a lot more experimenting, results continue to be inconsistent regardless of the user agent, though quite randomly. Sometimes I can run the same code 3 times and get the exact same number of tweets; other times it returns a much smaller number, or even zero tweets.

pielco11 commented 4 years ago

And does it happen regardless of the query?

ghost commented 4 years ago

Huh, oddly it does seem to vary with the query. Some queries I tried seem to always return the same number; others vary a lot. Really strange. That football hashtag (#nfl) is always very variable, but something like #france or #germany seems to be consistent.

pielco11 commented 4 years ago

That makes the debugging even harder. As of now I'd exclude a flaw in Twint, so I guess the issue is somehow related to Twitter.

It'd be interesting to check whether it returns fewer tweets even when it reaches the end of the day, because if it stops early for unknown reasons, you could just resume from that point.

To try this out, just run something like twint -s "#nfl" --debug --resume "test_1.session" --since "2019-12-18" --until "2019-12-19" --csv -o "test_1.csv", or the equivalent:

import twint

c = twint.Config()
c.Search = "#nfl"
c.Debug = True
c.Resume = "test_1.session"
c.Since = "2019-12-18"
c.Until = "2019-12-19"
c.Store_csv = True
c.Output = "test_1.csv"

twint.run.Search(c)

If it does not stop at 00:00:00, you can just re-run the script/command as-is and it will resume from where it left off. You might want to apply some more complex logging, but keeping track of session files would be enough. Please consider that the session file is overwritten at every run, so you might want to do something like python3 script.py && cat test_1.session >> history.sessions or twint .... >> history.session, so that we can compare the session ids with the debug files (twint-request_urls.log is not overwritten at every run).

ghost commented 4 years ago

Oh, that is perfect, I will try that. I was trying to do something similar myself by checking whether the last tweet collected was close to 00:00:00, but when it wasn't, I had to rerun the whole script instead of just restarting from where it stopped.

So if I understand correctly, as long as the script hasn't terminated, re-running twint.run.Search(c) will restart from the last tweet collected (assuming it hadn't reached 00:00:00), so a simple loop with a check on the latest time collected should do the trick.
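That loop might be sketched like this. All the names are illustrative: run_search stands in for a twint search configured with Resume, and last_tweet_time stands in for reading the time of the last collected tweet (e.g. from the CSV output).

```python
def resumed_search(run_search, last_tweet_time, max_retries=4):
    """Re-run a resumable search until the day boundary is reached.

    Returns the number of attempts taken, or None if the boundary was
    never reached within max_retries (the caller should then alert).
    The startswith check is the thread's crude "close to 00:00:00" test.
    """
    for attempt in range(1, max_retries + 1):
        run_search()
        if last_tweet_time().startswith("00:0"):
            return attempt
    return None
```

Because the session file carries the scroll position, each retry continues from the last tweet collected rather than restarting the whole day.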

Thanks, you've been very helpful with this.

o7n commented 4 years ago

by checking for whether the last tweet collected was close to 00:00:00

That is a bit too subjective for me. When is a tweet "close" to 00:00:00 ? And for search queries with a lower volume there is no visible difference between "no tweet exists" and "no tweet was received".

ghost commented 4 years ago

That is a bit too subjective for me. When is a tweet "close" to 00:00:00 ? And for search queries with a lower volume there is no visible difference between "no tweet exists" and "no tweet was received".

This is very true. For what it's worth, because I was looking at fairly active search terms, I was considering it an incomplete search if no tweets were returned within 30 minutes of 00:00:00, but as you say, that's not actually an effective way to handle this, since it could easily happen that, by chance, there were no such tweets. I included a cut-out so that if the search ran 4 times without reaching 00:00:00, it would exit gracefully and print an alert. Again, a hack to get around this that doesn't actually work that well.

I don't have a better idea at the moment though.

ghost commented 4 years ago

I do see that the file twint-last-request.log contains a field "has_more_items":true. Is that something that can be accessed and used as a check before the script terminates, to say: if has_more_items is true, re-run the search and resume from the last tweet collected? I'm not sure where that information resides before it is written to the log file.

pielco11 commented 4 years ago

@o7n you are absolutely right, but with the sample cited above we suspect that the latest tweet is 1200912390356783104,1200912390356783104,1575154800000,2019-12-01,00:00:00,CET. So, considering this testing query (since with smaller ones I'm unable to replicate the issue), we can try to better understand what happens.

As you said, and as we all agree, that's not a general rule. Instead it's a case-specific one

pielco11 commented 4 years ago

@jomorrcode so you mean that if at request N we get has_more_items:true and at request N+1 Twint breaks, Twint should retry the query, since we know that there are more tweets?

That sounds good; indeed, when Twint breaks we lose the information about the previous request. Let me check what we can do.

pielco11 commented 4 years ago

I created a new branch for this:

https://github.com/twintproject/twint/tree/workaround-604

I'll push fixes there, so please remember to pull from that branch!

pielco11 commented 4 years ago

Added a new log file, twint-requests-deep.csv, that contains rows formatted as follows: f"had_more_items:{self._has_more_items};has_more_items:{self.has_more_items};init:{self.init};len_feed:{len(self.feed)}"

This will help us track down the requests.

If a large number of tweets is missing from a bigger set, we expect to see a difference in twint-requests-deep.csv.
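Assuming each row really is a single ';'-separated string of key:value fields as in the f-string above, a minimal parser for comparing rows across runs could look like this (the sample values below are invented for illustration):

```python
def parse_deep_row(row):
    """Split one ';'-separated row of key:value fields into a dict.

    split(":", 1) keeps only the first colon as separator, so values
    containing colons survive intact.
    """
    return dict(field.split(":", 1) for field in row.strip().split(";"))

# Hypothetical row from twint-requests-deep.csv:
row = "had_more_items:True;has_more_items:True;init:1200912390356783104;len_feed:20"
parsed = parse_deep_row(row)
print(parsed["has_more_items"], parsed["len_feed"])
```

Parsing the file from two runs and diffing the per-request len_feed values would show where a short run stopped receiving data.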

ghost commented 4 years ago

I may not have a chance to do it before the holidays, but I will do some structured tests when I get back and try to keep records of the results.

ghost commented 4 years ago

Hi, I did a couple of tests using both the Master branch and the workaround branch, each with a fresh install, running your query for "#nfl" from 2019-12-01 to 2019-12-02. I made sure to uninstall the Master branch before installing the workaround branch.

Results for 4 runs with the Master branch:

Test 1: time = 1 min, total tweets = 680, last tweet = 1201347059531468800 @ 22:47:13
Multiple errors: CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0) and CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1

Test 2: time = 10 sec, total tweets = 0
No errors returned at all

Test 3: time = 11 min, total tweets = 13,074, last tweet = 1201003007342723073 @ 00:00:05
A few errors: CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)

Test 4: time = 1 min, total tweets = 100, last tweet = 1201361765176659969 @ 23:45:39
Multiple errors: CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0) and CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1

Results for 4 runs with the workaround branch:

Test 5: time = 14 min, total tweets = 13,074, last tweet = 1201003007342723073 @ 00:00:05
Dozens of errors: CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0) and CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1

Tests 6-8: basically identical to Test 5 (between 13,070 and 13,080 tweets, all ending with the same last tweet)

If it's useful, I saved the session files, request logs, and deep-test files for each of those runs, as well as the actual tweet CSVs. Hope that helps.

AldebaranNapoli commented 4 years ago

I launched twint -s meloni --since 2019-01-01 --until 2019-12-25 --stats --count -es localhost:9200 after modifying the user agent as in https://github.com/twintproject/twint/issues/604#issuecomment-565633980 (screenshot: Schermata a 2019-12-26 21-54-31). Here is the result, with the progress dots and the many repeats of the noData error trimmed:

[+] Indexing to Elasticsearch @ localhost:9200
... CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0) (repeated many times, interleaved with progress dots) ...
CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!
[+] Finished: Successfully collected 2718 Tweets.

twint -s meloni --since 2019-01-01 --until 2019-12-25 --stats --count -o meloni.json

That one works well (I launched it before modifying the user-agent file), but it is not sending anything to the Elasticsearch DB. In that search there are no progress dots, just the tweet text drawn in the shell.

Could the problem be Elasticsearch?

Aassifh commented 4 years ago

Downgraded to version 2.1.6 and it's working!

pielco11 commented 4 years ago

Thanks @Aassifh for letting us know, I'll do some tests!

AldebaranNapoli commented 4 years ago

How did you downgrade?

Aassifh commented 4 years ago

pip install twint==2.1.6

AldebaranNapoli commented 4 years ago

I installed version 2 and it finally works... so the problem is the latest version with Elasticsearch.

UpasanaDutta98 commented 4 years ago

I was facing this issue when I travelled from Boulder to India. I was constantly shown the warning/error 'CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)' and the output was inconsistent. I tried many different things, but I kept facing this error in India. When I was back in Boulder, I ran the same piece of code again, and guess what? It ran without any errors!

So I feel that, at least in my case, it was some regional network setting w.r.t. accessing Twitter data that was causing this error, though I can't say in detail.

Aassifh commented 4 years ago

Maybe it has something to do with the timezone! Check it!

edsu commented 4 years ago

It sounds like you are battling a machine learning process that's trying to detect scraping.

nonameable commented 4 years ago

I'm starting to think the same thing, @edsu.

pielco11 commented 4 years ago

I would not exclude that; I think it's totally possible.

o7n commented 4 years ago

This feels more like simple rate limiting or capacity management mechanisms.

pielco11 commented 4 years ago

It might be that the detector uses a specific algorithm, which may or may not run in an ML process, and the effect is rate limiting.

o7n commented 4 years ago

And how would it be more effective than just rate limiting on its own? This is all just speculation; there is not a single piece of evidence for an ML process.

pielco11 commented 4 years ago

@o7n actually, nobody said for sure that there's an ML process behind this issue. We are just sharing our thoughts.

o7n commented 4 years ago

I’m just sharing my thoughts as well.

GeekOnAcid commented 4 years ago

I know it's a fresh issue, but are there any more ideas for a solution? I've tried downgrading the Twint version and it didn't work. Once I've scraped between 8-10MB of raw tweets to a CSV file, it displays the discussed error. PS: awesome tool!

pbabvey commented 4 years ago

I used to see this message while collecting tweets: CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0). The messages showed up intermittently and the process continued with no issues. Now, the process stops as soon as the first message arrives and prints a sequence of them; the messages keep showing up for ~1 minute, after which Twint can successfully collect tweets again, until the messages return. I wonder if we can hibernate the process for a few seconds and then start over. Is that possible? I haven't looked into the code yet.

pbabvey commented 4 years ago

As a temporary solution, I tried this and the collector did not fall off. I changed this part of the twint code in run.py, just before it gets a random UserAgent (note: this requires import random and import time at the top of run.py):

                if consecutive_errors_count < self.config.Retries_count:
                    #################################
                    delay = random.randint(60, 120)
                    print('sleeping for {} secs'.format(delay))
                    time.sleep(delay)
                    #################################
                    self.user_agent = await get.RandomUserAgent()

                    continue
                logme.critical(__name__+':Twint:Feed:Tweets_known_error:' + str(e))
                print(str(e) + " [x] run.Feed")
                print("[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!")
                break

It seems the collector is able to start over from where it left off after about 2 minutes. I know this is totally inefficient, so please share any better solution to counteract the Twitter limitations.

zoltanpm commented 4 years ago

So, please share your better solution to counteract the Twitter limitations.

I've also run into this error on larger scrapes. If it is, in fact, rate limiting from Twitter, what would it take to implement a config option that slows down requests by a set number of seconds (e.g., a new tweet is scraped every 3 seconds)?
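Short of patching twint itself, an external approximation of that idea is to pace successive searches (a sketch; this throttles whole runs, such as the per-day or per-week searches suggested earlier in the thread, not twint's individual internal requests):

```python
import time

def paced(fn, delay_seconds=3.0):
    """Wrap a callable so each invocation waits delay_seconds first.

    fn might be a closure around twint.run.Search for one date window;
    the wrapper just inserts a fixed pause before every call.
    """
    def wrapper(*args, **kwargs):
        time.sleep(delay_seconds)
        return fn(*args, **kwargs)
    return wrapper
```

A true per-request delay would have to live inside twint's request loop (run.py), which is what the quoted sleep-and-retry patch above effectively does on errors.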