scraperwiki / twitter-search-tool

ScraperWiki tool to get Tweets matching a search term; tool now defunct, though the code is here for reference.
https://blog.scraperwiki.com/2014/08/the-story-of-getting-twitter-data-and-its-missing-middle/

max(id) / min(id) will make holes #27

Closed frabcus closed 10 years ago

frabcus commented 10 years ago

It gets the Tweets above/below max/min, and then saves them.

If that save crashes midway through, it could leave a hole in Tweets.

frabcus commented 10 years ago

This needs verifying, as dumptruck does try to save all the data in one transaction.

frabcus commented 10 years ago

Also note that we're doing min/max on a str, not on the 64-bit int that Twitter ids really are. Not sure if this matters in this context.
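To illustrate the str-vs-int point: Python compares strings lexicographically, so min()/max() give the wrong answer as soon as two ids have different digit counts. This toy snippet is mine, not from the tool:

```python
# Lexicographic string comparison disagrees with numeric comparison
# as soon as the ids have different lengths (17 vs 18 digits here).
ids = ["99999999999999999", "100000000000000000"]

print(max(ids))           # '99999999999999999'  -- wrong, string compare
print(max(ids, key=int))  # '100000000000000000' -- right, numeric compare
```

So whether this bites depends on whether a search ever spans ids of different lengths.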

frabcus commented 10 years ago

The fragment of log is interesting (more in ~g5fzncq/20140506-all.log on premium) - @drj11 pressed the monitor button lots of times.

It shows lots of interleaved instances running, which is suspicious.

I think it explains https://github.com/scraperwiki/twitter-search-tool/issues/19. These are the last few set_status_and_exit lines. They are interleaved with a big time gap, and essentially leave the tool in the status it had when the first process started (i.e. just clearing the backlog, not monitoring).

2014-05-06T15:18:13.885833 12321 set_status_and_exit mode='monitoring', status='ok-updating', type='ok', message=''
2014-05-06T15:18:14.082798 12381 set_status_and_exit mode='clearing-backlog', status='ok-updating', type='ok', message=''
2014-05-06T15:21:17.973979 11796 set_status_and_exit mode=u'clearing-backlog', status='ok-updating', type='ok', message=''
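If the interleaving is the problem, nothing in the log suggests anything stops two instances running at once. One common guard (a sketch of my own, not something the tool does) is an exclusive lock file that makes a second instance exit immediately:

```python
import fcntl
import sys

def acquire_instance_lock(path="/tmp/twitter-search-tool.lock"):
    """Exit early if another instance already holds the lock.

    The lock is released automatically when the process dies,
    so a killed instance can't wedge later runs.
    """
    fd = open(path, "w")
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        sys.exit("another instance is already running")
    return fd  # keep the handle alive for the process lifetime
```

The lock file path here is hypothetical; the real tool would want one per dataset.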
frabcus commented 10 years ago

Going back to the original hypothesis that this is dumptruck crashing. Using this test.py script:

#!/usr/bin/python

import scraperwiki

# Build 10,000 rows, then save them all in a single call so that
# dumptruck wraps them in one transaction.
datas = []
for x in range(0, 10000):
    data = {'id_str': x, 'name': 'mouse'}
    datas.append(data)

print scraperwiki.sql.save(['id_str'], datas, table_name="tweets")

It takes 4 to 5 seconds to run. If I interrupt it brutally after two seconds with gtimeout, whether set to use SIGTERM or SIGKILL, the database ends up with nothing in it.

bat:sqlitekilltest francis$ rm scraperwiki.sqlite; time ./test.py 
None

real    0m4.633s
user    0m4.462s
sys 0m0.127s
bat:sqlitekilltest francis$ echo "select count(*) from tweets;" | sqlite3 scraperwiki.sqlite
10000
bat:sqlitekilltest francis$ rm scraperwiki.sqlite; gtimeout 2 time ./test.py 
bat:sqlitekilltest francis$ echo "select count(*) from tweets;" | sqlite3 scraperwiki.sqlite
0
bat:sqlitekilltest francis$ rm scraperwiki.sqlite; gtimeout --signal=9 2 time ./test.py 
Killed: 9
bat:sqlitekilltest francis$ echo "select count(*) from tweets;" | sqlite3 scraperwiki.sqlite
0

I feel at least confident that dumptruck is correctly using transactions, so from this point of view I wouldn't expect locking errors (e.g. with web interface also accessing the SQLite database) or other crashing errors mid-save to cause missing tweets.
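The all-or-nothing behaviour above matches SQLite's transaction semantics, which can be demonstrated with just the stdlib sqlite3 module (a minimal sketch, not dumptruck itself):

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "scraperwiki.sqlite")

writer = sqlite3.connect(db)
writer.execute("CREATE TABLE tweets (id_str TEXT PRIMARY KEY, name TEXT)")
writer.commit()

# Insert rows inside one open transaction but never commit,
# simulating a process killed mid-save.
writer.executemany("INSERT INTO tweets VALUES (?, ?)",
                   ((str(x), "mouse") for x in range(1000)))

# A second connection (like the web interface reading the file)
# sees none of the uncommitted rows: the save is all-or-nothing.
reader = sqlite3.connect(db)
count = reader.execute("SELECT count(*) FROM tweets").fetchone()[0]
print(count)  # 0
```

Which is exactly why a mid-save crash should leave the table at its previous state rather than half-written, ruling out this class of hole.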

frabcus commented 10 years ago

Example of missing tweets with logging available:

https://app.intercom.io/apps/63b0c6d4bb5f0867b6e93b0be9b569fb3a7ab1e3/conversations/392931884

frabcus commented 10 years ago

From that last example, the user reports a gap between 2014-05-11 20:04:40 and 2014-05-11 21:00:57. This is the number of Tweets per minute, by created_at, in that period.

2014-05-11 17:01|4
2014-05-11 17:02|6
2014-05-11 17:03|2
2014-05-11 18:02|12
2014-05-11 18:03|3
2014-05-11 19:04|15
2014-05-11 20:04|15
2014-05-11 21:00|1
2014-05-11 21:01|4
2014-05-11 21:02|4
2014-05-11 21:03|5
pwaller commented 10 years ago

Note: I'm not thinking about this one much, let me know if you'd like to direct me at it.

frabcus commented 10 years ago

Some ideas:

frabcus commented 10 years ago

It is a card in this sprint, so yes, but finish the current card first :)

On Tue, May 13, 2014 at 01:08:20AM -0700, Peter Waller wrote:

> Note: I'm not thinking about this one much, let me know if you'd like to direct me at it.


Reply to this email directly or view it on GitHub: https://github.com/scraperwiki/twitter-search-tool/issues/27#issuecomment-42928032

frabcus commented 10 years ago

Obvious ways to make progress are 1) somehow change our algorithm using the search API, 2) change the result_type, 3) change to the streaming API filter.

frabcus commented 10 years ago

Investigating 2) change the result_type

This is a search for "kitten" with each of the four different result_types. Note that, from looking at the ids, the default seems to be recent.

hbrsh2y@cobalt-p:~$ ./tool/twfiddle.py 
--- result_type default 15 [466149635199750144, 466149662861168640, 466149665852104704, 466149673611186176, 466149679785603072, 466149708428496896, 466149813398953984, 466149815223463937, 466149816045535232, 466149826116087808, 466149838438924288, 466149854599577600, 466149866763071489, 466149875122335744, 466149882013569024]
--- result_type recent 15 [466149635199750144, 466149662861168640, 466149665852104704, 466149673611186176, 466149679785603072, 466149708428496896, 466149813398953984, 466149815223463937, 466149816045535232, 466149826116087808, 466149838438924288, 466149854599577600, 466149866763071489, 466149875122335744, 466149882013569024]
--- result_type mixed 30 [465578430306607104, 465583786332528641, 465611760469168128, 465615396754579456, 465632106149056512, 465753693883600896, 465833788845862913, 465862724082085889, 465907064439861249, 465921610411954176, 465937488213991424, 465944294894100480, 466059886527139840, 466097774165905408, 466145951116038144, 466149635199750144, 466149662861168640, 466149665852104704, 466149673611186176, 466149679785603072, 466149708428496896, 466149813398953984, 466149815223463937, 466149816045535232, 466149826116087808, 466149838438924288, 466149854599577600, 466149866763071489, 466149875122335744, 466149882013569024]
--- result_type popular 15 [465578430306607104, 465583786332528641, 465611760469168128, 465615396754579456, 465632106149056512, 465753693883600896, 465833788845862913, 465862724082085889, 465907064439861249, 465921610411954176, 465937488213991424, 465944294894100480, 466059886527139840, 466097774165905408, 466145951116038144]
frabcus commented 10 years ago

OK, so investigating how since_id, max_id and recent work in disharmony.

In short: when you do since_id with a recent-type search (the default), you always get the most recent tweets. So it can leave you a gap: it doesn't go forward from since_id, it just makes sure it doesn't go back past it.

Whereas, in contrast, max_id with a recent-type search always returns the Tweet you specified in max_id as the last one in the array you get back.

So we need to make the algorithm more.... okazaki!
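Given those semantics, one hole-free shape for the algorithm is to always page backwards from the newest tweet with max_id until it overlaps what's already saved, instead of trusting since_id. A sketch under those assumptions; search() is a hypothetical stand-in for the Twitter call, and tweets are dicts with an integer 'id':

```python
def fetch_all_new(search, newest_saved_id):
    """Page backwards with max_id until we reach tweets we already have.

    search(max_id=None) is assumed to return a batch of tweets,
    newest first; max_id is inclusive, as the Twitter search API's is.
    """
    collected = []
    max_id = None
    while True:
        batch = search(max_id=max_id)
        if not batch:
            break
        for tweet in batch:
            if tweet["id"] <= newest_saved_id:
                return collected  # overlapped existing data: no hole
            collected.append(tweet)
        # ask for everything strictly older than the oldest tweet seen
        max_id = batch[-1]["id"] - 1
    return collected
```

Because the loop only stops once it has seen an id it already holds (or runs out of results), a crash mid-run just means the next run re-walks the same range, rather than leaving a gap.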

frabcus commented 10 years ago

Working on branch https://github.com/scraperwiki/twitter-search-tool/tree/fix-min-max-bug

frabcus commented 10 years ago

See pull request https://github.com/scraperwiki/twitter-search-tool/pull/38