This needs verifying, as dumptruck does try to save all the data in one transaction.
Also note that we're doing min/max on a str, not the 64-bit int that Twitter ids really are. Not sure if this matters in this context.
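For what it's worth, a tiny illustration (mine, not from the tool) of how min/max on strings can diverge from min/max on 64-bit ints once the ids differ in length:

# String comparison is lexicographic, so a shorter id can sort *after* a longer one.
ids_as_str = ["99999999999999999", "100000000000000000"]
ids_as_int = [int(i) for i in ids_as_str]

print(max(ids_as_str))  # '99999999999999999'  - lexicographically largest, numerically wrong
print(max(ids_as_int))  # 100000000000000000   - numerically largest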
This fragment of log is interesting (more in ~g5fzncq/20140506-all.log on premium) - @drj11 pressed the monitor button lots of times.
It shows lots of interleaved instances running, which is suspicious.
It explains https://github.com/scraperwiki/twitter-search-tool/issues/19, I think. These are the last few set_status_and_exit lines. They are interleaved with a big time gap between them, and essentially leave the status as it was when the first process started (i.e. just clearing the backlog, not monitoring).
2014-05-06T15:18:13.885833 12321 set_status_and_exit mode='monitoring', status='ok-updating', type='ok', message=''
2014-05-06T15:18:14.082798 12381 set_status_and_exit mode='clearing-backlog', status='ok-updating', type='ok', message=''
2014-05-06T15:21:17.973979 11796 set_status_and_exit mode=u'clearing-backlog', status='ok-updating', type='ok', message=''
Going back to the original hypothesis that this is dumptruck crashing. Using this test.py script:
#!/usr/bin/python
import scraperwiki

datas = []
for x in range(0, 10000):
    data = { 'id_str' : x, 'name': 'mouse' }
    datas.append(data)
print scraperwiki.sql.save(['id_str'], datas, table_name="tweets")
It takes 4 to 5 seconds to run. If I interrupt it brutally after two seconds with gtimeout, whether set to use SIGTERM or SIGKILL, the database ends up with nothing in it.
bat:sqlitekilltest francis$ rm scraperwiki.sqlite; time ./test.py
None
real 0m4.633s
user 0m4.462s
sys 0m0.127s
bat:sqlitekilltest francis$ echo "select count(*) from tweets;" | sqlite3 scraperwiki.sqlite
10000
bat:sqlitekilltest francis$ rm scraperwiki.sqlite; gtimeout 2 time ./test.py
bat:sqlitekilltest francis$ echo "select count(*) from tweets;" | sqlite3 scraperwiki.sqlite
0
bat:sqlitekilltest francis$ rm scraperwiki.sqlite; gtimeout --signal=9 2 time ./test.py
Killed: 9
bat:sqlitekilltest francis$ echo "select count(*) from tweets;" | sqlite3 scraperwiki.sqlite
0
I feel at least fairly confident that dumptruck is correctly using transactions, so from this point of view I wouldn't expect locking errors (e.g. from the web interface also accessing the SQLite database) or other crashes mid-save to cause missing tweets.
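For reference, this is the behaviour I'd expect from any client that wraps the whole batch in one transaction. A stand-alone sketch using the stdlib sqlite3 module (not dumptruck itself) shows the same all-or-nothing result when a save is aborted mid-batch:

import sqlite3

conn = sqlite3.connect("killtest.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS tweets (id_str TEXT PRIMARY KEY, name TEXT)")
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on exception
        for x in range(10000):
            conn.execute("INSERT OR REPLACE INTO tweets VALUES (?, ?)", (str(x), "mouse"))
            if x == 5000:
                raise KeyboardInterrupt("simulated interruption mid-save")
except KeyboardInterrupt:
    pass

# The whole batch is rolled back, so this prints 0 - no partial save.
# A real SIGKILL isn't an exception, but SQLite's journal discards the
# uncommitted transaction on next open, matching the gtimeout test above.
print(conn.execute("SELECT count(*) FROM tweets").fetchone()[0])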
Example of missing tweets with logging available:
https://app.intercom.io/apps/63b0c6d4bb5f0867b6e93b0be9b569fb3a7ab1e3/conversations/392931884
From that last example the user reports a gap between 2014-05-11 20:04:40 and 2014-05-11 21:00:57. This is the number of Tweets per minute by created_at around that period.
2014-05-11 17:01|4
2014-05-11 17:02|6
2014-05-11 17:03|2
2014-05-11 18:02|12
2014-05-11 18:03|3
2014-05-11 19:04|15
2014-05-11 20:04|15
2014-05-11 21:00|1
2014-05-11 21:01|4
2014-05-11 21:02|4
2014-05-11 21:03|5
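For the record, a query along these lines reproduces that kind of per-minute breakdown - the table name, the created_at column and its ISO-ish storage format are assumptions based on the output above, not checked against the tool's schema:

import sqlite3

conn = sqlite3.connect("scraperwiki.sqlite")
# Group tweets by minute of created_at around the reported gap.
rows = conn.execute("""
    SELECT strftime('%Y-%m-%d %H:%M', created_at) AS minute, count(*)
    FROM tweets
    WHERE created_at BETWEEN '2014-05-11 17:00:00' AND '2014-05-11 21:05:00'
    GROUP BY minute
    ORDER BY minute
""")
for minute, n in rows:
    print("%s|%d" % (minute, n))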
Note: I'm not thinking about this one much, let me know if you'd like to direct me at it.
Some ideas:
- result_type property set to recent
- statuses/filter (https://dev.twitter.com/docs/api/1.1/post/statuses/filter) - may need new infrastructure to do; not sure we can stream separately from each box for one user

It is a card in this sprint, so yes, but finish the current card first :)
On Tue, May 13, 2014 at 01:08:20AM -0700, Peter Waller wrote:
Note: I'm not thinking about this one much, let me know if you'd like to direct me at it.
Obvious things to progress with are: 1) somehow change our algorithm that uses the search API, 2) change the result_type, 3) change to the streaming API's statuses/filter.
Investigating 2), changing the result_type.
This does a search for kitten with each of the four different result_types. Note that, judging by the ids, the default seems to be recent.
hbrsh2y@cobalt-p:~$ ./tool/twfiddle.py
--- result_type default 15 [466149635199750144, 466149662861168640, 466149665852104704, 466149673611186176, 466149679785603072, 466149708428496896, 466149813398953984, 466149815223463937, 466149816045535232, 466149826116087808, 466149838438924288, 466149854599577600, 466149866763071489, 466149875122335744, 466149882013569024]
--- result_type recent 15 [466149635199750144, 466149662861168640, 466149665852104704, 466149673611186176, 466149679785603072, 466149708428496896, 466149813398953984, 466149815223463937, 466149816045535232, 466149826116087808, 466149838438924288, 466149854599577600, 466149866763071489, 466149875122335744, 466149882013569024]
--- result_type mixed 30 [465578430306607104, 465583786332528641, 465611760469168128, 465615396754579456, 465632106149056512, 465753693883600896, 465833788845862913, 465862724082085889, 465907064439861249, 465921610411954176, 465937488213991424, 465944294894100480, 466059886527139840, 466097774165905408, 466145951116038144, 466149635199750144, 466149662861168640, 466149665852104704, 466149673611186176, 466149679785603072, 466149708428496896, 466149813398953984, 466149815223463937, 466149816045535232, 466149826116087808, 466149838438924288, 466149854599577600, 466149866763071489, 466149875122335744, 466149882013569024]
--- result_type popular 15 [465578430306607104, 465583786332528641, 465611760469168128, 465615396754579456, 465632106149056512, 465753693883600896, 465833788845862913, 465862724082085889, 465907064439861249, 465921610411954176, 465937488213991424, 465944294894100480, 466059886527139840, 466097774165905408, 466145951116038144]
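twfiddle.py isn't included here, but something along these lines against the v1.1 search/tweets endpoint reproduces the comparison - requests_oauthlib and the placeholder credentials are my own choices, not necessarily what the script uses:

import requests
from requests_oauthlib import OAuth1

# Placeholder credentials - substitute real app/user tokens.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

for result_type in ("default", "recent", "mixed", "popular"):
    params = {"q": "kitten"}
    if result_type != "default":
        params["result_type"] = result_type
    statuses = requests.get(SEARCH_URL, params=params, auth=auth).json()["statuses"]
    ids = sorted(int(t["id_str"]) for t in statuses)
    print("--- result_type %s %d %s" % (result_type, len(ids), ids))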
OK, so investigating how since_id, max_id and recent work in disharmony.

In short: when you do since_id with a recent type search (the default), you always get the most recent tweets. So it will leave you a gap - it doesn't start going forward from since_id, it just makes sure it doesn't go back past it.

In contrast, max_id with a recent type search always returns the Tweet you specified in max_id as the last one in the array you get back.
So we need to make the algorithm more.... okazaki!
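Concretely, that means the standard "work backwards through the timeline with max_id" pattern from Twitter's docs. A rough sketch, where search_page is a stand-in for whatever wrapper the tool has around search/tweets (not the actual code on the branch):

def fetch_range(search_page, query, since_id):
    """Page backwards with max_id so nothing between since_id and 'now' is skipped.

    search_page(query, since_id, max_id) is assumed to return one page of tweets
    (a list of dicts with 'id_str'), newest first.
    """
    collected = []
    max_id = None
    while True:
        page = search_page(query, since_id=since_id, max_id=max_id)
        if not page:
            break
        collected.extend(page)
        # Next page: everything strictly older than the oldest tweet just seen.
        max_id = min(int(t["id_str"]) for t in page) - 1
        if max_id <= since_id:
            break
    return collected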
Working on branch https://github.com/scraperwiki/twitter-search-tool/tree/fix-min-max-bug
See pull request https://github.com/scraperwiki/twitter-search-tool/pull/38
It gets the Tweets above/below max/min, and then saves them.
If that save crashes midway through, it could leave a hole in Tweets.
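One way to narrow that window - a sketch of mine, not necessarily what the branch does: save each page with a single sql.save() call (one transaction each, per the test above), and only advance the stored since_id marker once the whole backfill has finished, so a crash means a re-fetch on the next run rather than a silent hole. save_var here is the scraperwiki library's key/value helper:

import scraperwiki

def save_backfill(pages, since_id):
    """pages: iterable of tweet pages (lists of dicts), newest to oldest, however fetched.

    Each page is saved with one sql.save() call, i.e. one transaction, so a crash
    loses at most the page in flight. The since_id marker is only advanced once
    the whole backfill has completed, so an interrupted run gets re-fetched next
    time instead of leaving a gap.
    """
    newest_seen = since_id
    for page in pages:
        scraperwiki.sql.save(['id_str'], page, table_name="tweets")
        newest_seen = max(newest_seen, max(int(t['id_str']) for t in page))
    # Only record the new high-water mark once the whole range is covered.
    scraperwiki.sql.save_var('since_id', newest_seen)
    return newest_seen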