Closed: frabcus closed this issue 10 years ago.
The solution is probably to always run twsearch.py under flock? Or perhaps to kill the running process before changing mode. Pretty yucky either way.
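For concreteness, a minimal sketch of the flock idea in Python — the lock path and helper name are assumptions, not the tool's actual code. A blocking exclusive lock at the top of twsearch.py would serialise instances, so the later run's database write always lands last:

```python
import fcntl

def acquire_instance_lock(path="/tmp/twsearch.lock"):
    # Hypothetical lock path; any file writable by the tool would do.
    lock_file = open(path, "w")
    # LOCK_EX blocks until any other twsearch.py instance releases the
    # lock (i.e. exits), serialising runs the way flock(1) would.
    fcntl.flock(lock_file, fcntl.LOCK_EX)
    # Keep a reference so the lock lives as long as the process.
    return lock_file

_lock = acquire_instance_lock()
```

The equivalent on the command line would be wrapping the crontab entry as `flock /tmp/twsearch.lock python twsearch.py ...`.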
Workaround: uncheck and re-check the monitoring checkbox.
Weird, I haven't seen this problem for a while.
I have seen a few intercom reports recently that are consistent with this. Didn't know there was an issue for it.
Another case: https://app.intercom.io/apps/63b0c6d4bb5f0867b6e93b0be9b569fb3a7ab1e3/conversations/381289936
I've added it as a bug to the "missing tweets" Trello card.
This code in twsearch.py always wipes the crontab when the backlog is cleared, but if you've ticked monitor you'll be in mode 'monitoring', so it shouldn't be a problem:
```python
# we've reached as far back as we'll ever get, so we're done forever
if not onetime and mode == 'clearing-backlog':
    mode = 'backlog-cleared'
    os.system("crontab -r >/dev/null 2>&1")
set_status_and_exit("ok-updating", 'ok', '')
```
We think this needs more logging.
See https://github.com/scraperwiki/twitter-search-tool/issues/28
So, the sequence is:

0) User enters a search term, auths with Twitter, then the initial "onetime" run happens fairly quickly. Then the user interface refreshes to the "Monitoring for new tweets matching ..." page.
1) Main background scraper process A starts with mode 'clearing-backlog' in its memory.
2) User clicks on "monitor future tweets", spawning a second onetime scraper B.
3) B finishes fairly quickly, saving the mode 'monitoring' to the database as the user intended.
4) Some while later, A finishes, saving its stale mode 'clearing-backlog' to the database.
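Step 4 is a blind last-writer-wins save. One short-term mitigation would be to re-read the persisted mode before demoting it — a sketch, where `load_mode`/`save_mode` are hypothetical stand-ins for the tool's real database access:

```python
import os

# Hypothetical stand-ins for the tool's real persistence layer.
_db = {'mode': 'clearing-backlog'}

def load_mode():
    return _db['mode']

def save_mode(mode):
    _db['mode'] = mode

def finish_backlog_run():
    # Re-read the persisted mode instead of trusting the value this
    # process loaded into memory at startup (step 1 above).
    if load_mode() == 'clearing-backlog':
        # Nobody switched to 'monitoring' in the meantime: safe to stop.
        save_mode('backlog-cleared')
        os.system("crontab -r >/dev/null 2>&1")
    # Otherwise B already saved 'monitoring' (step 3): leave the mode
    # and the crontab alone, as the user intended.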
I can reproduce this fairly easily in the user interface.
(see comment here https://github.com/scraperwiki/twitter-search-tool/issues/27#issuecomment-42317401)
@drj11 - I've reproduced this clearly now, with an explanation for why it happens. Can you a) propose a short-term fix, and b) suggest how we should write tools in future to get rid of these problems / make the code simpler?
@pwaller should we be using something like jobly (https://github.com/scraperwiki/cobalt/pull/44) for this?
Am I understanding correctly: it's a straight race between the "initial" process and the subsequent updates?
I'm still not yet sure what the ideal setup looks like and whether it involves jobly or not. We should figure out exactly what we need here first.
Would it help if all tasks "of a given name" were serialised server-side, regardless of whether they were user or cron initiated?
Yes.
The initial process, and one that the user is using to change the state.
Another idea that might help:
I wonder if we should have some sort of exclusion as a core feature. "Only one of a given named thing allowed to run at once", for example. It might be a way of avoiding a lot of hard-to-find races.
I like @pwaller's idea of building exclusion into the API. Maybe instead of POSTing to /exec you could POST to /exec/q1, and it would fail if something was already running "at" the /exec/q1 endpoint.
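As a sketch of what that could look like — the class, handler, and 409 response below are illustrative assumptions, not the real /exec implementation:

```python
import threading

def run_task(name):
    pass  # stand-in for whatever /exec currently runs

class NamedExclusion:
    """Allow at most one running task per name."""
    def __init__(self):
        self._lock = threading.Lock()
        self._running = set()

    def try_start(self, name):
        with self._lock:
            if name in self._running:
                return False
            self._running.add(name)
            return True

    def finish(self, name):
        with self._lock:
            self._running.discard(name)

exclusion = NamedExclusion()

def handle_exec(name):
    # Hypothetical handler for POST /exec/<name>.
    if not exclusion.try_start(name):
        return 409, "something is already running at /exec/%s" % name
    try:
        run_task(name)
    finally:
        exclusion.finish(name)
    return 200, "ok"
```

The nice property is that callers need no locking of their own: a second POST to /exec/q1 simply fails while the first is still in flight.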
I think this is a concurrency bug.
Basically, this command is running in the background after the Twitter auth happens:
The user checks the box to also monitor future tweets, and it spawns this command.
The second one sets a crontab and exits before the first has finished. Then, when the first finishes, it clears the crontab.
So we are left with no crontab.