scraperwiki / twitter-search-tool

ScraperWiki tool to get Tweets matching a search term; tool now defunct, though the code is here for reference.
https://blog.scraperwiki.com/2014/08/the-story-of-getting-twitter-data-and-its-missing-middle/

"Also monitor for all future tweets" sometimes doesn't work #19

Closed. frabcus closed this issue 10 years ago.

frabcus commented 10 years ago

I think this is a concurrency bug.

Basically, this command is running in the background after the Twitter auth happens:

    ONETIME=1 tool/twsearch.py

The user checks the box to also monitor future tweets, and it spawns this command.

    MODE=monitoring ONETIME=1 tool/twsearch.py ...

The second one sets a crontab and exits before the first has finished. Then the first, when it finishes, clears the crontab.

So we are left with no crontab.

frabcus commented 10 years ago

Solution is probably to always run twsearch.py under flock? Or perhaps to kill the running process before changing mode. Pretty yucky either way.
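For reference, a minimal sketch of the flock approach, assuming an exclusive lock file taken at the top of twsearch.py (the lock path and setup here are hypothetical, not current tool code):

    # Hypothetical serialisation for twsearch.py: take an exclusive lock
    # before doing anything, so two runs can never interleave.
    import fcntl

    LOCK_PATH = "/tmp/twsearch.lock"  # made-up location

    lock_file = open(LOCK_PATH, "w")
    # Blocks until any other running twsearch.py finishes.
    fcntl.flock(lock_file, fcntl.LOCK_EX)
    # ... rest of twsearch.py runs knowing it is the only instance ...

With a blocking lock the user-spawned run queues behind the backlog run, so its write of the mode lands last, which is what the user intended; adding LOCK_NB instead would make the second run fail fast.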

frabcus commented 10 years ago

Workaround: uncheck and re-check the monitoring checkbox.

frabcus commented 10 years ago

Weird, I haven't seen this problem for a while.

drj11 commented 10 years ago

I have seen a few Intercom reports recently that are consistent with this. I didn't know there was an issue for it.

frabcus commented 10 years ago

Another case: https://app.intercom.io/apps/63b0c6d4bb5f0867b6e93b0be9b569fb3a7ab1e3/conversations/381289936

I've added it as a bug to the "missing tweets" Trello card.

frabcus commented 10 years ago

This code in twsearch.py always wipes the crontab when the backlog is cleared, but if you've ticked monitor you'll be in mode 'monitoring', so it shouldn't be a problem.

    # we've reached as far back as we'll ever get, so we're done forever
    if not onetime and mode == 'clearing-backlog':
        mode = 'backlog-cleared'
        os.system("crontab -r >/dev/null 2>&1")
        set_status_and_exit("ok-updating", 'ok', '')

frabcus commented 10 years ago

We think this needs more logging.

See https://github.com/scraperwiki/twitter-search-tool/issues/28

frabcus commented 10 years ago

So, the sequence is:

0) User enters search term, auths with Twitter, then the initial "onetime" run happens fairly quickly. Then the user interface refreshes to the "Monitoring for new tweets matching ..." page.
1) Main background scraper process A starts with mode 'clearing-backlog' in its memory.
2) User clicks on "monitor future tweets", spawning a second onetime scraper B.
3) B finishes fairly quickly, saving the mode 'monitoring' to the database as the user intended.
4) Some while later, A finishes, saving the mode 'clearing-backlog' to the database.

I can reproduce this fairly easily in the user interface.

(see comment here https://github.com/scraperwiki/twitter-search-tool/issues/27#issuecomment-42317401)
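To make the last-write-wins ordering in the sequence above concrete, here is a small runnable illustration, with threads standing in for the two scraper processes and a dict standing in for the database (all names here are made up):

    # Two writers race; the slower one clobbers the faster one's update.
    import threading
    import time

    db = {"mode": "clearing-backlog"}  # stands in for the persisted mode

    def scraper_a():
        mode = db["mode"]          # step 1: A holds 'clearing-backlog' in memory
        time.sleep(0.2)            # step 4 happens "some while later"
        db["mode"] = mode          # stale write clobbers B's update

    def scraper_b():
        time.sleep(0.1)            # step 2: user spawns B while A is running
        db["mode"] = "monitoring"  # step 3: B saves what the user intended

    a = threading.Thread(target=scraper_a)
    b = threading.Thread(target=scraper_b)
    a.start()
    b.start()
    a.join()
    b.join()
    print(db["mode"])  # 'clearing-backlog' - the user's choice is lost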

frabcus commented 10 years ago

@drj11 - I've reproduced this clearly now, with an explanation for why it happens. Can you (a) propose a short-term fix, and (b) suggest how we should write tools in future to get rid of these problems / make the code simpler?

@pwaller should we be using something like jobly (https://github.com/scraperwiki/cobalt/pull/44) for this?

pwaller commented 10 years ago

Am I understanding correctly: it's a straight race between the "initial" process and the subsequent updates?

I'm still not sure what the ideal setup looks like or whether it involves jobly. We should figure out exactly what we need here first.

Would it help if all tasks "of a given name" were serialised server-side, regardless of whether they were user or cron initiated?

frabcus commented 10 years ago

Yes.

It's a race between the initial process and the one that the user spawns to change the state.

Another idea that might help:

Francis

pwaller commented 10 years ago

I wonder if we should have some sort of exclusion as a core feature. "Only one of a given named thing allowed to run at once", for example. It might be a way of avoiding a lot of hard-to-find races.
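As a sketch of what that could look like, assuming per-name lock files under a made-up directory (none of this is existing platform code):

    # "Only one of a given named thing allowed to run at once", via a
    # lock file per name. Directory and helper names are hypothetical.
    import fcntl
    import os

    LOCK_DIR = "/tmp/named-locks"

    def run_exclusively(name, task):
        os.makedirs(LOCK_DIR, exist_ok=True)
        lock_file = open(os.path.join(LOCK_DIR, name), "w")
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except IOError:
            raise RuntimeError("%r is already running" % name)
        try:
            return task()
        finally:
            lock_file.close()  # closing the file releases the lock

Whether a clashing second run should fail (as here) or queue behind the first is the same blocking-vs-LOCK_NB choice as with flock above.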

drj11 commented 10 years ago

I like @pwaller's idea of building exclusion into the API. Maybe instead of POSTing to /exec you could POST to /exec/q1, and it would fail if something was already running "at" the /exec/q1 endpoint.
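A toy sketch of that, assuming a Flask-style service; the route and start_job are invented for illustration and are not the real exec API:

    # POST /exec/<name> fails with 409 if that name is already running.
    import time
    from flask import Flask, abort

    app = Flask(__name__)
    running = set()  # names with a job currently executing

    def start_job(name):
        time.sleep(5)  # stand-in for the real work
        return "done: %s\n" % name

    @app.route("/exec/<name>", methods=["POST"])
    def exec_named(name):
        if name in running:
            abort(409)  # something is already running "at" this endpoint
        running.add(name)
        try:
            return start_job(name)  # job occupies the request in this toy
        finally:
            running.discard(name)

With a threaded server the membership check would need a lock to be truly atomic, but this shows the shape of the failure mode: the second submission is rejected rather than silently racing the first.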