Migrate to GNIP powertrack API v2

matthewberryman commented 7 years ago

http://support.gnip.com/gnip2.0/

matthewberryman commented 7 years ago

Key issue is dealing with backfill behaviour change, and this is a bit more complex than I realised at first.

tomasholderness commented 7 years ago

One thought (and it may be worth asking Gnip): do Twitter IDs always arrive in order, i.e. do they always increase? If so, perhaps we could just store the last ID received; if we later receive an ID greater than this, we know we haven't processed it.

matthewberryman commented 7 years ago

They do AFAIK, but that doesn't solve the problem. E.g. if you store it but then the program terminates before processing, or vice versa, you still have the same issue. IMHO we need to update the schemas so that we always store the ID consistently, and then update the queries so that storing things is one atomic operation in SQL. But then we'd need a more complex query for checking whether we've seen it before.
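
A minimal sketch of the "one atomic operation" idea, using Python's built-in sqlite3 as a stand-in for Postgres. The `tweets` table, its columns, and `store_if_unseen` are hypothetical names for illustration, not taken from the cognicity schema. The check for "have we seen it" and the store happen in a single statement; in Postgres the analogous statement would be `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3


def store_if_unseen(conn, tweet_id, text):
    """Insert the tweet and report whether it was new, in one statement.

    Hypothetical helper: the primary key on tweet_id makes the
    "seen before?" check part of the insert itself.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO tweets (tweet_id, text) VALUES (?, ?)",
            (tweet_id, text),
        )
    # rowcount is 1 when a row was inserted, 0 when the insert was ignored
    return cur.rowcount == 1


# Assumed table layout, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, text TEXT)")

first = store_if_unseen(conn, 1001, "example report")
second = store_if_unseen(conn, 1001, "example report")  # same ID again
row_count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

The trade-off is the one noted above: this needs the tweet ID stored with every row, and every code path that writes tweets would have to use the same query.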

tomasholderness commented 7 years ago

Two thoughts: (1) Our program terminating mid-process was never a use case of PowerTrack v1 backfill anyway, though? If PowerTrack pushed us a tweet but our process died part way through processing, I wouldn't expect to receive the tweet again on reconnect.

(2) Storing tweet IDs is always something we've tried to avoid, hence why I'm hesitant.

matthewberryman commented 7 years ago

(1) That's true. It only helped us catch tweets while the system was down. There was still always the case that we'd received a tweet and then didn't process it. So here, to regain that behaviour, we only need to make sure we don't reprocess tweets we've already processed. I need to restructure things using callbacks, but that's doable here and doesn't involve replicating changes to SQL calls in the non-powertrack code.

(2) I don't see that as the show stopper to doing that; rather, the obstacle is the need to also modify the non-powertrack Twitter code (but see my comment above on (1)).

I'll start on these changes tomorrow my time.

tomasholderness commented 7 years ago

In that case, why don't we just create a persistent store of the last ID processed? We could keep a copy both in memory and postgres? Then a query/function to check incoming IDs against last processed could head up the filter function?
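
Roughly, as a sketch only: the example below uses Python with the built-in sqlite3 module standing in for postgres, and assumes tweet IDs are plain integers that increase over time. `LastSeenStore`, `filter_tweet`, the `last_seen` table, and the `tweet["id"]` field are all hypothetical names, not the actual cognicity code or schema.

```python
import sqlite3


class LastSeenStore:
    """Last processed tweet ID, cached in memory and persisted in the DB."""

    def __init__(self, conn):
        self.conn = conn
        # Single-row table (assumed layout), initialised to 0 on first use
        conn.execute(
            "CREATE TABLE IF NOT EXISTS last_seen ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), "
            "tweet_id INTEGER NOT NULL)"
        )
        conn.execute("INSERT OR IGNORE INTO last_seen (id, tweet_id) VALUES (1, 0)")
        conn.commit()
        row = conn.execute("SELECT tweet_id FROM last_seen WHERE id = 1").fetchone()
        self.last_id = row[0]  # in-memory copy

    def is_new(self, tweet_id):
        return tweet_id > self.last_id

    def mark_processed(self, tweet_id):
        # Only ever move the stored ID forwards
        with self.conn:
            self.conn.execute(
                "UPDATE last_seen SET tweet_id = ? WHERE id = 1 AND tweet_id < ?",
                (tweet_id, tweet_id),
            )
        self.last_id = max(self.last_id, tweet_id)


def filter_tweet(store, tweet, process):
    """Head of the filter: skip anything at or below the last processed ID."""
    if not store.is_new(tweet["id"]):
        return False
    process(tweet)
    store.mark_processed(tweet["id"])  # record only after processing succeeds
    return True


conn = sqlite3.connect(":memory:")
store = LastSeenStore(conn)
processed = []
results = [
    filter_tweet(store, {"id": i}, processed.append) for i in (10, 11, 11, 9, 12)
]
reloaded = LastSeenStore(conn)  # simulates a restart reading the persisted ID
```

Because the ID is recorded only after `process` returns, a crash mid-processing means that tweet is still treated as new on the next run.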

matthewberryman commented 7 years ago

Yes, that makes sense, and makes the logic simpler. I'll proceed with coding that up when I start tomorrow.

matthewberryman commented 7 years ago

Ok, @talltom, I think I have made the changes required.

There are some changes to our private config for the API URLs.
There's a table for storing the last seen ID https://github.com/smart-facility/cognicity-schema/blob/gnip2/schema.sql#L200 and the initialisation of that table in https://github.com/smart-facility/cognicity-schema/blob/gnip2/schema.sql#L201
Since this is specific to cognicity-reports-powertrack, I have made most of the changes there, except as noted above. I also added a gnip2 branch of cognicity-reports just to hold an updated reference to the submodule (this helps with travis, and later when we eventually merge the changes back in); there are no other changes in cognicity-reports.
With some work, it now passes the tests. I modified some to ignore the tweet ID checking for now, but we will want to add that back into some of the tests in future.

Just a note that I deleted the earlier gnip2 branches of cognicity-reports and cognicity-reports-powertrack and started again following our rethink (it was just easier to start from master).

Check the changes in https://github.com/smart-facility/cognicity-reports-powertrack/compare/gnip2

matthewberryman commented 7 years ago

Merged, tested and working.

smart-facility / cognicity-reports-powertrack

Migrate to GNIP powertrack API v2 #27