Progress update
We have a rough consensus for using DynamoDB as the store for the tweet data. Redis is too ephemeral. Elasticsearch is an option we can look into if DynamoDB has too many issues.
In terms of ingesting the Twitter data, I'm thinking that we can use the filtered realtime tweet stream API with filters for the geographic areas where the team members live. This will give us some initial data to work with, though it will only include public tweets that have been geotagged. One outstanding question is how much relevant data this will produce.
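For reference, a minimal sketch of that kind of geographic filter using tweepy's streaming API (3.x-style listener). The credentials are placeholders and the bounding box is roughly San Francisco:

```python
import tweepy

# Placeholder credentials from the Twitter developer account.
CONSUMER_KEY, CONSUMER_SECRET = "...", "..."
ACCESS_TOKEN, ACCESS_SECRET = "...", "..."

class GeoListener(tweepy.StreamListener):
    def on_status(self, status):
        # Only geotagged public tweets arrive when filtering by location.
        print(status.id_str, status.coordinates, status.text)

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

stream = tweepy.Stream(auth, GeoListener())
# Bounding box as [SW lon, SW lat, NE lon, NE lat], here roughly San Francisco.
stream.filter(locations=[-122.75, 36.8, -121.75, 37.8])
```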
Rather than build the ingestion code into Wocky, we're thinking that it would be better to use a Lambda function. The Twitter libraries for Lambda-friendly languages like Python and JavaScript may be better than the Elixir libraries, and there is some value in separating the ingestion from the normal Wocky server.
Actually, this may be an opportunity to take advantage of some of AWS's more exotic services. It might make sense to use Kinesis as the first step in the ingestion process.
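If we go the Kinesis route, the producer side stays tiny: the ingestion code just puts each raw tweet onto a stream and lets downstream consumers worry about storage. A rough boto3 sketch, where the "tweets" stream name is just a placeholder:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def publish_tweet(tweet):
    """Push one raw tweet (as a dict) onto the ingestion stream."""
    kinesis.put_record(
        StreamName="tweets",                    # hypothetical stream name
        Data=json.dumps(tweet).encode("utf-8"),
        PartitionKey=tweet["id_str"],           # shard by tweet id
    )
```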
I have some thoughts to add regarding how we determine relevance for a tweet.
I see the relevance calculation taking place in two steps: first calculating a global relevance score during ingestion, then calculating a local relevance score before displaying the tweet to the user.
During ingestion, we could calculate a global relevance score based on whether the tweet was from a verified account, how many times it was retweeted, whether it was trending, etc. The score should be stored with the tweet.
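As a strawman, the ingestion-time score could be a simple weighted sum over those signals. The weights in the sketch below are made up; it only illustrates the shape of the calculation:

```python
def global_relevance(tweet, trending_terms):
    """Illustrative ingestion-time score; all weights are placeholders."""
    score = 0.0
    if tweet["user"].get("verified"):
        score += 2.0
    score += min(tweet.get("retweet_count", 0), 100) / 10.0  # cap runaway retweet counts
    if any(term in tweet["text"].lower() for term in trending_terms):
        score += 1.0
    return score
```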
Then, when it is time to display the tweet, we would rank the tweets by score and pull the highest rated. Then we would apply the local relevance criteria, such as whether the tweet was authored by someone the user follows, was retweeted by someone the user follows, or was authored by someone the user has blocked. The local relevance calculation would require the user to link their Twitter account to us, as it is against the Twitter TOS for us to use any information about the user's Twitter account unless they have explicitly linked the account and given us permission.
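At display time, something like the following could re-rank the stored tweets for a specific user. The field names (`global_score`, `author_id`, `retweeted_by`) are placeholders for whatever we end up persisting, and `following`/`blocked` would come from the user's linked Twitter account:

```python
def local_rank(tweets, following, blocked, limit=20):
    """Re-rank globally scored tweets using the user's linked Twitter account.

    `tweets` is already sorted by stored global score; `following` and
    `blocked` are sets of Twitter user ids. Weights are purely illustrative.
    """
    ranked = []
    for tweet in tweets:
        if tweet["author_id"] in blocked:
            continue                      # drop blocked authors entirely
        score = tweet["global_score"]
        if tweet["author_id"] in following:
            score += 3.0                  # authored by someone the user follows
        if following & set(tweet.get("retweeted_by", [])):
            score += 1.0                  # retweeted by someone the user follows
        ranked.append((score, tweet))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [tweet for _, tweet in ranked[:limit]]
```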
This is very rough and hand-wavey. There are a lot of details to work out, but this gives us some kind of starting point to begin the discussion of determining relevance.
Progress update
I applied for an organizational Twitter development account today. It was approved at the end of the day, so I can start working on the Twitter integration on Monday.
There is an AWS reference architecture for doing exactly what we want to do: https://github.com/aws-samples/lambda-refarch-streamprocessing
My plan is to use that as a starting point and begin feeding tweets into Dynamo via Kinesis and Lambda on Monday.
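For discussion, the consumer half of that pipeline would look roughly like the sketch below. The table name and attributes are placeholders of my own, not taken verbatim from the reference architecture:

```python
import base64
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("tweets")        # hypothetical table name

def handler(event, context):
    """Lambda entry point: decode Kinesis records and store them in DynamoDB."""
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        tweet = json.loads(payload)
        table.put_item(Item={
            "tweet_id": tweet["id_str"],
            "created_at": tweet["created_at"],
            "text": tweet["text"],
            "coordinates": json.dumps(tweet.get("coordinates")),
        })
```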
Twitter has very specific requirements for how tweets are displayed: https://developer.twitter.com/en/developer-terms/display-requirements.html.
We can emulate the layout that they prefer, or we can use an embedded web view to render the tweets. The latter option might be ideal. In that case, the server can pass a list of relevant tweet ids to the client, and the client can use the embedded web view to render them, cutting down significantly on the amount of data the server has to send to the client.
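If we ever need the approved markup directly (rather than letting a web view load the tweet page), Twitter's oEmbed endpoint returns the official blockquote HTML for a tweet URL. A quick sketch, assuming the publish.twitter.com/oembed endpoint; worth verifying against the display requirements doc:

```python
import requests

def tweet_embed_html(tweet_url):
    """Fetch Twitter's official embed markup for a tweet via the oEmbed API."""
    resp = requests.get(
        "https://publish.twitter.com/oembed",
        params={"url": tweet_url},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["html"]   # blockquote plus the widgets.js script tag
```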
I have tweet ingestion via Kinesis to DynamoDB working, and it is quite nice. I used the reference architecture mentioned above, and I was able to tweak it easily so that it pulled geotagged public tweets from San Francisco in real time and loaded them into DDB.
Of course, most of them were noise. The next big step, and the really hard one, is to determine relevance.
At this point, we are moving away from Twitter as the primary data source. There is still valuable information there, but sifting the signal from the noise is a big task, and perhaps more than we want to tackle just yet.
Instead, we're shifting focus to more formal sources of information. Things like official weather bulletins and government-published disaster warnings. Additionally, we should think about allowing users to submit relevant information to the app similar to how Waze handles traffic conditions.
This means that the current research priority needs to be finding these formal, or official, information sources and figuring out how to ingest them into the app.
The National Weather Service (NWS) has a feed of issued alerts available here. This seems like a good place to start for a prototype.
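To make that concrete, pulling active alerts for a user's location might look like the sketch below. I'm assuming the api.weather.gov/alerts/active endpoint and its point parameter here; we should verify against whichever feed we end up using:

```python
import requests

def active_alerts(lat, lon):
    """Fetch active NWS alerts for a coordinate and return their properties."""
    resp = requests.get(
        "https://api.weather.gov/alerts/active",
        params={"point": f"{lat},{lon}"},
        headers={"User-Agent": "tinyrobot-prototype"},  # NWS asks callers to identify themselves
        timeout=10,
    )
    resp.raise_for_status()
    return [feature["properties"] for feature in resp.json()["features"]]

# e.g. active_alerts(37.77, -122.42) -> list of alert properties
# (event, headline, severity, effective/expires, area description, ...)
```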
The concept has narrowed and evolved significantly since I opened this ticket. I'm going to close this one and open a new one to track the development of the weather alert feature.
We want to create a new feature that shows our users local, relevant news in the TinyRobot app. This is still at the "big idea" stage and the requirements are rough.
Our immediate goal is a prototype that will prove out the concept. There are three major areas that need to be proved:
1. Ingesting relevant data from Twitter.
2. Storing that data in a location-aware way so we can query it by area.
3. Displaying the relevant local news to users in the app.
Area 1 will require learning the Twitter API and what it can provide (docs).
While we may decide to use another language to ingest the data, I'm going to assume Elixir for now. I found three Twitter clients for Elixir:
Area 2 will require a location-aware data source. We have three options available via AWS: ElastiCache (Redis), DynamoDB, and Elasticsearch. We already use Redis extensively, but given the volume of data, it may make more sense to look into DynamoDB or Elasticsearch.
Two concepts that keep coming up in my research of this area are GeoJSON and Geohashing.
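To make the geohashing idea concrete, below is a minimal encoder for the standard geohash algorithm, not tied to any library or store. The useful property is that nearby points share a prefix, so a geohash prefix could serve as a partition key or query prefix in whichever store we pick:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, precision=7):
    """Encode a lat/lon pair as a geohash string of the given length."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, even = [], 0, 0, True   # even-numbered bits encode longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if value >= mid:
            bits |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:                          # 5 bits per base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

# geohash(37.7749, -122.4194) -> "9q8y..." ; nearby points share a prefix.
```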
Area 3 is perhaps the easiest. For the prototype, we can create ephemeral bots that contain the news and inject them into the results of the local bots API call.