numo-labs / aws-lambda-helper

:lollipop: Collection of helper methods for lambda
GNU General Public License v3.0

Writing Package Results to S3 costs a Fortune! (we need to re-think it) #73

Open nelsonic opened 8 years ago

nelsonic commented 8 years ago

Yesterday we did a quick calculation of how much writing directly to S3 would cost at "Peak" load. Our calculations used the following factors:

- ~300 search requests per second at peak
- ~1,000 package results written per request (so ~300,000 S3 writes per second)
- S3 PUT pricing of $0.005 per 1,000 requests
- a 12-hour (43,200-second) "Peak" window per day

So, during "Peak" load we expect to be writing 300,000 packages per second. At $0.005 per 1,000 writes, that is $1.50 per second on S3 for writes alone. 😮 Over the peak window this works out to (300,000 writes-per-second x 43,200 seconds)/1000 x $0.005 = _$64,800 Per Day!_
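
As a sanity check, the same arithmetic expressed in code (the write volume, PUT price and 12-hour peak window are the assumptions listed above, not measured figures):

```js
// Back-of-envelope cost of writing every package straight to S3.
const WRITES_PER_SECOND = 300000;   // ~300 requests/s x ~1,000 packages per request
const PEAK_SECONDS = 12 * 60 * 60;  // 43,200 seconds of "Peak" load per day
const PUT_COST_PER_1000 = 0.005;    // USD per 1,000 S3 PUT requests

const costPerSecond = (WRITES_PER_SECOND / 1000) * PUT_COST_PER_1000; // $1.50
const costPerDay = costPerSecond * PEAK_SECONDS;                      // $64,800

console.log({ costPerSecond, costPerDay }); // { costPerSecond: 1.5, costPerDay: 64800 }
```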

To get around this ridiculous cost, we suggest writing the records into Redis (ElastiCache) until the slowest provider has returned its results, then batch-writing all the packages to S3 as a single file. This will bring the cost down to:

(300 requests-per-second x 1 (ONE) write to S3 x 43,200 seconds)/1000 x $0.005 = _$64.80 Per Day!_ (a much more acceptable figure...)
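
A minimal sketch of what the buffer-in-Redis-then-batch-write approach could look like (the key/bucket names and the point at which we decide the slowest provider has returned are illustrative only, not the actual implementation):

```js
// Sketch: buffer package results in Redis (ElastiCache) per search, then flush
// them to S3 as ONE object once all providers have responded.
const AWS = require('aws-sdk');
const redis = require('redis');

const s3 = new AWS.S3();
const cache = redis.createClient({ host: process.env.ELASTICACHE_HOST });

// Called once per provider response: append the packages to a Redis list.
function bufferPackages (searchId, packages, callback) {
  cache.rpush(`results:${searchId}`, JSON.stringify(packages), callback);
}

// Called when the slowest provider has returned: write a single object to S3.
function flushToS3 (searchId, callback) {
  cache.lrange(`results:${searchId}`, 0, -1, (err, chunks) => {
    if (err) return callback(err);
    const body = JSON.stringify(chunks.map((chunk) => JSON.parse(chunk)));
    s3.putObject({
      Bucket: process.env.RESULTS_BUCKET,
      Key: `searches/${searchId}.json`,
      Body: body
    }, callback);
  });
}
```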

See: https://tc-jira.atlassian.net/browse/ISEARCH-282

nelsonic commented 8 years ago

@Kumjami suggested writing records directly into ELK (Elasticsearch). This would be possible: http://stackoverflow.com/questions/27388521/elasticsearch-high-indexing-throughput and potentially much cheaper... Requires further investigation.
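
If we explore the ELK route, the high throughput described in that Stack Overflow thread comes from using the bulk API rather than indexing one document per request. A rough sketch with the official `elasticsearch` Node client (index/type names are placeholders):

```js
// Sketch: bulk-index a batch of package results instead of one request per document.
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: process.env.ES_HOST });

function bulkIndexPackages (packages, callback) {
  // The bulk API expects alternating action/document entries.
  const body = [];
  packages.forEach((pkg) => {
    body.push({ index: { _index: 'packages', _type: 'package', _id: pkg.id } });
    body.push(pkg);
  });
  client.bulk({ body }, callback);
}
```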

jruts commented 8 years ago

I do not think we need S3 yet. One of the key points of Federated Search is that we will call the AtCore and MC services directly, so that we have near-real-time price & availability.

A real database where we can perform real database operations (like sorting) might be more beneficial. (Suggested by @lennym.)

I know that the plan for storing the data in S3 is to do some analytics later on. I don't think we need to worry about that yet, since we won't be doing analytics for the Prototype, the Pilot, or the first stages of Production. So rather than dragging along this extra complexity, I would leave it alone until we have real requirements for it.

nelsonic commented 8 years ago

@lennym (reluctantly) suggested MongoDB as the place to store docs. I think we need to gather the requirements and understand what our use-case is, then draw up a short-list of the available options. I'd like to put _RethinkDB_ on the list to be considered, simply because streaming is built-in, so we could simplify the backend SDK... see: https://www.rethinkdb.com/faq/ & https://www.rethinkdb.com/docs/comparison-tables/
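
For context, the "streaming is built-in" point refers to RethinkDB changefeeds; a minimal sketch with the `rethinkdb` Node driver (table name is illustrative):

```js
// Sketch: subscribe to writes on a table via a RethinkDB changefeed, so the
// websocket layer could push new packages to clients as they are stored.
const r = require('rethinkdb');

r.connect({ host: 'localhost', port: 28015 }, (err, conn) => {
  if (err) throw err;
  r.table('packages').changes().run(conn, (err, cursor) => {
    if (err) throw err;
    cursor.each((err, change) => {
      if (err) throw err;
      console.log('new/updated package:', change.new_val);
    });
  });
});
```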

lennym commented 8 years ago

> I'd like to put RethinkDB on the list to be considered simply because streaming is built-in

This is a really good point. If the persistent data store also lets the websocket service subscribe to writes, that would be a massive advantage, as we'd only be using one data store/event bus.

lennym commented 8 years ago

> I do not think we need S3 yet.

We're going to need some kind of data store soon to support the subqueries described at https://github.com/numo-labs/sdk/blob/master/notes/api.md#subquery-methods. While this doesn't have to be S3 (and probably shouldn't be, based on the requirements we have for it: filtering, sorting, etc.), it needs to exist in some form.