mozilla / splice

Firefox Tiles Ingestion, Validation and Reporting tool
Mozilla Public License 2.0

Support multiple concurrent directory tiles (scheduling) #108

Closed: tspurway closed this issue 8 years ago

tspurway commented 9 years ago

Bugzilla

The implementation of this is entirely server-side, and will involve changes to both Splice and Onyx.

Splice

When splice creates the distribution files, it will need a way of supporting multiple distributions for a single GEO/LOCALE. In addition, we will need to store the locations of these multiple tile sets in the 'index' file that is persisted into S3.

Onyx

Onyx will need to read the new index format that has multiple possible distributions, and to choose one at random when a Firefox user in a matching GEO/LOCALE performs a daily 'fetch' operation for their tiles.
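For concreteness, here is a minimal sketch of what such an index could look like and how onyx might pick one distribution at random per fetch. The key format, file names, and URLs are invented for illustration and are not the actual splice index schema.

```python
import random

# Hypothetical index format: each GEO/LOCALE key maps to a *list* of
# distribution locations instead of a single one. Keys and URLs are invented.
tile_index = {
    "US/en-US": [
        "https://tiles.example.net/desktop/US/en-US.aaa111.json",
        "https://tiles.example.net/desktop/US/en-US.bbb222.json",
    ],
    "CA/en-CA": [
        "https://tiles.example.net/desktop/CA/en-CA.aaa111.json",
    ],
}

def pick_distribution(geo_locale):
    """Return one distribution URL at random for a matching GEO/LOCALE,
    or None if there is nothing for that combination."""
    candidates = tile_index.get(geo_locale, [])
    return random.choice(candidates) if candidates else None
```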

oyiptong commented 9 years ago

Unsure if this is a worthwhile feature. We can get this implemented client-side once remote-newtab pages land.

The cost of this implementation needs to be estimated; it will have a short shelf-life.

oyiptong commented 9 years ago

How much overlap is there between doing client-side and server-side decisioning? In theory, the ingestion aspects are the same.

Client-Side Decisioning

In the client-side decisioning scenario, we'll want a new API endpoint with minimal changes to the way onyx currently works. It would merely serve a new payload format with multiple possible choices for sponsored tiles.
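As a rough illustration of that payload (the field names are invented here, not the real onyx/splice schema), the new endpoint could return several candidate tiles per slot and leave the choice to the remote new tab page:

```python
# Illustrative only: these field names are not the actual splice/onyx schema.
client_side_payload = {
    "country": "US",
    "locale": "en-US",
    "directory": [
        {
            "slot": 0,
            # the remote new tab page picks one of these choices per display
            "choices": [
                {"title": "Sponsor A", "url": "https://a.example.com/"},
                {"title": "Sponsor B", "url": "https://b.example.com/"},
            ],
        },
    ],
}
```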

Server-Side Decisioning

In the server-side scenario, onyx will need to do more work.

There are actually two scenarios:

  1. multiple distributions are generated by splice, and onyx just picks one at random and serves it
  2. only one distribution is generated, and for versions of Firefox that can't do client-side decisioning, onyx generates a distribution on the fly

Given that this is a temporary fix, option 2 seems like an acceptable computational cost. It also minimizes work on the splice side going forward, since multiple-distribution generation will become obsolete quite quickly.

Since we have auto-scaling groups (ASGs), the increased computational cost from JSON serialization/deserialization won't have much operational impact, and it's only a temporary cost.
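To make option 2 concrete, here is a rough sketch of what the legacy path in onyx could do, reusing the hypothetical multi-choice payload format sketched above; none of these names come from the actual code.

```python
import json
import random

def collapse_for_legacy_client(multi_choice_json):
    """Turn a multi-choice distribution into a single-choice payload on the fly,
    for Firefox versions that cannot do client-side decisioning.
    Sketch only: the real distribution format may differ."""
    dist = json.loads(multi_choice_json)
    directory = []
    for slot in dist.get("directory", []):
        choices = slot.get("choices", [])
        if choices:
            # server-side decisioning: pick one candidate per slot at request time
            directory.append(random.choice(choices))
    dist["directory"] = directory
    return json.dumps(dist)
```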

ncloudioj commented 9 years ago

I agree with @oyiptong that we should implement this feature through the upcoming RNTP. It only needs minimal changes to the current back-end applications.

One major downside of server-side decisioning is that it'll be rather annoying to implement and deploy features like weighted randomized serving or any other non-trivial strategies, since that would depend on two moving parts (Splice and Onyx).

I'd recommend that we let Splice ship all the sponsored tiles to Firefox, and let the client decide which one to serve each time, based on the logic inside the RNTP.

tspurway commented 9 years ago

So we can use both server-side and client-side decisioning.

The first use will be to hedge against the possibility that client-side decisioning doesn't land in Firefox. Without decisioning, Splice2 is dead in the water and cannot work.

Once client-side decisioning has landed, we can still use the server side to:

So I am arguing that we do both, but that we do the server side first.

oyiptong commented 9 years ago

I think it is a reasonable hedging strategy to do both, either in case RNT doesn't land or in case it lands later than anticipated.

However, I disagree with the implementation method in the original post. I think it makes more sense to have only one distribution and to make onyx do more work server-side, even if it means it will need to serialize/deserialize JSON.

That way, there will be less temporary work and fewer moving parts. Server-side decisioning will remain for the v1/v2/v3 API endpoints; the v4+ onyx endpoints will only do client-side decisioning. The bulk of the traffic will naturally shift to v4+.

The total infrastructure cost will be much smaller than the human cost of implementing something only to scrap it shortly afterwards and re-implement it.

tspurway commented 9 years ago

@oyiptong I thought I was being vague about the actual implementation of this ;-) We are obviously still open to suggestions on how to implement it, but here are my thoughts. I worry about Onyx 'doing too much', as it is the front line, and if it gets bogged down in tile-twiddling, we could introduce nasty endpoint delays and possible busy waits. I also worry about caching and cache invalidation, and how to support that simply. I know brute force doesn't scale, but it's a starting point.

tspurway commented 9 years ago

The one thing we can easily run into is a thundering herd during lazy cache loads/misses if we have a synchronous fetch, so I really think a 'zero wait' approach to fetches needs to be the way this is implemented. The client will always get an immediate answer, even if it's an error.

One solution is to have a default tileset that is returned for any geo/locale combination. We could use the current Splice infrastructure to provide this default, then use a message queue and a separate queue processor to lazily load the tileset into S3. This is complicated if we stay with S3-based tilesets, as Onyx would need to check whether the tileset already exists in S3 before dropping a message in the queue. It also presents a thundering-herd possibility on input to the queue, which I guess could be solved with an in-memory check of outstanding requests and de-duplication of messages in the processor.
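A minimal sketch of that de-duplication idea, purely hypothetical (nothing here exists in onyx today): track which GEO/LOCALE combinations already have a generation request outstanding and only enqueue the first miss; everyone else keeps getting the default tileset.

```python
import threading

_pending = set()          # GEO/LOCALE combos with a generation request in flight
_pending_lock = threading.Lock()

def maybe_enqueue_generation(geo_locale, enqueue):
    """Enqueue at most one tileset-generation message per GEO/LOCALE.
    `enqueue` is whatever pushes the message onto the queue (SQS, etc.)."""
    with _pending_lock:
        if geo_locale in _pending:
            return False  # already requested; keep serving the default tileset
        _pending.add(geo_locale)
    enqueue(geo_locale)
    return True

def generation_done(geo_locale):
    """Called once the queue processor has written the tileset to S3."""
    with _pending_lock:
        _pending.discard(geo_locale)
```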

I am still kind of leaning towards a brute-force approach: load all possible combos into S3.

tspurway commented 9 years ago

We also have an issue where, if multiple tiles have the same TLD+1 in their target_url, only one of the tiles ever gets displayed (https://bugzilla.mozilla.org/show_bug.cgi?id=1207349).

We can fix this in the short term by using the same decisioning we are using here for multiple simultaneous campaigns: we would treat tiles that share a TLD+1 as belonging to different groups, and would create separate distributions based on this criterion.

Ideally, we could add a weight to each tile to indicate that it should take a higher or lower percentage of impressions (fetches, actually).
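One way to get that per-fetch effect, sketched here with invented tile data: group tiles by their TLD+1 (tldextract is used only to get the registered domain) and pick at most one tile per group, biased by an optional weight field.

```python
import random
from collections import defaultdict

import tldextract  # third-party helper, used here only to extract the TLD+1

# Invented example tiles; `weight` is the hypothetical field described above.
tiles = [
    {"title": "Sponsor A (home)", "target_url": "https://www.a-sponsor.com/", "weight": 3},
    {"title": "Sponsor A (store)", "target_url": "https://store.a-sponsor.com/", "weight": 1},
    {"title": "Sponsor B", "target_url": "https://b-sponsor.org/", "weight": 1},
]

def pick_one_per_tld(tiles):
    """Group tiles by TLD+1 and pick one tile per group, weighted by `weight`."""
    groups = defaultdict(list)
    for tile in tiles:
        key = tldextract.extract(tile["target_url"]).registered_domain
        groups[key].append(tile)
    picked = []
    for group in groups.values():
        weights = [t.get("weight", 1) for t in group]
        picked.append(random.choices(group, weights=weights, k=1)[0])
    return picked
```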

mostlygeek commented 9 years ago

> The one thing we can easily run into is a thundering herd during lazy cache loads/misses if we have a synchronous fetch, so I really think a 'zero wait' approach to fetches needs to be the way this is implemented. The client will always get an immediate answer, even if it's an error.

  1. Fetches fail silently on the client side, so we don't have to worry about latency so much. We should make sure our error rates stay low; right now fewer than 1 in a million requests fail.
  2. The thundering herd issue can be mitigated with nginx micro-caching and asynchronous cache updates (see the sketch below). The source (splice?) would then only need to handle a few requests per second.
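For reference, a minimal nginx micro-caching sketch of the kind of setup item 2 describes; the paths, upstream name, endpoint, and cache sizes are illustrative, not our actual config.

```nginx
# Sketch only: goes in the http context; names and values are illustrative.
proxy_cache_path /var/cache/nginx/tiles levels=1:2 keys_zone=tiles:10m max_size=1g;

upstream onyx_upstream {
    server 127.0.0.1:8080;               # hypothetical onyx backend
}

server {
    listen 80;
    location /links/fetch/ {              # hypothetical fetch endpoint path
        proxy_pass http://onyx_upstream;
        proxy_cache tiles;
        proxy_cache_valid 200 1s;         # micro-cache: 1 second TTL
        proxy_cache_use_stale updating error timeout;
        proxy_cache_background_update on; # refresh the cache asynchronously
        proxy_cache_lock on;              # collapse concurrent misses into one upstream request
    }
}
```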

I'm +1 on moving away from S3/CloudFront towards a fully dynamic system with nginx caching.