Gateway Pooling - Githubissues

swelham commented 7 years ago

Gateway pooling will provide the ability have multiple instances of a single payment gateway running under a single supervisor that will be responsible for selecting a worker to carry out a single request.

Original requirements

Change the gateway_supervisor module to start gateway pool supervisors instead of the payment gateways directly
Default config option to specify how many workers each gateway pool should create
Gateway config option to specify the number of workers the pool should create, this will override the default pool workers config
When a gateway returns a non success response, retry the request on another worker within the pool
Default config option to specify the maximum number of attempts that should be made for a failed request within a given gateway pool

aseigo commented 7 years ago

Instead of a pool, one could instead implement a rate-limiting approach where every request implicitly creates a new worker. While poolboy makes pooling really easy and nice, with a pool one still needs to decide what to do when the pool is full (retry after a timeout? fail? ..) and tune pool sizes. With rate-limiting you can just spin up as many workers as currently needed, and forget about the complexities.

Payment gateways to do rate-limiting on their end, and return HTTP 429 status codes when hitting the max. This can be handled easily enough with an exponential back-off in the processing worker that tops out at some point with a failure.

(For in-bound rate limiting: cashier can simply require the user to provide that, preventing the need for assumptions in cashier about the user's design and use cases.)

If really needed in future, job queueing and back-pressure mechanics could be added to the internals to handle load spikes, API limits, etc. I doubt many people will run into real-world situations that could justify that complexity, but at least one can see a path forward there if it is ever needed.

swelham commented 7 years ago

That's a great point and thinking about it a much simpler approach.

There would need to be some form of state somewhere to be shared with the workers when created to prevent unnecessary authentication requests for tokens.

Would you see there being a dedicated supervisor module for each type of gateway where this kind of state can be stored and passed along?

swelham commented 7 years ago

I have been having a think about this and from what I have read GenStage looks like a good fit for this feature. I have yet to develop anything with GenStage so I need to have a play with it to fully wrap my head around how it works.

Alternatively if there are any other suggestions or thoughts it would be good to get them down here so all options can be looked at.

aseigo commented 7 years ago

There would need to be some form of state somewhere to be shared with the workers

An ets table entry should be enough for this, with a named process to fetch on lookup failure. Sth like:

on init, start a supervised process that is used to fetch shared state into the ets table
provide a function that returns the shared state, attempting first to fetch from ets
if it does not exist in ets, then call (blocking) the fetcher proc to get the shared state and put it in ets. This becomes the point of synchronisation., preventing more than one request for populating the shared state from being made

The result should be really fast lookups in the common case, with state that can be invalidated whenever, and data populated through a single ( purposeful bottleneck) process.

 GenStage looks like a good fit for this feature

It would certainly provide an easy path to transparent pooling with back-pressure back off. If you do go for a pooling mechanism, rather than a "spawn a process per request" approach, GenStage should be a rather nice fit. Would make plugging things like logging/stat generators post-transaction really easy as well if desired in future.

The only concern I can see is that if the transaction rate is limited by the time it takes the payment processor to do its job (particularly true if the user is expected to perform any interactions with the payment service), then any pooling system will get back logged easily due to that.

I suppose with a GenStage approach each consumer in the pipeline could handle multiple async requests to alleviate that ... but that kind of complexity usually just moves the problem elsewhere (e.g. bookkeeping overhead).

This would really suck for apps with peak times for purchases, as the precessing could easily bog down right when you don't want it.

IME BEAM processes are really cheap (and.very simple) while providing a great abstraction for "the state of a job". If rate limiting is needed, the gateway service will inform the app with a rate limit error, and then processes can take appropriate action at that point.

So I would expect the whole thing to come down to: where does latency occur most acutely in the system (the app, cashier, the payment gateway service?), and where are bottlenecks most appropriate to ensure the lowest (realistic) latency for payment processing (in the app, in cashier, inthe gateway service?) I don't have enough hands-on experience with payment gateways at higher volumes to know those answers.. :/

swelham commented 7 years ago

I have been experimenting with GenStage as a proof of concept (see the refactor branch - this is a bit broken at the moment) and so far I feel this approach could work out well (but I'm still open minded). I will try and address some of the key points above the best I can in my half asleep state!

An ets table entry should be enough for this, with a named process to fetch on lookup failure

This is also the conclusion that I have come to now and makes perfect sense.

If you do go for a pooling mechanism, rather than a "spawn a process per request" approach, GenStage should be a rather nice fit.

GenStage can give us spawn per request when using a ConsumerSupervisor which I have been using per gateway.

This would really suck for apps with peak times for purchases, as the precessing could easily bog down right when you don't want it.

This one is a big problem and unfortunately (or fortunately I guess) one that I haven't ever had to deal with in my day job yet. However this blog post from discord is an interesting read and addresses the same problem.

So I would expect the whole thing to come down to: where does latency occur most acutely in the system (the app, cashier, the payment gateway service?), and where are bottlenecks most appropriate to ensure the lowest (realistic) latency for payment processing (in the app, in cashier, inthe gateway service?) I don't have enough hands-on experience with payment gateways at higher volumes to know those answers.. :/

I am in the same boat as you here and I will be looking to load test which ever solution we end up with to gauge the limitations of cashier. Coming up with the acceptable benchmarks will be a case of researching other payment libraries and reaching out to those that have implemented this kind of thing in large load environments.

aseigo commented 7 years ago

That blog entry from Discord was really cool; read it when it was published (via .. Reddit? hmm..)

Anyways, it is all a bit of an implementation detail .. and given that as you point out, people are pushing huge volumes through GenStage based code as it is, worst case scenario is a refactor down the road when someone actually has 1000s of simultaneous purchases happening .. at which point I'm sure they'd be able to contribute something back ;)

So, ... after some further pondering today, it became clear to me that I would 100% support you in a GenStage direction, should you decide that route ... cheers!

swelham commented 7 years ago

That's great, thanks!

I should have the initial refactor to GenStage done by the end of the week which will hopefully clarify whether it's the correct approach or not. One cool side effect that I hadn't considered is the ability to start up a custom gateway that cashier isn't aware of and have it easily hook into the existing pipeline - currently using this for testing.

But anyway, time will tell if a further rethink is needed or not.

swelham / cashier

Gateway Pooling #2