Implement a recovery mechanism for failed http requests

gzhytar commented 7 years ago

krakenex: 2.0.0rc2

What are you trying to achieve?

During peak times large amount of http requests are failing without a specific response from Kraken server. These are intermittent problems with CloudFlare/Kraken and for some of returned http codes it should be safe to retry an operation. As each request has a unique nonce which is checked on a server, it should prevent execution of several identical requests if they get to a server despite a failed status_code.

Max number of retries should be configurable

EDITed by @veox:

Pull requests with the suggested change: ~~#65~~ #99 #100

veox commented 7 years ago

I find the idea reasonable; however, look at https://github.com/veox/python3-krakenex/pull/65#issuecomment-343757007 for some remarks.

Most importantly, in the face of network congestion, a "retry" approach must be live-tested to determine that the particular HTTP errors don't result in duplicate orders. (Hence "need info".)

OT: Have you encountered any other major issues with 2.0.0rc2? I'm not sure how to interpret the relative silence concerning it.

Are there no other bugs? (Hard to believe.)

Does no one use it, since pip doesn't install it by default? (Quite likely.)

Did everyone give up on Kraken, since it lags so much? (Passionately empathising.)

gzhytar commented 7 years ago

Thanks @veox, your comments in the pull req make lots of sense. I'll upload a new revision once test it for a while.

I've started using krakenex few days ago through https://github.com/Endogen/Telegram-Kraken-Bot and was annoyed with constant networking problems as original version of the bot references krakenex 1.0 (i made a pull request today to use 2.0.0 RC).

I wish i looked at your repo and notice that you already have 2.0.0rc, but i didn't. Ended up rewriting krakenex to use Requests on my own, used it for couple days and only today noticed that you already did all the hard work ;)

My observations are that 2.0.0 rc is more stable, especially in multithreaded use-case (as telegram bot has several active threads sending requests). Adding retries also increased a success rate making orders and querying balance. I haven't noticed any issues so far.

And yes, lots of folks who are still trying to use Kraken through web UI are giving up now.

veox commented 7 years ago

For ref: a query that resulted in a 502 HTTP error can definitely "go through" to the trade execution engine. I've observed this personally on many occasions (see PSA on /r/krakenex).

The assumption that a nonce will prevent duplicate orders can therefore be tested if attempting a retry on a 502.

EDIT: To clarify: this is risky, in general don't do it! I've already tested this (see https://github.com/veox/python3-krakenex/issues/66#issuecomment-344372672), no need to step on the rake. :)

veox commented 7 years ago

I'll try to get to a "proper" v2.0.0 release next week, with pip package and all. Hopefully, that'll get more people to use it. Also, live-testing PR #65.

gzhytar commented 7 years ago

Cool. Telegram-bot author just accepted a pull request so hopefully there will be a few more folks on 2.0.0rc

On Sun, Nov 12, 2017, 20:46 Noel Maersk notifications@github.com wrote:

I'll try to get to a "proper" v2.0.0 release next week, with pip package and all. Hopefully, that'll get more people to use it. Also, live-testing PR #65 https://github.com/veox/python3-krakenex/pull/65.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/veox/python3-krakenex/issues/66#issuecomment-343761947, or mute the thread https://github.com/notifications/unsubscribe-auth/AOQAs1PrGUUaE0f2zpQ79-oFT-8ID8MBks5s10sfgaJpZM4Qa6qp .

-- -- Georgiy Zhytar

Check out what I'm doing and subscribe to get your weekly portion of inspiration and knowledge to learn leading wisely http://LeadingWise.ly

philippose commented 7 years ago

Hello, A Good Evening to you both.

Firstly, @veox, thank you very much for this very functional, but at the same time extremely simple to use interface to the Kraken API. I have been using this library for about a month now... Not trading yet, but still working on putting together a platform for automated trading (maybe someday, I will get that far!)

Just a comment on this particular proposed change...: I was considering a similar idea because of the increased number of HTTP errors in the last few days. However, like @veox, I do not think it is a very good idea to implement the retry logic so deep in the library. Most of the time when I received this error while adding orders, the order had gone through. In such situations, it would be catastrophic to have an automated retry, which would result in multiple orders.

I personally think that the implementation of the retry logic should be placed higher up at the application layer. something like "try, except" blocks could be used to process the error, and then probably send a request to check the current orders to first verify the status of the order before attempting a retry.

Regards, Philippose Rajan

veox commented 7 years ago

@philippose Thanks for your thoughts!

I do tend to agree in general that it's "better" to scream loudly of a fault and do nothing when money is involved; however, the try/except approach doesn't allow for simple nonce reuse (the API-level one), as proposed here.

Perhaps, if re-using the PreparedRequest, instead of constructing a new one in the except block...

Otherwise, one could add userref parameters to orders when placing them, and if that fails:

wait for an (arbitrarily) "long" time;
look up the order by its userref;
if it's not found, retry sending the order.

This is error-prone, though, since the time-to-wait seems to depend on the load on Kraken's trade execution engine.

veox commented 7 years ago

Perhaps, if re-using the PreparedRequest, instead of constructing a new one in the except block...

That is, if the except caught ~~a 502~~ ~~a 504~~ something that's known to be "safe" to repeat, one should be able to access the precise request that was sent - as krakenex_API_object.response.request - and re-send that with krakenex_API_object.session.send(...).

Haven't tried it yet. Hopefully - next week.

EDIT: This will likely only work if there haven't been any queries in the interim with higher nonces, if not increasing the the nonce window in Kraken's settings... I should really go rest, though. :D

veox commented 7 years ago

TL;DR: try/except with session.send() in end-app works, but adds a lot of boilerplate (EDIT: and with a nasty repetition of in-lib code to catch subsequent exceptions). retries within lib approach as in #65 could be OK, if made opt-in.

Note: v2.0.0 has been released; I've used it for the experiment below. It has no changes compared to v2.0.0rc2.

I've tried the approach from my previous comment. It can be seen in this gist.

In short - it works. (Tried a few times.) A query that seems to fail with a 502, but does in fact go through, will be rejected by Kraken when repeated, with a response of:

{'error': ['EAPI:Invalid nonce']}

In other words, the nonce is indeed sufficient for Kraken to recognise it as a duplicate. (At least for now...)

This approach, however, adds a lot of boilerplate to the end application. The "wrapper" must be repeated for every query type, either manually or with a decorator.

This is unwieldy, and, more importantly, would likely be too complicated for a lot of people using krakenex. (From numerous issues reported this year, I'm guessing trading automation is the first stint with programming for many.)

@gzhytar's approach seems warranted - it doesn't add too much bloat, and is a solution for a common issue, especially in light of Kraken's current (abysmal) performance.

This should be, however, a deliberate opt-in setting, as mentioned in https://github.com/veox/python3-krakenex/pull/65#issuecomment-343757007. Otherwise, it's only likely to further increase the load on Kraken's TEX.

pawapps commented 7 years ago

Appreciate all the work on this, I am using v1 and have implemented similar logic into the API class.

One thing I do not quite understand is Kraken's "nonce window" implementation. From experimentation I know resubmitting a nonce with a nonce window > 0 can make for a duplicate submit.

Here's my logic/flow for private queries: Submit Query if HTTP error: resubmit query with same nonce if EService:Unavailable or EService:Busy error: resubmit query with same nonce if EAPI:Invalid nonce error and AddOrder method: spend time checking for the wayward order if EAPI:Invalid nonce error and NOT AddOrder method: resubmit query with new nonce

The Kraken has been very unhappy lately...

ps. (likely off-topic but...) how does v2 improve the reliability of the network connections? I don't see how it would make better the onslaught of 5XX server errors

veox commented 7 years ago

@pawapps The "nonce window" part would make a good separate issue. ;)

As to network connection reliability:

v1 didn't gracefully handle the case where the remote "dropped" the connection between queries. You'd have to check the connection state in advance, prior to sending the query; or reset the connection manually and repeat the failed query. The latter, when done inappropriately, could also result in two open connections, or at least a Connection object floating in memory (due to what looks like the OS-level broken socket left open)... All of this is too much networking where people are usually concerned with business logic (often literally).

In v2, requests handles this behind the scene. So you'll "only" get a requests exception as a direct result of performing a query; as compared to something that looks like a failed query, but didn't even make it to CloudFlare.

The latter is a rare case these days, which this issue is about (kind of...).

veox commented 6 years ago

It's been about a month of testing, and I haven't as-of-yet experienced a duplicate order when using the nonce-protection approach (see PR #65).

veox / python3-krakenex

Implement a recovery mechanism for failed http requests #66

What are you trying to achieve?