tensortrade-org / tensortrade

An open source reinforcement learning framework for training, evaluating, and deploying robust trading agents.
https://discord.gg/ZZ7BGWh
Apache License 2.0

Rewarding PBR on previous transaction #449

JonathanLacoste closed this issue 1 year ago

JonathanLacoste commented 1 year ago

I have implemented the basic ray example found on:

https://www.tensortrade.org/en/latest/tutorials/ray.html

and made a basic change to the reward function. The idea was to reward the price difference since the last transaction instead of the price difference since the last step. When I run this example multiple times from fresh environments, I seem to get inconsistent results.

Another slight change from the example is that I run tune with tf instead of torch.

I am certain somebody here has thought about or implemented this logic. Does anybody see any issue with it?

Here is my new reward function:

```python
from tensortrade.env.default.rewards import TensorTradeRewardScheme
from tensortrade.feed.core import DataFeed, Stream


class PBR(TensorTradeRewardScheme):
    """PBR modified to reward the price difference since the last transaction
    instead of the price difference since the last step."""

    registered_name = "pbr"

    def __init__(self, price: 'Stream'):
        super().__init__()
        self.position = -1
        self.lastWorth = 0
        self.previousLastWorth = 0
        self.lastAction = 0

        # Sensor feed used to read the current price whenever a trade happens.
        price = Stream.sensor(price, lambda p: p.value, dtype="float")
        price = price.rename("price")
        self.priceFeed = DataFeed([price])
        self.priceFeed.compile()

        position = Stream.sensor(self, lambda rs: rs.position, dtype="float")
        previousLastWorth = Stream.sensor(self, lambda rs: rs.previousLastWorth, dtype="float")

        # Reward = (current price - price at the previous transaction) * position.
        # position is set to 0 on steps without a trade, so those steps reward 0.
        reward = ((price - previousLastWorth) * position).fillna(0).rename("reward")
        self.rewardFeed = DataFeed([reward])
        self.rewardFeed.compile()

    def on_action(self, action: int):
        if self.lastWorth == 0:
            self.position = 0

        if abs(action - self.lastAction) > 0:
            # The action changed, i.e. a trade was made.
            self.lastAction = action
            self.position = -1 if action == 0 else 1
            self.previousLastWorth = self.lastWorth
            self.lastWorth = self.priceFeed.next()["price"]
        else:
            # No trade this step: zero the position so the reward is 0.
            self.position = 0

    def get_reward(self, portfolio: 'Portfolio'):
        return self.rewardFeed.next()["reward"]

    def reset(self):
        self.position = -1
        self.rewardFeed.reset()
        self.priceFeed.reset()
        self.lastAction = 0
        self.lastWorth = 0
        self.previousLastWorth = 0
```
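For context, this is roughly how the scheme is wired into the environment builder from the tutorial (the instrument pair, wallet sizes, and window size below are placeholders rather than my exact setup):

```python
from tensortrade.env import default
from tensortrade.env.default.actions import BSH
from tensortrade.feed.core import DataFeed, Stream
from tensortrade.oms.exchanges import Exchange
from tensortrade.oms.instruments import BTC, USD
from tensortrade.oms.services.execution.simulated import execute_order
from tensortrade.oms.wallets import Portfolio, Wallet


def create_env(prices, window_size=25):
    # prices: a plain list/array of floats (e.g. the sine curve from the tutorial).
    p = Stream.source(prices, dtype="float").rename("USD-BTC")
    exchange = Exchange("sim-exchange", service=execute_order)(p)

    cash = Wallet(exchange, 100_000 * USD)
    asset = Wallet(exchange, 0 * BTC)
    portfolio = Portfolio(USD, [cash, asset])

    reward_scheme = PBR(price=p)  # the modified scheme above
    action_scheme = BSH(cash=cash, asset=asset).attach(reward_scheme)

    feed = DataFeed([Stream.source(prices, dtype="float").rename("price")])

    return default.create(
        feed=feed,
        portfolio=portfolio,
        action_scheme=action_scheme,
        reward_scheme=reward_scheme,
        window_size=window_size,
    )
```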

And just in case, here are some versions I am using:

- Ubuntu 22.04.1 LTS
- Python 3.10.4
- tensorflow==2.9.1
- tensortrade==1.0.3

Furthermore, I am not certain whether this is a bug, as I am quite new to RL and Python; however, neither of the links recommended for more generic questions seems to work:

> Please ask your question on StackOverflow or on the TensorTrade Discord #help-desk channel instead of opening a GitHub issue.

carlogrisetti commented 1 year ago

At first thought it doesn't seem right to reward each step for something potentially negative:

- You open at $10, the asset rises to $15, and you reward based on +$5.
- The asset then falls to $12 and you reward based on +$2, but you effectively lost $3 that could have been acted upon (i.e. selling before the drop).
- Then the price falls to $10 and you reward based on +$0. You didn't profit anything, but the agent has been cumulatively rewarded for +$7.
- Worse still, the asset now drops to $9 and your agent gets a negative reward of -$1.

You lost money overall, but you grant +$7 - $1 in rewards to the agent.
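A quick back-of-the-envelope sketch of that scenario, assuming the reward at each step is measured against the $10 entry price as described:

```python
# Scenario from above: enter at $10, price then goes 15 -> 12 -> 10 -> 9.
# Reward at each step is measured against the entry price.
entry = 10
prices = [15, 12, 10, 9]

rewards = [p - entry for p in prices]   # [5, 2, 0, -1]
print(sum(rewards))                     # 6  -> cumulative reward is positive
print(prices[-1] - entry)               # -1 -> the position actually lost money
```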

I haven't looked at the code, but if this is the behavior there is a logic issue underneath...

JonathanLacoste commented 1 year ago

@carlogrisetti Thanks for your feedback, let me clarify the logic I wish to put in.

The idea would be to reward only when a trade is made. At any other time, when no action is taken, I would want to return a reward of 0.

So when a trade is made, I would compare against the price of my previous trade.

Let's say I buy at $10, then sell at $15 --> reward of 5. If I buy at $10 and then sell at $5 --> -5.

If the last trade was a sell at 15 and then I buy at 10 --> 5. If the last trade was a sell at 15 and then I buy at 25 --> -10.

As an example: buy at t0 at price $10, sell at t3 at price $15, buy at t5 at price $17.

| t0 | t1 | t2 | t3 | t4 | t5 |
|----|----|----|----|----|----|
| 0 | 0 | 0 | 5 | 0 | -2 |

- The first transaction at t0 happens at my baseline, so 0 difference.
- t1 and t2: no action --> 0 reward.
- t3 is a sell: calculate price - last traded price --> 15 - 10 --> 5 reward.
- t4: no action --> 0 reward.
- t5 is a buy --> (17 - 15) * -1 --> -2.

Clearly there is an issue either with the implementation or with my logic, but the way I see it, this approach should always converge, in the simple sine curve example, to buying at the lowest possible value and selling at the highest possible value.
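A tiny sketch reproducing the intended reward sequence from that walkthrough (the prices at the no-action steps are placeholders, since they don't affect the reward):

```python
# Reward only on a trade: (price - price at the previous trade) * position,
# where the multiplier is +1 when the trade is a sell and -1 when it is a buy,
# matching the walkthrough above.
steps = [
    ("buy",  10),   # t0: baseline trade, no previous trade -> 0
    (None,   11),   # t1: no action -> 0 (placeholder price)
    (None,   13),   # t2: no action -> 0 (placeholder price)
    ("sell", 15),   # t3: 15 - 10 = +5
    (None,   16),   # t4: no action -> 0 (placeholder price)
    ("buy",  17),   # t5: (17 - 15) * -1 = -2
]

last_trade_price = None
rewards = []
for action, price in steps:
    if action is None:
        rewards.append(0)
        continue
    position = 1 if action == "sell" else -1
    reward = 0 if last_trade_price is None else (price - last_trade_price) * position
    rewards.append(reward)
    last_trade_price = price

print(rewards)   # [0, 0, 0, 5, 0, -2]
```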

carlogrisetti commented 1 year ago

Soooo you're training a bagholder :) It may well never learn to sell at a loss, because holding is always better than taking a negative reward.

Also, you're punishing it for buying at a higher price than what it sold for, which once more encourages it not to sell, even at a profit.

I see what you are trying to achieve, but I personally don't think it's a viable (or at least better) approach than what even a simple SimpleProfit could do.

What issues does it aim to resolve? (Genuine, non-rhetorical question)

JonathanLacoste commented 1 year ago

Hmm, I do see what you mean, but it is unclear to me why it would ultimately lead to a bagholder. Again, I am really new at this, so feel free to correct my logic, but the way I thought it would play out is: if it is actually losing money, it would prefer holding; but once it holds, it would learn to optimize for making profitable trades.

Although I do agree that it possibly would not want to sell at a loss at first, its next move, which would be a buy, would be compensated more than not making a move. So even though it would get a negative reward at the beginning, the next trade would have a higher positive reward (or a less negative one), which would ultimately lead it to learn to sell at a loss if that means more gains in the future. That's just how I imagined it, at least.

As to the issue it aims to resolve: I thought it would be a better approach not to penalize the agent for not making a move in uncertain times. Rewarding only actual actions on the environment seems to me to lead to a more robust evaluation of how the bot behaves. It's going more in the sparse-reward direction; a hypothesis, at least :)

carlogrisetti commented 1 year ago

@JonathanLacoste what if the asset you've bought never recovers and instead stays below your original entry point? Even if there would have been numerous ups and downs in that period that the agent could have profited from, it's still locked into that trade, because holding is better than selling at a loss in any case... no matter whether that's -0.1% or -50%.

It's true that if that is only your first trade, for let's say 20% of your net worth, you can still train the agent on the remaining 80% of the capital, but it's suboptimal to do so anyway.

A way you can trick the agent into being more stable is to define a higher commission value in the exchange definition you're using. This forces the agent to only take action with a higher degree of certainty (it has to be sure to cover a higher-than-normal commission fee, so it can't afford to trade back and forth on a tiny gain). In any case, frequent trading is a symptom of too little training (if I remember correctly; sorry, I haven't kept up with the project recently).
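For reference, a minimal sketch of raising the commission in the exchange definition (the 1% value and the `price_stream` name are just examples):

```python
from tensortrade.oms.exchanges import Exchange, ExchangeOptions
from tensortrade.oms.services.execution.simulated import execute_order

# A higher commission means the agent has to cover a larger fee before a round
# trip is profitable, which discourages trading back and forth on tiny gains.
options = ExchangeOptions(commission=0.01)  # 1% per trade, example value
exchange = Exchange("sim-exchange", service=execute_order, options=options)(price_stream)
```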

JonathanLacoste commented 1 year ago

Since pictures speak a thousand words, here is where I imagined it would converge in that use case, though I tend to have a strong imagination sometimes, so take it with a grain of salt :D

(image: sketch of the expected convergence)

Either way, SimpleProfit is definitely not a bad avenue in general; that is what I had started with. I just thought this could be a somewhat similar approach but with other advantages.

carlogrisetti commented 1 year ago

Ok... One flaw I see is that you reward the second buy based on how much lower than the previous sell it was executed at.

That does not mean the price will go up from there; it can just as easily keep going down, but you are still rewarding that action. In that case, the most correct approach would be to find a way to reward the sell trade, not the buy... It's difficult because you would have to add rewards back in time, but that's the action you actually want to reward.

You could also leave everything as it is and additionally reward the agent for not having a position open while the asset is decreasing in value.
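To make that concrete, a rough, untested sketch of pushing the reward onto the closing trade, written against the scheme posted above; it assumes `self.entryPrice = None` and `self.reward = 0` are added to `__init__`/`reset`, and that action 0 means moving to cash as in BSH:

```python
# Rough sketch: only the closing trade (the sell) is rewarded, with the
# realized price difference; a buy just records the entry price.
def on_action(self, action: int):
    self.reward = 0
    if action != self.lastAction:               # action flipped -> a trade happened
        price = self.priceFeed.next()["price"]  # current price via the sensor feed
        if action == 0 and self.entryPrice is not None:
            # Sell: reward the realized move since the entry.
            self.reward = price - self.entryPrice
            self.entryPrice = None
        elif action == 1:
            # Buy: remember the entry price, no reward yet.
            self.entryPrice = price
        self.lastAction = action

def get_reward(self, portfolio: 'Portfolio'):
    return self.reward
```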

JonathanLacoste commented 1 year ago

@carlogrisetti

I do agree with your statement, but technically the reward should be applied to the combination of both the earlier sell and the buy. To me this should still converge to the optimal solution thanks to gradient descent, in the sense that with enough training it should learn to sell at the peaks and buy at the troughs. If it started with a buy halfway down the hill, then through gradient descent it should prefer to keep going down the hill, since that would give more reward, and it should not start going far up another hill, because that would give less reward.

Something still bothers me, unfortunately; I will keep pondering it. Either way, many thanks for taking the time and sharing your brainpower on this, it is greatly appreciated. I will close by the end of this week if no further thoughts are provided.

carlogrisetti commented 1 year ago

@JonathanLacoste be careful not to misinterpret gradient descent as converging to a "ground truth", which we don't have. There are multiple valid solutions to our problem, each one with its own peculiar path :)

JonathanLacoste commented 1 year ago

@carlogrisetti Well said Carlo; it is the programmer in me that has a hard time not understanding something and just letting it go. Especially since this is not a question of whether it converges, since it does; it is how it behaves after it converges.

(image) It somehow keeps sliding to the right until it reaches rewards close to 0, which made me wonder whether it was always granting rewards for the next step. (image)

All the way down to: (image)

It baffles me, going from optimal rewards back down to converging at 0. Maybe I am just suffering from slight PTSD from watching it converge and then lose its optimal state. I think I will let it rest for the time being and come back to it in a bit; sometimes a little distance brings clarity.

carlogrisetti commented 1 year ago

Why all those trades? I would expect it to converge towards a small number of trades concentrated at the peaks/valleys, not to spread them all over the place. What action scheme are you using? Can you try BSH and see how it behaves?
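For reference, the BSH scheme from the default action package is set up like this (the wallet and reward-scheme names are assumed from the tutorial setup):

```python
from tensortrade.env.default.actions import BSH

# BSH exposes only two actions: hold everything in cash or hold everything
# in the asset; a trade happens only when the chosen action flips.
action_scheme = BSH(cash=cash, asset=asset).attach(reward_scheme)
```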

JonathanLacoste commented 1 year ago

Sorry, I have been busy moving across countries and have not had time to follow up on this. Those were not trades but the actions the agent decided on (trades only occur when switching between buying and selling). Either way, I will close the issue; I don't believe this is due to tensortrade.

Thanks for the support Carlo, all the best