kernc / backtesting.py

:mag_right: :chart_with_upwards_trend: :snake: :moneybag: Backtest trading strategies in Python.
https://kernc.github.io/backtesting.py/

Clarification suggestion for stats output "Buy" & Hold label in output #150

Closed · blockforcecapital closed this 3 years ago

blockforcecapital commented 4 years ago

First off, thank you for the wonderful work. This is very useful, and I am sure it will be appreciated by many. I've dabbled with a few packages for backtesting in Python, and this is by far my favorite.

The following is just my opinion, and certainly not critical. You commented on it in your source code, but it might be confusing for beginners. Here is the line I am referring to: https://github.com/kernc/backtesting.py/blob/2bb8f29c393fec030cfb247b16eea2aec2304f45/backtesting/backtesting.py#L1383

Expected Behavior

In a declining market, the stats line "Buy & Hold Return [%]" would normally show a negative return.

Actual Behavior

Instead, the result is reported as an absolute value, which could confuse some users. Perhaps add a new line, "Short & Hold Return [%]", or change the label to "Best of (Long or Short) & Hold Return [%]".

Steps to Reproduce

  1. Choose a timeframe where your data series loses value.
  2. Run stats = bt.run(), then print(stats).
  3. "Buy & Hold Return [%]" will show a positive number.


kernc commented 4 years ago

Thanks for the warm intro. :heart: It does look promising at first glance, but it still has many bugs to iron out. :sweat_smile:

It has indeed proven confusing a couple of times now, see e.g. https://github.com/kernc/backtesting.py/issues/140, https://github.com/kernc/backtesting.py/issues/36.

I see adding a new negated-value key as somewhat pointless, but as an updated label I'd certainly prefer something shorter, like "Buy/Sell & Hold Return [%]" or "Buy (Sell) & Hold Return [%]". Do you think that would clarify it? Is there another similar expression that encompasses both meanings?

eervin123 commented 4 years ago

Hi there, I'm commenting here from my other account; I didn't realize I was logged in as @blockforcecapital before.

Thanks for responding. I'm sorry I didn't see #140 and #36 before, and I feel silly wasting your time with such a trivial matter. It took me a second, but I figured it out fairly quickly. Given that, if you twisted my arm and I had to make a change, I would suggest you leave it as a negative number and keep "Buy & Hold Return [%]". That is how I am most accustomed to seeing it: my brain instantly processes that a buy-and-hold investor would have lost money, so hopefully my strategy would have a less negative return, or better yet, a positive one.

Since we are here, I wonder if you have also considered creating a more robust set of stats on the strategy itself (in this case, the equity_df).

I started to take a stab at hacking together my own _compute_stats function, but I am not nearly as experienced, so it might take a while, and anything I do probably wouldn't be worthy of a pull request.

By the way, I am happy to type this up on a different issue if you like.

For background, as an asset manager, we typically start from a top-down perspective: start with a hypothesis, run some analysis, look at the strategy's returns and risk on an absolute and relative basis, then explore the attributes that led to its outperformance or underperformance. In other words, first I want to see risk-reward metrics on the strategy, and then I start looking at the trade statistics.

In the current iteration, we are seeing Sharpe, Sortino, and Calmar data on the trade_df, which is very helpful, but I first want to see stats like that on the strategy (equity) itself. This is another thing I had to look at the code more closely to clarify: I was assuming Sharpe and Sortino were based on the strategy, not on the trades. It is nice to look at the trades and their risk-reward stats, but this might be misleading when a person is really expecting risk-return statistics on their strategy. My suggestion would be to add annualized return (or some other preferred method of rolling period returns), standard deviation, Sharpe, and Sortino up in the equity section of the stats. Then, if I really had my wish list, I would structure it like the following image.

[Image: proposed stats layout with a "Benchmark (Buy&Hold)" column alongside the strategy's statistics]

Again, thank you for putting in so much effort. I wish I could help more than just making suggestions. ;)

kernc commented 4 years ago

Thanks. Your comment makes a lot of sense. No need to open a new issue: if we do need to make changes to the returned stats, it shouldn't happen too often, and the issue had better not stay open for too long.

if I had to make a change, I would suggest you leave it as a negative number and keep "Buy & Hold Return [%]". That is how I am most accustomed to seeing it: my brain instantly processes that a buy-and-hold investor would have lost money, so hopefully my strategy would have a less negative return, or better yet, a positive one.

Reviewing it again, I see nothing really changes except the extra minus sign. A strategy that makes no trades would still fare better than a declining market, but I guess nobody relies on their results that indiscriminately. :+1:

That said, I don't particularly like perpetuating the idea that today, when everyone can easily short or just stay in cash, generating less loss than the benchmark is anywhere near good enough! :laughing:

Since you managed to convince me, and I certainly don't mind growing the contributors list — would you maybe like to do the PR?

if I really had my wish list, I would structure it like the following image.

Regarding the "Benchmark (Buy&Hold)" column: You can compute a separate set of stats on a benchmark strategy, e.g.:

from backtesting import Strategy

class Benchmark(Strategy):
    # Buy on the first bar and hold to the end
    def init(self):
        self.buy()

    def next(self):
        pass
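
Running it through Backtest the same way as your own strategy yields a comparable stats series, e.g. (a sketch; data here is a placeholder for your OHLC dataframe):

from backtesting import Backtest

# `data` is a placeholder for your OHLC dataframe
benchmark_stats = Backtest(data, Benchmark).run()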

and then compare/difference the two results:

stats_numeric = stats.filter(regex='^[^_]')  # Skips objects _strategy, _trades, _equity
...
difference = stats_numeric - benchmark_stats_numeric

I think this part is thus easy enough to work around/implement and adjust yourself.

In the current iteration, we are seeing Sharpe, Sortino, and Calmar data on the trade_df, which is very helpful, but I first want to see stats like that on the strategy (equity) itself. This is another thing I had to look at the code more closely to clarify: I was assuming Sharpe and Sortino were based on the strategy, not on the trades. It is nice to look at the trades and their risk-reward stats, but this might be misleading when a person is really expecting risk-return statistics on their strategy. My suggestion would be to add annualized return (or some other preferred method of rolling period returns), standard deviation, Sharpe, and Sortino up in the equity section of the stats.

This certainly offers insight into why I never found optimizing for Sharpe useful. The strategy could have made just two trades, with a reasonable mean and small deviation, but that was absolutely not what anyone wanted! :grin:

I was indeed thinking there's a per-timeframe return metric missing. Annualized sounds fine, although annualizing two weeks of returns on, e.g., 1-second data (and we're a general-purpose framework) should be considered pretty far-fetched.

So, by Sharpe based on equity you in fact mean the classic but utterly imprecise daily_returns.mean() / daily_returns.std() * sqrt(252)? What should our heuristic be for inferring the number of trading days in, e.g., forex or crypto?
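
For concreteness, that naive computation might look like the following sketch (equity stands for a daily equity curve; the risk-free rate is assumed to be 0):

import numpy as np
import pandas as pd

def naive_sharpe(equity: pd.Series, periods_per_year: int = 252) -> float:
    # Per-period simple returns of the equity curve
    returns = equity.pct_change().dropna()
    # Classic annualization: mean/std of returns scaled by sqrt(periods per year)
    return returns.mean() / returns.std() * np.sqrt(periods_per_year)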

How much value do the current "trade" Sharpe/Sortino ratios really carry? I'm leaning towards just replacing them.

Additionally, some of those ratios imply additional input parameters, such as the risk-free return for Sharpe and the minimum acceptable return for Sortino. (Related issue: https://github.com/kernc/backtesting.py/issues/71.) Since I'd strongly prefer to avoid extra arguments, is it OK to assume those parameters are 0 in the general case and still find the returned values widely meaningful?
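
With both parameters fixed at 0 this reduces to the naive case; for Sortino, a sketch in the same spirit (mar, the per-period minimum acceptable return, is an illustrative name):

import numpy as np
import pandas as pd

def naive_sortino(equity: pd.Series, mar: float = 0.0,
                  periods_per_year: int = 252) -> float:
    returns = equity.pct_change().dropna()
    # Downside deviation penalizes only returns below the MAR
    downside = np.minimum(returns - mar, 0)
    downside_dev = np.sqrt((downside ** 2).mean())
    return (returns.mean() - mar) / downside_dev * np.sqrt(periods_per_year)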

Again, thank you for putting in so much effort. I wish I could help more than just making suggestions. ;)

You can sponsor me when you're rich. :joy:

Thanks!

eervin123 commented 4 years ago

This is great. I'll get back to you on these points later this evening. Regarding:

You can sponsor me when you're rich. 😂

Not rich yet, but I can certainly afford to buy you a few beers over the next twelve months. It isn't much, but hopefully, that one beer per month will turn into many as others follow along. :)

eervin123 commented 4 years ago

Okay, so I'm coming back around to this.

Regarding the "Benchmark (Buy&Hold)" column: You can compute a separate set of stats on a benchmark strategy, e.g.:

from backtesting import Strategy

class Benchmark(Strategy):
    # Buy on the first bar and hold to the end
    def init(self):
        self.buy()

    def next(self):
        pass

and then compare/difference the two results:

stats_numeric = stats.filter(regex='^[^_]')  # Skips objects _strategy, _trades, _equity
...
difference = stats_numeric - benchmark_stats_numeric

I think this part is thus easy enough to work around/implement and adjust yourself.

Yes, I did this in my main.py file. I didn't mess with your backtesting.py. Were you suggesting we modify the _compute_stats function?

import pandas as pd

# `stats` and `bm_stats` are the Series returned by bt.run() for the
# strategy and for the Benchmark, respectively

# Create a dataframe to compare the strategy to the benchmark buy and hold
stats_numeric = stats.filter(regex="^[^_]")  # Skips objects _strategy, _trades, _equity
benchmark_stats_numeric = bm_stats.filter(regex="^[^_]")
difference = stats_numeric - benchmark_stats_numeric
# Combine the strategy, the benchmark, and the difference into a single statistics df
df_stats = pd.concat([stats_numeric, benchmark_stats_numeric, difference], axis=1)
# Rename columns
df_stats.columns = ["Strategy", "Buy & Hold", "Difference"]

print(df_stats)
df_stats.to_csv(str(stats["_strategy"]) + ".csv")

kernc commented 4 years ago

Were you suggesting we modify the _compute_stats function?

No. Since this would entail an API change (returned dataframe vs. series) and would prompt users to accept (or rather, complain about) our predefined benchmark strategy, I thought it easy enough to implement as you have, on your end, to your preference.

But the rest of my queries are about changes in stats that I'm quite eager to see.

eervin123 commented 4 years ago

Regarding your comments about choosing a time period, e.g. "Annualized": I think it makes sense, but you are correct that it will at times be wonky. A while back, we created a couple of tools on our Reality Shares website. The first tries to calculate statistics based on the time period and attempts to annualize the numbers regardless of how small the period is.

[Image: the tool's annualized statistics over a typical date range]

But when the time period is very small, it gets crazy.

[Image: the same statistics blowing up over a very short date range]

We did use logic to handle various time periods, as you can see in the footnotes:

For the statistics to the right of the line chart:

  - Date ranges of 10 years or longer use rolling yearly returns with daily data frequency.
  - Date ranges from four to 10 years use rolling quarterly returns with daily data frequency.
  - Date ranges from 10 months to four years use rolling monthly returns with daily data frequency.
  - Date ranges from three months to 10 months use rolling weekly returns with daily data frequency.
  - Date ranges of up to three months use rolling daily returns with daily data frequency.

For backtesting.py, it may be intelligent to use various rolling periods relative to the total time period; a rough sketch of that mapping follows. However, that might not be a priority given the amount of work involved.
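
A rough translation of those footnote rules into code might look like this (a sketch; the thresholds are approximate and the frequency strings purely illustrative):

import pandas as pd

def rolling_frequency(span: pd.Timedelta) -> str:
    # Approximate thresholds from the footnote above
    if span >= pd.Timedelta(days=10 * 365):
        return 'Y'  # rolling yearly returns
    if span >= pd.Timedelta(days=4 * 365):
        return 'Q'  # rolling quarterly returns
    if span >= pd.Timedelta(days=10 * 30):
        return 'M'  # rolling monthly returns
    if span >= pd.Timedelta(days=3 * 30):
        return 'W'  # rolling weekly returns
    return 'D'      # rolling daily returns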

We also built a different tool that just defaults to N/A when the numbers get weird; it doesn't seem to bother anyone.

[Image: the second tool showing N/A for statistics it cannot sensibly annualize]

I'm happy to share the source code with you if you think it would be helpful. No guarantees that it is any good; we aren't programmers, just good enough to get things to work.

eervin123 commented 4 years ago

Regarding annualizing statistics: I hear you and agree, it is horribly imprecise, but at this stage in the backtesting process the imprecision is probably okay. For example, if you annualize the mean daily return using a 365-day year, then as long as you annualize the standard deviation using the same 365-day year, the relative nature of the two numbers will matter more than their absolute values. The same goes for a 252-day period.

As I mentioned above, we would typically use different periods based on the total amount of time. My general rule of thumb is that as long as there are 50 or more unique periods in the total duration, you can use that increment to extrapolate your return and risk for the duration. For example, if you had 52 weeks of data in a year, you could look at the mean weekly return and its standard deviation, then annualize those two using the 52 periods.

However, if you are attempting to extrapolate returns and risks over a longer duration than you have sample data for, my rule of thumb is that the sample must cover at least 1/4 of the total duration and contain at least 50 increments. For example, if you had 60 one-minute increments of data, you could make assumptions about the average risk and return of one-minute periods over the given hour, but if you wanted to extrapolate that out to anything more than four hours, the results would become more and more suspect. So, if you wanted to add some logic in there, you might think about that; a sketch follows.
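
In code, that rule of thumb might be checked like this (a sketch; the names are illustrative, and returns is assumed to have a DatetimeIndex):

import pandas as pd

def can_extrapolate(returns: pd.Series, target_span: pd.Timedelta) -> bool:
    # Rule of thumb: at least 50 sample increments, and the sample must
    # cover at least a quarter of the span being extrapolated to
    sample_span = returns.index[-1] - returns.index[0]
    return len(returns) >= 50 and sample_span * 4 >= target_span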

Finally, I played around with pyfolio tear sheets. Here is a sample notebook.

There are some nice features, but there are still some comparisons that I feel are missing. The idea of in-sample vs. out-of-sample is great, but I really want to see more comparisons relative to the benchmark buy-and-hold strategy; we see some of these in pyfolio's rolling relative statistics. I will keep playing around with these if you would like me to give additional input.

eervin123 commented 4 years ago

On the topic of using a risk-free rate of zero to calculate your risk-adjusted returns: I call that the "naive method", and it is perfectly normal for a framework like this, which is trying to evaluate the rough edges of a strategy. The precision can come after the user has decided on a strategy and is looking to institutionalize it. For initial testing and evaluation, I feel it is perfectly normal to have some descriptive statistics that are not perfectly precise. If interest rates become meaningful again, perhaps you can revisit that. :)
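
For reference, a nonzero rate would change the naive Sharpe sketched earlier only by subtracting the per-period risk-free return first (a sketch; rf_annual is an illustrative parameter, 0 by default as discussed):

import numpy as np
import pandas as pd

def sharpe_with_rf(equity: pd.Series, rf_annual: float = 0.0,
                   periods_per_year: int = 252) -> float:
    returns = equity.pct_change().dropna()
    # Subtract the de-annualized risk-free return from each period's return
    excess = returns - rf_annual / periods_per_year
    return excess.mean() / excess.std() * np.sqrt(periods_per_year)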