nakulnayyar / ML-KNearest-VXX

Machine Learning Algo using Knearest Neighbors model on VXX trading strategy.

Look-forward bias? #1

Open windwine opened 8 years ago

windwine commented 8 years ago

Hi Nakul,

Thank you for posting the Python code. While checking the code, I noticed that before y is transformed into a 0/1 class, we have:

trainF.head()
Out[77]:
                  VRP   TermStr
Date
2013-01-04  10.268297 -2.510000
2013-01-07  10.190552 -2.660001
2013-01-08   8.959798 -2.830001
2013-01-09  11.286732 -2.690000
2013-01-10   5.719475 -2.630001

trainT.head()
Out[78]:
            1dayVXXret
Date
2013-01-03   -0.020834
2013-01-04   -0.001089
2013-01-07   -0.012432
2013-01-08    0.002205
2013-01-09   -0.024904

Basically we are using 01/04 VRP and term structure to calculate the return from 01/03 to 01/04? Is that a look-forward bias? Thanks.
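One way to look at the question concretely is to rebuild the two head() outputs above and compare pairing by row position with pairing by date. This is only a sketch: it reuses the printed values, and it assumes the model is fitted on the raw arrays, so that rows are matched by position rather than by date. The names `positional_pairs` and `aligned` are made up for the illustration.

```python
import pandas as pd

# Values copied from the trainF.head() / trainT.head() output above.
trainF = pd.DataFrame(
    {"VRP": [10.268297, 10.190552, 8.959798, 11.286732, 5.719475],
     "TermStr": [-2.510000, -2.660001, -2.830001, -2.690000, -2.630001]},
    index=pd.to_datetime(["2013-01-04", "2013-01-07", "2013-01-08",
                          "2013-01-09", "2013-01-10"]),
)
trainT = pd.DataFrame(
    {"1dayVXXret": [-0.020834, -0.001089, -0.012432, 0.002205, -0.024904]},
    index=pd.to_datetime(["2013-01-03", "2013-01-04", "2013-01-07",
                          "2013-01-08", "2013-01-09"]),
)

# Positional pairing: what an estimator sees if it is handed the raw arrays.
positional_pairs = list(zip(trainF.index, trainT.index))

# Date-based pairing: keep only the dates present in both frames.
aligned = trainF.join(trainT, how="inner")

print(positional_pairs[0])  # feature date vs. target date of the first row
print(aligned)
```

If the two dates in the first positional pair differ, the rows are offset by one day, and joining on the index is one way to remove the offset.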

Regards,

Jacky

nakulnayyar commented 8 years ago

Hi Jacky,

Excellent Question! I actually popped a few mental gaskets keeping things straight while writing the code and re-popped them answering your question lol.

So take this snapshot of the data:

                  VRP   TermStr         VXX     VXXret  1dayVXXret
Date
2013-01-04  10.268297 -2.510000  110.199997  -0.020834   -0.001089
2013-01-07  10.190552 -2.660001  110.080002  -0.001089   -0.012432

VXXret is LN(Today's Close / Prev Close). 1dayVXXret is the 1-day FORWARD return, so it is just VXXret shifted by 1 day.
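The two columns can be reproduced from the two VXX closes in the snapshot. The sketch below follows the `np.log(close / close.shift(1))` convention that appears in the code quoted elsewhere in this thread; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# VXX closes taken from the snapshot above.
vxx = pd.Series(
    [110.199997, 110.080002],
    index=pd.to_datetime(["2013-01-04", "2013-01-07"]),
    name="VXX",
)

# Same-day log return: ln(today's close / previous close).
vxx_ret = np.log(vxx / vxx.shift(1))

# Forward return: the value stored on day T is the return from T to T+1.
fwd_ret = vxx_ret.shift(-1)

print(vxx_ret)
print(fwd_ret)
```

With only two closes, the first same-day return and the last forward return are NaN, which is why the rows at the edges of the sample are lost once the shift is applied.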

So the easy part: on the TRAINING data, I actually WANT to feed it forward returns because I want the model to fit on known data. Basically I am feeding it inputs for a day as well as the 'right answer' for it to 'learn' the function properly. So on TRAINING data it is okay to feed it the right answers. Does that make sense?

The harder part is the backtest. I've fed it known data and it fit a model; now I feed it NEW input data. Let's take 1/4. I am using closing values, but I assume that by 3:58 I know what the 1/4 closing input values will be. That's a good (but not great) assumption from experience. So I run the model on the 1/4 3:58 data, get the VRP and TermStr values, and it spits out 1 or -1. Now I also assume I can trade at the market close (again a good but not perfect assumption!), so I get a 110.20 execution price and I will TRADE again at the next day's closing price, which is 110.08, for a -0.11% return on my trade that started 1/4.

So in other words, based on 1/4 inputs of 10.27 & -2.51 I will receive -0.11% return. If I didn't shift the return forward I would assume I received the closing price of the PREVIOUS day 1/3, based on the inputs of 1/4.
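The arithmetic behind the -0.11% figure can be checked in a few lines. This is a minimal sketch: `trade_log_return` is a hypothetical helper, multiplying the return by a +1/-1 signal is an assumed convention for long and short positions, and costs and slippage are ignored.

```python
import math

def trade_log_return(signal, entry_close, exit_close):
    """Close-to-close log return of a +1 (long) or -1 (short) position."""
    return signal * math.log(exit_close / entry_close)

# Prices quoted in the comment: enter at the 1/4 close, exit at the 1/7 close.
long_ret = trade_log_return(+1, 110.20, 110.08)
short_ret = trade_log_return(-1, 110.20, 110.08)

print(f"long:  {long_ret:.4%}")
print(f"short: {short_ret:.4%}")
```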

The important part is that YES in the training portion it is looking at tomorrow's return to fit the model but since this is training data, and not testing data, that assumption is okay. The backtest is based on completely new data but DOES work on the assumption that the prices at 3:58 are accurate enough and I can trade on the close. I hope this makes sense? And please let me know if I made any logical errors!

Thanks again!

windwine commented 8 years ago

Hi Nakul,

Thank you for your reply. I understand using 03:58pm data for the calculation but my concern lies someplace else.

The snapshot of the data in your e-mail is perfectly OK, and I totally agree with using close info on 01/03 to forecast the next day's return, or the 01/04 VRP and term structure to forecast log(01/05 close) - log(01/04 close). But when you fit the KNN in your code, you are basically using x and y without matching on the dates. Therefore the x and y pair becomes:

01/04 VRP and term structure ------- log(01/04 close) - log(01/03 close).

That is what I mean by forward-looking bias (not an exactly right term, though). If my thinking is correct, the code can be modified to be:

    # Features VRP, TermStr
    x = pd.DataFrame(index = stratdata.index)
    m = np.log(stratdata['SPX'] / stratdata['SPX'].shift(1))
    x['VRP'] = stratdata['VIX'] - pd.rolling_std(stratdata['SPX'],3)
    x['VRP'] = stratdata['VIX'] - pd.rolling_std(m,3)*math.sqrt(252)*100
    x['TermStr'] = stratdata['VIX'] - stratdata['VXV']
    x = x.dropna()
    x = x.ix[:-1]

    # Output VXXret
    y = pd.DataFrame(index = stratdata.index)
    y['1dayVXXret'] = np.log(stratdata['VXX'] / stratdata['VXX'].shift(1))
    # y['2dayVXXret'] = np.log(stratdata['VXX'] / stratdata['VXX'].shift(2))

    # shift returns forward 1 day
    y['1dayVXXret'] = y['1dayVXXret'].shift(-1)
    y = y.dropna()
    y = y.ix[2:]

    y.head()

And we will get vastly different and unstable results.
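In current pandas releases, `pd.rolling_std` and `.ix` are no longer available, so the snippet above does not run unchanged. The following is a rough modern equivalent, not the repository's code: `build_xy` is a hypothetical helper, the input frame is synthetic placeholder data rather than market data, and joining on the date index before dropping rows is a design choice made here so that the features and the target cannot drift apart through separate slicing.

```python
import math
import numpy as np
import pandas as pd

def build_xy(stratdata):
    """Return features and next-day VXX log return on a shared date index."""
    spx_ret = np.log(stratdata["SPX"] / stratdata["SPX"].shift(1))
    realized_vol = spx_ret.rolling(3).std() * math.sqrt(252) * 100

    x = pd.DataFrame(index=stratdata.index)
    x["VRP"] = stratdata["VIX"] - realized_vol
    x["TermStr"] = stratdata["VIX"] - stratdata["VXV"]

    vxx_ret = np.log(stratdata["VXX"] / stratdata["VXX"].shift(1))
    y = vxx_ret.shift(-1).rename("1dayVXXret")

    # Join on the date index, then drop incomplete rows in one step.
    xy = x.join(y).dropna()
    return xy[["VRP", "TermStr"]], xy["1dayVXXret"]

# Synthetic placeholder data, NOT market data.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2013-01-02", periods=10)
stratdata = pd.DataFrame(
    {"SPX": 1460 + rng.normal(0, 5, 10).cumsum(),
     "VIX": 14 + rng.normal(0, 0.5, 10),
     "VXV": 16 + rng.normal(0, 0.5, 10),
     "VXX": 110 + rng.normal(0, 1, 10).cumsum()},
    index=dates,
)

x, y = build_xy(stratdata)
print(x.index.equals(y.index))
```

Because both outputs come from the same joined frame, the row for day T always holds the day-T features next to the return from T to T+1.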

Kind Regards,

Jacky Chen

nakulnayyar commented 8 years ago

Hmm, a bit confused...

Just to be clear, are we okay with the backtest portion, with the dates lining up? I.e. return on 1/4 = LN(close 1/7 / close 1/4)?

Is the concern on fitting the model? The model CAN and does look forward on TRAINING data only. For example, I could feed the 1/4 input returns from 1/4 to 1/10 on TRAINING and then 1/5 input returns from 1/7 to 1/11 etc. As long as that data set is not used on the TEST set then I don't see the issue?

Not sure if that's what you're asking but I didn't really see any significant changes to the code you posted (except using LN returns on SPX before calc STD - good pickup, thanks!)

windwine commented 8 years ago

Hi Nakul,

Indeed, we agree that we should use VRP and term structure at the close of 01/04 to fit the return LN(close 1/7 / close 1/4). But in your code you are using VRP and term structure at the close of 01/04 to fit the return LN(close 1/4 / close 1/3). The same misalignment applies to the test set. So basically we are using day T's info to fit the return from T-1 to T, and that is why we had such good performance. The only change I made in your Python code is to align x and y so that the close of 01/04 is used to fit the return LN(close 1/7 / close 1/4). That's my concern about "information leakage".

Originally, I used R to fit KNN regression models following the original post, and my results were disastrous. When I found your post and modified my R code to carry out the experiment again without any success, I started to compare the detailed output from my R results and yours day by day to see where I went wrong. The alignment "issue" is what I found. Frankly speaking, I am not sure whether I am correct, and hopefully I can figure it out with your help.
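Why a one-day offset could inflate results can be illustrated with purely synthetic numbers. The sketch below says nothing about the actual VXX data: it assumes daily returns with no predictability and a hypothetical close-of-day feature that reacts to the same day's move, then compares the two pairings.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Hypothetical daily returns with no day-to-day predictability.
ret = rng.normal(0.0, 0.02, n)

# Hypothetical close-of-day feature that reacts to the same day's move.
feature = -ret + rng.normal(0.0, 0.005, n)

# Misaligned pairing: day T feature vs. the return from T-1 to T.
corr_same_day = np.corrcoef(feature, ret)[0, 1]

# Aligned pairing: day T feature vs. the return from T to T+1.
corr_next_day = np.corrcoef(feature[:-1], ret[1:])[0, 1]

print(f"same-day corr: {corr_same_day:.3f}")
print(f"next-day corr: {corr_next_day:.3f}")
```

Under these assumptions the feature looks highly informative when paired with the return it was computed alongside, and close to uninformative when paired with the following day's return.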

BTW, how has your vol trading been doing in 2015-2016, since our last LinkedIn message exchange in late 2014? I noticed that many of the vol funds have been crushed since Aug 2015. Hopefully you were doing fine in that period.

Thanks.

Regards,

Jacky

nakulnayyar commented 8 years ago

Ah okay!

(I changed the name from 1dayVXXret to VXXret on my local machine to make things easier to understand; I will update GitHub accordingly.)

So while this code DOES take T-1 close price AND T+0 close price:

y['VXXret'] = np.log(stratdata['VXX'] / stratdata['VXX'].shift(1))

A few lines down I SHIFT it:
y['1dayVXXret'] = y['VXXret'].shift(-1)

VXXret <> 1dayVXXret

So the TRAIN set data looks like this:
                  VRP   TermStr         VXXret     1dayVXXret  1dayVXXret (class)
Date
2013-01-04  10.268297 -2.510000      -0.020834  **-0.001089**                  -1
2013-01-07  10.190552 -2.660001  **-0.001089**      -0.012432                  -1

See the bolding: the return for VXX used on 1/4 is LN(close 1/7 / close 1/4). The VXXret column is NOT USED in the model at all, only the SHIFTED column 1dayVXXret; I only showed it here for explanation. In fact, the actual data being fed looks like this:

                  VRP   TermStr  1dayVXXret
Date
2013-01-04  10.268297 -2.510000          -1
2013-01-07  10.190552 -2.660001          -1

Does that answer the question?
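The step from the forward return to the -1 shown in the last table can be sketched as follows. The labelling rule is an assumption made for illustration (+1 for a positive forward return, -1 otherwise); the thread does not show the exact line that creates the class column, and `label` is a name chosen here.

```python
import numpy as np
import pandas as pd

# Forward returns copied from the 1dayVXXret column in the table above.
fwd_ret = pd.Series(
    [-0.001089, -0.012432],
    index=pd.to_datetime(["2013-01-04", "2013-01-07"]),
    name="1dayVXXret",
)

# Assumed rule: +1 for a positive forward return, -1 otherwise.
label = pd.Series(np.where(fwd_ret > 0, 1, -1), index=fwd_ret.index,
                  name="1dayVXXret")
print(label)
```

Keeping the date index on the label series makes it possible to check later that each label sits on the same date as the features it is paired with.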

Speaking of your R code, what settings did you use on the K-nearest model (number of neighbors, etc.)? I can't answer this, but is it perhaps due to a difference in model code between R and sklearn?

I actually altered some of my vol trading earlier in 2015 after I left my old job. I was quite aggressively positioned there, usually using something like 1x2s or call overwrites on SVXY. I started using more butterflies and calendars to cap the risk, giving up some return to sleep at night. Performance was far more consistent but not as high, and during August there was a drawdown, but it was short-lived. This equity curve I posted on the blog for YTD is fairly representative:

[Image: YTD equity curve]

nakulnayyar commented 8 years ago

Thanks for sending your code, the problem is here:

temp = pd.concat([x,y],axis=1)

temp = temp.dropna()
x = temp.iloc[:,0:2]
y = temp.iloc[:,2:3]

y selects the wrong column of temp. Instead of column 4 ('1dayVXXret') it returns column 3 ('VXXret').

you want:

y = temp.iloc[:,3:4]

BUT there is a problem doing it this way. Each time you run the code, x,y change and therefore temp changes and you end up losing 1 column every time you run the code. You can change x,y to z,q or something throughout or just use the existing code as is.
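Both points can be sidestepped by selecting columns by name and writing the result to new variables. The sketch below uses the two rows shown earlier in the thread; `x_fit` and `y_fit` are hypothetical names standing in for the "z, q or something" suggestion.

```python
import pandas as pd

dates = pd.to_datetime(["2013-01-04", "2013-01-07"])
x = pd.DataFrame({"VRP": [10.268297, 10.190552],
                  "TermStr": [-2.510000, -2.660001]}, index=dates)
y = pd.DataFrame({"VXXret": [-0.020834, -0.001089],
                  "1dayVXXret": [-0.001089, -0.012432]}, index=dates)

temp = pd.concat([x, y], axis=1).dropna()

# Positional slice 2:3 picks the third column, which here is VXXret.
by_position = temp.iloc[:, 2:3]

# Selecting by name does not depend on column order, and writing to new
# variables leaves x and y untouched if the cell is run again.
x_fit = temp[["VRP", "TermStr"]]
y_fit = temp[["1dayVXXret"]]

print(list(by_position.columns), list(y_fit.columns))
```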

Hope this helps!

windwine commented 8 years ago

I only got one column in y, which is 1dayVXXret. Are you using a different version of the code than the one posted online? My use of temp is not an elegant solution; I was only using it to show the mismatch between the old x and y. Thanks.

nakulnayyar commented 8 years ago

Hi there, thanks for the comment. Yes, using log returns is more 'correct'; I have it updated on my local machine but have not updated it here. I have a number of tests I am doing, and more I want to do but haven't found the time for; when I do, I will update. Thanks!

On Tuesday, June 28, 2016 6:52 AM, IndianCuriosity <notifications@github.com> wrote:

Hi Mukul, thanks for the strategy. Why didn't you calculate the actual vol of 3 days? I thought this is correct:

    m = np.log(stratdata['SPX'] / stratdata['SPX'].shift(1))
    x['VRP'] = stratdata['VIX'] - pd.rolling_std(m,3)*math.sqrt(252)*100

Instead you used:

    x['VRP'] = stratdata['VIX'] - pd.rolling_std(stratdata['SPX'],3)

Any specific assumptions you are making? Thanks
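The difference between the two VRP definitions is largely one of units, which a small comparison makes visible. The closes below are made-up round numbers rather than SPX data, and the calls use the current pandas rolling API instead of `pd.rolling_std`.

```python
import math
import numpy as np
import pandas as pd

# Hypothetical index closes, for illustration only.
spx = pd.Series([1460.0, 1466.0, 1462.0, 1457.0, 1461.0, 1472.0])

# Rolling std of price levels: measured in index points.
std_of_levels = spx.rolling(3).std()

# Rolling std of daily log returns, annualized and in percent,
# which is the same unit VIX is quoted in.
log_ret = np.log(spx / spx.shift(1))
realized_vol = log_ret.rolling(3).std() * math.sqrt(252) * 100

print(pd.DataFrame({"std_of_levels": std_of_levels,
                    "realized_vol_pct": realized_vol}))
```

Subtracting the first column from VIX mixes index points with volatility points, while the second keeps both sides of the subtraction in annualized percent.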