luisdamiano / gsoc17-hhmm

Bayesian Hierarchical Hidden Markov Models applied to financial time series, a research replication project for Google Summer of Code 2017.
Creative Commons Attribution Share Alike 4.0 International

Bayesian Hierarchical Hidden Markov Models applied to financial time series

Results of our research replication

Read our resulting write-ups online:

Read on to learn more about our project!

Primary goals

This project is part of the R Project for Statistical Computing in Google Summer of Code 2017. The goal is to replicate research in Hierarchical Hidden Markov Models (HHMM) applied to financial data. This model is a generalization of Hidden Markov Models (HMM), which in turn are part of the Dynamic Bayesian Networks (DBN) family. We identified four academic works with interesting ideas and applications that provide neither data nor code. Two of these four candidate works were replicated as part of this project. Replication allows future readers to assess the credibility of the results and works as a shortcut for those wanting to integrate this model into current research code (for example, for backtesting). We also produced a brief summary of the mathematical treatment of HMM.

Secondary goals

The concept of hidden states could enrich many trading strategies. A detailed replication that provides literature review, literate programming and reproducible code will allow future readers to implement HHMM logic in existing trading frameworks (for example as covariates, signals and benchmarks). In other words, existing R packages may in the future leverage the code published as part of this project. While the development of such a general framework is not part of this project, the deliverables were written with this eventual future enterprise in mind.

Additionally, the code and the report published as part of this project could be adapted and/or expanded for other educational purposes. While we did not expect this to happen within the project lifetime, the deliverables were written with the intention of serving as a first draft for the later development of a Case Study to be proposed to the Stan Development Team.

Brief introduction

Jangmin et al. (2004) first proposed a HHMM to mimic the dynamics of price trends in stock markets. Hassan and Nath (2005) is one of the most popular original works proposing a HMM for financial data. Based on daily data, they use four latent states to forecast stock market closing prices. Tayal (2009) builds upon technical analysis concepts and proposes a very interesting feature extraction procedure and its corresponding HHMM for high-frequency data. His work focuses mainly on the statistical side of the model, with great emphasis on data description, inference and goodness-of-fit assessment. Finally, Sandoval and Hernández (2015) propose a very concrete application of HHMM to high-frequency trading of foreign currency. The paper improves Tayal's treatment of the change in trade volume.

Although the selected works develop different financial models in diverse market contexts (asset, exchange, frequency and strategy), they share the underlying statistical logic: model parameters are time variant and change non-linearly according to unobservable discrete market states. As for the replications, this statistical common core is a strong incentive to write reusable code that will naturally allow for a certain degree of generalization.
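To make this shared logic concrete, here is a minimal sketch of a series whose mean and volatility switch with a hidden discrete state. It is written in Python with the standard library only and is not part of the repository's R and Stan code. The function name, the two-state setup and all numeric values are illustrative assumptions, not estimates from any of the replicated papers.

```python
import random


def simulate_regime_switching(n_steps, transition, means, sds, seed=0):
    """Simulate a hidden Markov chain and Gaussian observations whose
    mean and standard deviation depend on the current hidden state."""
    rng = random.Random(seed)
    n_states = len(means)
    state = rng.randrange(n_states)  # arbitrary uniform initial state
    states, observations = [], []
    for _ in range(n_steps):
        states.append(state)
        # The observation distribution is selected by the hidden state.
        observations.append(rng.gauss(means[state], sds[state]))
        # Move to the next hidden state using the current row of the
        # transition matrix as sampling weights.
        state = rng.choices(range(n_states), weights=transition[state])[0]
    return states, observations


# Hypothetical "calm" (state 0) and "turbulent" (state 1) regimes.
transition = [[0.95, 0.05],
              [0.10, 0.90]]
means = [0.001, -0.002]
sds = [0.005, 0.020]

states, returns = simulate_regime_switching(500, transition, means, sds, seed=42)
```

Only `returns` would be visible to an analyst; the inference problem addressed by the replications is to recover `states` and the parameters from the observed series alone.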

Running the replications

We encourage readers to try the code themselves, possibly using data of their own interest.

All the work is organized in a few folders at root level:

Each folder may have inner folders for R, Stan and RMarkdown code.

Prerequisites

Contributing

Reach us at #r-finance (freenode.net).

Authors

License

Unless otherwise noted, Bayesian Hierarchical Hidden Markov Models applied to financial time series is licensed under CC-BY-SA 4.0. See the LICENSE file for details.

The data included for the replication of Tayal (2009) was retrieved from Thomson Reuters Datastream Tick History. Thomson Reuters has generously agreed to allow us to make the data available under the CC-BY-NC license. Permission was granted in Thomson Reuters Datastream support Case: 06017553 "Process [for] Obtaining Permission to Distribute Data for Research Purposes" (case not publicly accessible, for historical/source reference only). See the LICENSE file in the /tayal2009/data folder for details. By accessing and using this data, you agree to be bound by the terms of this license, or to obtain a different license from Thomson Reuters. Any copying or distribution of this data must preserve the license.

The authors extend their thanks to the Thomson Reuters legal team for supporting reproducible science by allowing the distribution of this data under a permissive Creative Commons license.

Acknowledgments

On our GSoC Project

In the beginning...

I first heard about GSoC in March 2017 from Brian Peterson. A long-term contributor to the R programming language with a prominent role in the R/Finance community, he has successfully mentored students under this program to produce Open Source software. As a student in applied statistics with working experience in Corporate Finance and Asset Management, I enjoy doing research on quantitative topics related to financial markets, time series, Bayesian statistics, R and Stan. We agreed that research replication on Hierarchical Hidden Markov Models applied to financial time series would be fruitful from both the computational and the finance point of view, providing the R community with both software and analysis based on domain knowledge.

He introduced me to Michael, a well-rounded PhD student in Statistics at Rice University with experience in Bayesian statistics, R, Stan and financial applications. Although he first warned us that we might be setting the bar too high for a summer project, he said that the attempt was worthwhile and agreed to team up.

First steps

I started the literature review before being accepted for the project. I believed that the review would be useful for my personal future research even if the proposal had not been accepted. This proved to be key, as time turned out to be our cruelest constraint. I was fortunate that my school had prepared me well to do research on my own, since interaction with my mentors was still incipient at this point.

In preparation for the actual papers, I first coded some basic routines for HMM in the Stan programming language. I am very grateful to the whole Stan Development Team as I relied greatly on the Stan manual (Stan Development Team 2017). I also created R code for diagnostics and visualizations as well as other common code for RMarkdown. The visualization functions would later prove very useful for diagnostics and effectively sped up the writing stage.

Next, I started with Hassan (2005). Following the calibration by simulation methodology (Cook, Gelman, and Rubin 2006), I wrote a few routines in R to draw simulated samples from several variations of the Input-Output Hidden Markov Model (IOHMM). I had read about this approach on blogs and other information sources, but Michael pointed me to the paper as a more robust framework. After coding the models in the Stan language (IOHMM with a regression model and IOHMM with Gaussian mixture variants), I ran many diagnostics to ensure that the software would recover the parameters correctly. As the sampler suffered from divergences, I devised an alternative IOHMM with a hierarchical Gaussian mixture that vastly improved estimation efficiency.
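The idea behind the Cook, Gelman, and Rubin (2006) check can be illustrated with a toy model: draw a parameter from the prior, simulate data given that parameter, compute the posterior, and record the posterior quantile of the true value. If the software is correct, those quantiles are uniform on [0, 1]. The sketch below is in Python (not the project's R or Stan code) and assumes a conjugate normal model so that the exact posterior can stand in for the sampler; with an IOHMM the quantile would instead be estimated from posterior draws. The function name and settings are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def posterior_quantiles(n_replications=2000, n_obs=10, seed=1):
    """Return the posterior quantile of the true parameter in each replication.

    Assumed toy model: theta ~ Normal(0, 1) and y_i ~ Normal(theta, 1).
    The posterior is then Normal(sum(y) / (n + 1), sd = sqrt(1 / (n + 1))).
    """
    rng = random.Random(seed)
    quantiles = []
    for _ in range(n_replications):
        theta = rng.gauss(0.0, 1.0)                        # draw from the prior
        y = [rng.gauss(theta, 1.0) for _ in range(n_obs)]  # simulate data
        post_mean = sum(y) / (n_obs + 1)                   # conjugate update
        post_sd = (1.0 / (n_obs + 1)) ** 0.5
        # Quantile of the true value under the (here exact) posterior.
        quantiles.append(NormalDist(post_mean, post_sd).cdf(theta))
    return quantiles


quantiles = posterior_quantiles()
print("mean posterior quantile:", round(fmean(quantiles), 3))
```

A mean far from 0.5, or a visibly non-flat histogram of `quantiles`, would point to a bug in either the simulator or the fitting code.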

Confident that our software worked as intended, we finally studied real data. Quantmod (Ryan 2008) made data acquisition and pre-processing trivial. We obtained satisfying results from our model and worked on a comprehensive write-up.

Naturally, the following step was the second replication: Jangmin et al. (2004). It involves a semi-supervised HHMM that seemed overly complicated at first glance. After many days trying to write programming routines that would reflect the model truthfully, I sent an e-mail to my mentors with the subject "Stuck!". This is where being supported by good mentors makes the difference. Michael studied the paper with me and we realised that most of the complexity was, in fact, lack of clarity in the definitions and the methodology. Moreover, during the discussion we observed that the article proposed an ad-hoc solution to a very specific situation, making it of little interest to the broad R community. Consequently, Michael, Brian and I agreed to skip this work and head for the next.

No doubt, Tayal (2009) looked challenging from the very beginning. The feature engineering procedure, certainly very interesting from the financial point of view, was described thoroughly but involved a good deal of coding time nonetheless. We also anticipated that working with high-frequency data would dramatically increase the time we needed to run our computations. Finally, we expected the generalized forward algorithm to be difficult to implement in Stan and, thanks to the insight provided by Michael, we decided to restate the HHMM as an equivalent HMM. After a detailed study of model constraints, I succeeded in simplifying the HHMM into a HMM with only three free parameters for the hidden dynamics. Again, we tested our software with simulations.
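Once a model is expressed as an ordinary HMM, the standard forward algorithm is enough to evaluate the likelihood. The following is a minimal Python sketch of that recursion in log space; it is an illustration only, not the project's Stan implementation, and it does not reproduce the specific constrained transition matrix derived for Tayal (2009). All names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:  # every term has probability zero
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(log_init, log_trans, log_emis):
    """Log-likelihood of an observed sequence under an HMM.

    log_init[k]     = log p(z_1 = k)
    log_trans[j][k] = log p(z_t = k | z_{t-1} = j)
    log_emis[t][k]  = log p(y_t | z_t = k), evaluated beforehand
    """
    n_states = len(log_init)
    # Initialisation: log alpha_1(k) = log p(z_1 = k) + log p(y_1 | z_1 = k)
    log_alpha = [log_init[k] + log_emis[0][k] for k in range(n_states)]
    for t in range(1, len(log_emis)):
        # Recursion: marginalise over the previous state, then add the emission.
        log_alpha = [
            log_sum_exp([log_alpha[j] + log_trans[j][k] for j in range(n_states)])
            + log_emis[t][k]
            for k in range(n_states)
        ]
    # Termination: marginalise over the final state.
    return log_sum_exp(log_alpha)
```

Transitions that the model structure forbids can be encoded as `-math.inf` entries in `log_trans`, which is one way a flattened hierarchical model can keep its constraints while still using the plain recursion.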

Although implementing the generalized algorithms would be a very worthwhile enterprise on its own, our approximation was acceptable given the amount of data and the highly constrained HHMM we were working with. Furthermore, we created a faster sampler for walk-forward backtesting that implements only the bare minimum computations needed, leaving out many other useful and interpretable quantities present in our full sampler. Implementing in Stan a fully Bayesian version of all the algorithms involved in a HHMM, mostly based on the original work of Fine, Singer and Tishby (1998), would be an involved enough project to be considered part of the contributions required by a PhD program.
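For readers unfamiliar with walk-forward backtesting, the scheme repeatedly fits the model on a window of past observations and evaluates it on the observations that immediately follow, then moves forward in time. The Python sketch below generates such index windows. It assumes a fixed-length rolling training window, which is only one common variant; the window sizes and the function name are hypothetical and are not taken from the project's code.

```python
def walk_forward_windows(n_obs, train_size, test_size):
    """Yield (train, test) index ranges for a rolling walk-forward scheme.

    Each training window has `train_size` observations and is followed
    immediately by `test_size` out-of-sample observations. The start
    then advances by `test_size`, so test windows never overlap.
    """
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


# Example: 10 observations, fit on 4 and evaluate on the next 2.
for train, test in walk_forward_windows(10, 4, 2):
    print(list(train), "->", list(test))
```

Because the model is refitted once per window, the cost of each fit is multiplied by the number of windows, which is why a sampler that computes only the essential quantities pays off in this setting.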

Brian offered to acquire the data as the high frequency dataset used in the original article was not publicly available. Data processing was trivial thanks to a parser in the FinancialInstrument R package (Carl, Peterson and See 2014). I only had to address a few issues as I was not using a Unix platform (won't let this happen again, sorry!).

Interaction with mentors

We agreed on a rich interaction methodology:

We were also involved in several aperiodic, on-demand interaction sessions (mostly e-mails and chats).

In retrospect...

I am very happy with the whole GSoC experience. Long programming hours improved my R and Stan skills greatly, but I am mostly grateful for many other learning opportunities: teaming up for collaborative programming, creating and maintaining a GitHub repository, interacting with the open source community and discussing my ideas with skillful and experienced mentors.

I would recommend any student interested in software development and programming languages to take up a GSoC Project. Just send an e-mail to your favourite mailing list asking for a mentor. In the very rare case where nobody volunteers, you can ask for further guidance in our chatroom.

If I had to give future students one piece of advice it would be this: make sure you choose a topic you really like. This way, you will enjoy your GSoC journey as much as I did.

References

Carl, P., Peterson, B. G., & See, G. (2014). FinancialInstrument: Financial Instrument Model Infrastructure for R. R package version 1.2.0. URL https://CRAN.R-project.org/package=FinancialInstrument.

Cook, S. R., Gelman, A., & Rubin, D. B. (2006). Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics, 15(3), 675-692.

Hassan, M. R., & Nath, B. (2005). Stock market forecasting using hidden Markov model: a new approach. In Intelligent Systems Design and Applications, 2005. ISDA'05. Proceedings. 5th International Conference on (pp. 192-196). IEEE.

Jangmin, O., Lee, J., Park, S. B., & Zhang, B. T. (2004). Stock trading by modelling price trend with dynamic Bayesian networks. Intelligent Data Engineering and Automated Learning–IDEAL 2004, 794-799.

Ryan, J. A. (2008). quantmod: Quantitative Financial Modelling Framework. R package version 0.4-10. URL https://CRAN.R-project.org/package=quantmod.

Sandoval, J., & Hernández, G. (2015). Computational Visual Analysis of the Order Book Dynamics for Creating High-frequency Foreign Exchange Trading Strategies. Procedia Computer Science, 51, 1593-1602.

Stan Development Team (2017). Stan Modeling Language: User’s Guide and Reference Manual. Version 2.16.0.

Tayal, A. (2009). Regime switching and technical trading with dynamic Bayesian networks in high-frequency stock markets (Master's dissertation, University of Waterloo).