Closed ssanderson closed 9 years ago
Still working on this, just didn't want the browser to crash.
@twiecki @gusgordon @justinlent these are my main comments from reading today and yesterday.
TL/DR:
The code that converts the `BacktestResult` object on Quantopian into PyFolio's formats should probably live in the same codebase as `BacktestResult` (i.e. on Quantopian, and/or Zipline if/when it gets ported). In particular, I think we should make it easy/possible for people to access PyFolio visualizations on the research platform without actually using or understanding PyFolio at all. Power users should be able to dig into the library internals to do whatever cool custom things they want.

Most of these comments are negative b/c I'm focusing on the stuff that I think should change, and I'm trying to go as fast as possible so that Zipline projects don't fall behind. So I do want to say that I think this is awesome work, and I'm super excited to see it in the hands of users.
I love these ideas @ssanderson . Definitely agree with exposing more informative docstrings, as well as making it more user friendly for people just analyzing backtest data (without live/out-of-sample data) since the tearsheet is still extraordinarily useful for this. Also totally agree that we should make it work in "offline mode" by removing the internet dependency (which is really only used to pull SPY data to compute beta, and the risk factors from the Fama-French site, I believe).
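To make the offline-mode idea concrete, here's a minimal sketch of a cache-first benchmark loader: serve returns from a packaged/cached CSV, and hit the network only when the cache is stale. All names here (`load_benchmark`, `fetch_fresh`, the CSV layout) are invented for illustration, not pyfolio's actual downloader:

```python
import os
import pandas as pd

def load_benchmark(fetch_fresh, end_date, cache_path):
    """Serve benchmark returns from a local CSV when it's fresh enough.

    `fetch_fresh` is a hypothetical stand-in for the existing web
    downloader; it is only called when the cached data ends before
    `end_date`.
    """
    if os.path.exists(cache_path):
        cached = pd.read_csv(cache_path, index_col=0, parse_dates=True)['SPY']
        if cached.index[-1] >= pd.Timestamp(end_date):
            return cached.loc[:end_date]  # fully offline
    fresh = fetch_fresh(end_date)
    fresh.to_frame('SPY').to_csv(cache_path)  # refresh the cache on disk
    return fresh
```

Shipping a seed CSV in `data/` would then make the common case (backtests ending before the packaged data does) work with no network at all.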
Thanks @ssanderson. A few comments/questions:
`returns` is passed to about 30 functions. Is there a better convention when we have only ~10 things we ever pass, and each function uses a subset of those 10 things? I am guessing no, but it just seems strange to describe the same object that many times. @twiecki probably has thoughts about this too. I do definitely agree with your point about describing the objects more thoroughly.

I think we factored this out into individual issues that are mostly addressed now.
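One convention for this (pandas does something similar internally with its docstring `Substitution`/`Appender` decorators) is to write the shared description of `returns` once and splice it into each docstring. A hypothetical sketch, with invented names:

```python
# Shared parameter description, written once. The wording and the
# decorator below are illustrative, not pyfolio's actual API.
RETURNS_DOC = """returns : pd.Series
        Daily noncumulative returns of the strategy.
        Index is a DatetimeIndex; values are floats
        (e.g. 0.01 means a 1% return that day)."""

def with_common_docs(func):
    """Fill {returns} placeholders in a docstring from the shared text."""
    func.__doc__ = func.__doc__.format(returns=RETURNS_DOC)
    return func

@with_common_docs
def plot_drawdowns(returns):
    """Plot the top drawdown periods.

    Parameters
    ----------
    {returns}
    """
    raise NotImplementedError  # plotting body omitted in this sketch
```

That keeps one canonical description of the object while still giving every function a complete docstring under `help()`.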
Rather than open a billion issues at once, I'm going to open a single issue here. We can break out more substantive issues into separate pieces later if we want.
High Level Thoughts
Docstrings
We have lots of docstrings that read like this:
While it's awesome that we have docstrings (great job on enforcing this convention!), this isn't actually all that useful to either a user or a maintainer of this function, because it doesn't explain how inputs and outputs are to be represented. For example, we immediately have questions like:
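For comparison, a docstring that answers those questions pins down the DataFrame layout explicitly. A hypothetical example (the function and its internals are invented for illustration, not pyfolio's real code):

```python
import pandas as pd

def get_portfolio_alloc(positions):
    """Compute each asset's share of total portfolio exposure.

    Parameters
    ----------
    positions : pd.DataFrame
        Indexed by trading day (DatetimeIndex).
        One column per asset, plus a 'cash' column; values are
        end-of-day dollar holdings (negative for short positions).

    Returns
    -------
    pd.DataFrame
        Same index/columns as `positions`, with each row divided by
        the row's gross exposure, so absolute values sum to 1.0.
    """
    return positions.divide(positions.abs().sum(axis=1), axis=0)
```

The Parameters/Returns sections do the real work here: they say exactly how the DataFrames are shaped, not just that DataFrames go in and come out.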
This is uniquely a problem for PyFolio, because nearly all of its APIs are based around `DataFrame`s, which allow for a great deal of flexibility in how one represents data. In many other libraries, identifying the types of inputs/outputs is enough. If I'm using a web framework and a function's docstring says "accepts a GET Request object and returns a Response object", that's probably enough for me to understand how to use the function. This is decidedly not the case for us, and so we need to be careful to explain how we expect inputs and outputs to be structured.

Live Trading vs Backtesting
Many of the top-level functions in `tears.py` expect a notion of a "live trading start date". In some cases there are multiple ways to specify this date, and often if the user doesn't specify the input then PyFolio chooses a date for the user. This suggests that PyFolio is only/primarily useful for analyzing simulations which have both in-sample and out-of-sample data. This happens to be precisely the data for which we originally developed these visualizations, but it's mostly not the data that I expect our users to have. It's certainly not data that's currently easy to retrieve on the Quantopian Research Platform. There are a few questions that follow from this:

Relationship to Quantopian Research Platform APIs
There are a few "data extraction" functions that take data in the format stored in the `BacktestResult` object on Quantopian and convert it into a format expected by PyFolio for data analysis. I think these functions belong in the Quantopian codebase, not in PyFolio. This is for a few reasons:

- If a user has a `BacktestResult`, the simplest possible API is for them to just do something like `result = get_backtest(...); result.show_tearsheet()`. This is only possible if the code that knows how to convert to PyFolio's representation lives inside the `BacktestResult` object.
- Having the data conversion happen inside an internal API also means that PyFolio developers will have more freedom to change tearsheet APIs without breaking existing notebooks that have been shared.
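A rough sketch of what that could look like on the Quantopian side. The `BacktestResult` internals shown here (`daily_performance`, the method names) are assumptions for illustration, not the real object:

```python
import pandas as pd

class BacktestResult:
    def __init__(self, daily_performance):
        # Assumed internal format: DataFrame with a 'returns' column,
        # indexed by date strings.
        self.daily_performance = daily_performance

    def _to_pyfolio_returns(self):
        """Convert internal storage into the pd.Series pyfolio expects."""
        returns = self.daily_performance['returns'].copy()
        returns.index = pd.to_datetime(returns.index)
        return returns

    def show_tearsheet(self):
        # The conversion lives behind this method, so pyfolio can change
        # its input conventions without breaking shared notebooks.
        import pyfolio
        pyfolio.create_returns_tearsheet(self._to_pyfolio_returns())
```

The user-facing surface is then just `get_backtest(...).show_tearsheet()`, with the format knowledge hidden in `_to_pyfolio_returns`.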
External Data Acquisition
I feel pretty strongly that if you have a returns stream on your local machine, then it should be possible to run the default tearsheets without an internet connection. In addition to improving the user experience, this has testing and reproducibility benefits. This probably means that we should package the most commonly used benchmarks in the `data` folder (possibly compressed if they're large). The challenge here is that the benchmark data needs to be updated to allow visualizations of backtests/live results with recent data. I don't have a great idea for solving this except for adding some sort of caching layer, or just doing frequent releases with updated benchmarks. Would be curious to hear thoughts from @twiecki and @ehebert on this.

Specific Implementation Notes
Some of these are redundant with the prose above. I wrote these notes as I was reading and then went back and summarized. I also read `tears.py` and `pos.py` the most thoroughly, since I think `tears.py` is the most important public API, and `pos.py` is the next file I read before I realized I was spending too much time in the weeds. I'd like to read the rest of the codebase more thoroughly at some point, but I think @fawce and @KarenRubin will both have aneurysms from the delay that would cause to shipping https://github.com/quantopian/zipline/pull/630...

tears.py
create_returns_tearsheet

- I can't tell what the `live_start_date` parameter does from reading the docstring.
- I can't tell what the `backtest_days_pct` parameter does. These seem to have the same informational content: why are there two ways of passing this data?
- Can we package the benchmark data in `data` so that this can run offline?
- `set_plot_defaults` clobbers the user's matplotlib params. We should use `matplotlib.style.context` instead.
- We call `set_plot_defaults` from `create_returns_tearsheet`. Should we be doing it for the other plots as well?
- We call `load_portfolio_risk_factors`, but the only thing we do with that data is pass it to `plot_rolling_risk_factors`. Can we load the data directly in the function that actually uses it?

create_bayesian_tearsheet

create_full_tearsheet

- `txn.make_transaction_frame`. (This happens in several locations.)

pos.py
- `get_portfolio_alloc` should use `axis=1` instead.
- `get_long_short_pos` has a `gross_lev` argument that's never used. Can we remove it?
- `extract_pos` should probably be renamed to `extract_positions`. Does this even belong in PyFolio? (See above)
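Returning to the `set_plot_defaults` note under `tears.py`: a sketch of the scoped alternative using `matplotlib.style.context`, which applies defaults only for the duration of one tearsheet instead of mutating the user's global rcParams. `PYFOLIO_STYLE` and the tearsheet body are stand-ins, not pyfolio's real settings:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the example runs anywhere
import matplotlib.pyplot as plt

# Invented stand-in for whatever set_plot_defaults currently sets.
PYFOLIO_STYLE = {'axes.grid': True, 'lines.linewidth': 2.0}

def create_returns_tearsheet(returns):
    # Style changes apply only inside this block; the user's rcParams
    # are restored when the `with` block exits.
    with plt.style.context(PYFOLIO_STYLE):
        fig, ax = plt.subplots()
        ax.plot(returns)
    return fig
```

The key difference from calling `matplotlib.rcParams.update(...)` directly is that the user's settings survive the tearsheet call untouched.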