quantopian / pyfolio

Portfolio and risk analytics in Python
https://quantopian.github.io/pyfolio
Apache License 2.0
5.62k stars 1.76k forks

API Questions/Issues Meta-Issue #75

Closed: ssanderson closed this issue 9 years ago

ssanderson commented 9 years ago

Rather than open a billion issues at once, I'm going to open a single issue here. We can break out more substantive issues into separate pieces later if we want.

High Level Thoughts

Docstrings

We have lots of docstrings that read like this:

def get_portfolio_alloc(positions_vals):
    """
    Determines a portfolio's allocations.

    Parameters
    ----------
    positions_vals : pd.DataFrame
        Contains position values or amounts.

    Returns
    -------
    positions_alloc : pd.DataFrame
        Positions and their allocations.
    """

While it's awesome that we have docstrings (great job on enforcing this convention!), this isn't actually all that useful to either a user or a maintainer of this function, because it doesn't explain how inputs and outputs are to be represented. For example, we immediately have questions like: Is `positions_vals` indexed by date? Are its columns assets? Does it contain dollar values or share amounts?

This is uniquely a problem for PyFolio, because nearly all of its APIs are based around DataFrames, which allow for a great deal of flexibility in how one represents data. In many other libraries, identifying the types of inputs/outputs is enough. If I'm using a web framework and a function's docstring says "accepts a GET request object and returns a Response object", that's probably enough for me to understand how to use the function. This is decidedly not the case for us, and so we need to be careful to explain how we expect inputs and outputs to be structured.
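As a concrete illustration, here's what a more fully specified docstring for `get_portfolio_alloc` might look like. The index/column conventions and the normalization by gross exposure are my assumptions for the sketch, not settled pyfolio behavior:

```python
import pandas as pd

def get_portfolio_alloc(positions_vals):
    """
    Determine a portfolio's percent allocations.

    Parameters
    ----------
    positions_vals : pd.DataFrame
        Dollar value held in each position, one row per trading day.
        - index : pd.DatetimeIndex of trading days.
        - columns : one column per asset, plus a 'cash' column.
        - values : float net dollar exposure (negative for shorts).

    Returns
    -------
    positions_alloc : pd.DataFrame
        Same shape as `positions_vals`; each row is divided by that
        day's gross exposure, so absolute values in a row sum to 1.
    """
    # normalize each day's positions by that day's gross (absolute) exposure
    return positions_vals.divide(positions_vals.abs().sum(axis=1), axis=0)
```

Whatever the exact conventions end up being, spelling them out at this level of detail answers the "how is the DataFrame structured?" questions without the reader having to read the implementation.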

Live Trading vs Backtesting

Many of the top-level functions in tears.py expect a notion of a "live trading start date". In some cases there are multiple ways to specify this date, and often if the user doesn't specify the input then PyFolio chooses a date for the user. This suggests that PyFolio is only/primarily useful for analyzing simulations which have both in-sample and out-of-sample data. This happens to be precisely the data for which we originally developed these visualizations, but it's mostly not the data that I expect our users to have. It's certainly not data that's currently easy to retrieve on the Quantopian Research Platform. There are a few questions that follow from this:

  1. Is it possible/reasonable to use PyFolio just for analyzing backtests, or just for analyzing live algorithms? I think that the answer to this is yes, but I'm genuinely curious to hear about this.
  2. If the answer to (1) is "Yes", then how can we restructure these APIs to support pure in-sample or pure out-of-sample returns streams?
  3. If the answer to (1) is "No", then I wonder if we should re-evaluate whether it makes sense to distribute the library as widely as we've been planning. The set of people who have lots of backtest data that transitions into live trading data is more or less just our data science team, as far as I can tell. In particular, it seems like this wouldn't be as useful as originally thought on the research platform.
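To make (2) concrete, one option is to make the live trading start date an optional argument and treat its absence as "the whole stream is in-sample". A minimal sketch; the function name and signature here are hypothetical, not existing pyfolio API:

```python
import pandas as pd

def split_returns(returns, live_start_date=None):
    """Split a returns series into in-sample and out-of-sample pieces.

    If live_start_date is None (a pure backtest, or a pure live stream),
    the whole series is treated as in-sample and the OOS piece is empty,
    so downstream plots can simply skip their out-of-sample panels.
    """
    if live_start_date is None:
        return returns, returns.iloc[0:0]  # empty OOS slice, same dtype
    live_start_date = pd.Timestamp(live_start_date)
    in_sample = returns[returns.index < live_start_date]
    out_of_sample = returns[returns.index >= live_start_date]
    return in_sample, out_of_sample
```

With this shape, the tear-sheet functions never have to guess a split date on the user's behalf; they just render fewer panels when the OOS piece is empty.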

Relationship to Quantopian Research Platform APIs

There are a few "data extraction" functions that take data in the format stored in the BacktestResult object on Quantopian and convert it into a format expected by PyFolio for data analysis. I think these functions belong in the Quantopian codebase, not in PyFolio. This is for a few reasons:

  1. As an open source project, PyFolio shouldn't depend/rely on implementations of non-open-source software.
  2. I think it's more likely that Quantopian will change its internal data format than it is that PyFolio's analysis routines will change their expected input format. Since it's easier to synchronize changes within a single codebase than it is to synchronize across codebases, I'd rather make it Quantopian's responsibility to update the compatibility layer when changes are made to their internal data structures.
  3. Most importantly, if a Quantopian user wants to use PyFolio to make a standard visualization from their BacktestResult, the simplest possible API is for them to just do something like result = get_backtest(...); result.show_tearsheet(). This is only possible if the code that knows how to convert to PyFolio's representation lives inside the BacktestResult object.

Having the data conversion happen inside an internal API also means that PyFolio developers will have more freedom to change tearsheet APIs without breaking existing notebooks that have been shared.
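A sketch of what (3) could look like on the Quantopian side. The `BacktestResult` internals shown here are invented for illustration, and the tear-sheet entry point (`pyfolio.create_returns_tear_sheet`) is assumed; the point is only where the conversion layer lives:

```python
import pandas as pd

class BacktestResult(object):
    """Hypothetical wrapper around Quantopian's internal backtest format.

    The conversion layer lives here, in the Quantopian codebase, so
    pyfolio itself never depends on the internal representation.
    """

    def __init__(self, daily_performance):
        # internal format: assumed here to be a DataFrame with a
        # 'returns' column indexed by trading day (invented for the sketch)
        self._perf = daily_performance

    def to_pyfolio_returns(self):
        # adapt the internal format into the pd.Series pyfolio expects
        return self._perf['returns'].copy()

    def show_tearsheet(self):
        # imported lazily so the wrapper loads without pyfolio installed
        import pyfolio
        pyfolio.create_returns_tear_sheet(self.to_pyfolio_returns())
```

A user then writes `result = get_backtest(...); result.show_tearsheet()`, and Quantopian can change `_perf`'s layout without coordinating a pyfolio release.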

External Data Acquisition

I feel pretty strongly that if you have a returns stream on your local machine, then it should be possible to run the default tearsheets without an internet connection. In addition to improving the user experience, this has testing and reproducibility benefits. This probably means that we should package the most commonly used benchmarks in the data folder (possibly compressed if they're large). The challenge here is that the benchmark data needs to be updated to allow visualizations of backtests/live results with recent data. I don't have a great idea for solving this except for adding some sort of caching layer, or just doing frequent releases with updated benchmarks. Would be curious to hear thoughts from @twiecki and @ehebert on this.
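One shape the caching layer could take: check a local on-disk cache first, and only hit the network on a miss. Everything here (the cache path, function name, CSV layout) is a hypothetical sketch; a real version could additionally fall back to benchmarks packaged in the data folder when both cache and network are unavailable:

```python
import os
import pandas as pd

# hypothetical cache location; a real version would make this configurable
DEFAULT_CACHE_DIR = os.path.expanduser('~/.pyfolio/data')

def load_benchmark(symbol, fetch=None, cache_dir=DEFAULT_CACHE_DIR):
    """Load benchmark returns from a local cache, fetching only on a miss.

    `fetch` is an optional callable that hits the network; once the cache
    is warm, no connection is needed at all.
    """
    path = os.path.join(cache_dir, symbol + '.csv')
    if os.path.exists(path):
        # cache hit: works fully offline
        return pd.read_csv(path, index_col=0, parse_dates=True).iloc[:, 0]
    if fetch is not None:
        series = fetch(symbol)
        os.makedirs(cache_dir, exist_ok=True)
        series.to_frame('returns').to_csv(path)
        return series
    raise IOError('no cached data for %r and no fetcher provided' % symbol)
```

This doesn't solve the staleness problem by itself, but it narrows it: frequent releases (or a periodic refresh of the cache) only matter for users plotting very recent dates.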

Specific Implementation Notes

Some of these are redundant with the prose above. I wrote these notes as I was reading and then went back and summarized. I also read tears.py and pos.py the most thoroughly, since I think tears.py is the most important public API, and pos.py is the next file I read before I realized I was spending too much time in the weeds. I'd like to read the rest of the codebase more thoroughly at some point, but I think @fawce and @KarenRubin will both have aneurysms from the delay that would cause to shipping https://github.com/quantopian/zipline/pull/630...

tears.py

ssanderson commented 9 years ago

Still working on this, just didn't want the browser to crash.

ssanderson commented 9 years ago

@twiecki @gusgordon @justinlent these are my main comments from reading today and yesterday.

TL/DR:

Most of these comments are negative b/c I'm focusing on the stuff that I think should change and I'm trying to go as fast as possible so that Zipline projects don't fall behind, so I do want to say that I think this is awesome work and I'm super excited to see it in the hands of users.

justinlent commented 9 years ago

I love these ideas @ssanderson . Definitely agree with exposing more informative docstrings, as well as making it more user friendly for people just analyzing backtest data (without live/out-of-sample data) since the tearsheet is still extraordinarily useful for this. Also totally agree that we should make it work in "offline mode" by removing the internet dependency (which is really only used to pull SPY data to compute beta, and the risk factors from the Fama-French site, I believe).

gusgordon commented 9 years ago

Thanks @ssanderson. A few comments/questions:

twiecki commented 9 years ago

I think we factored this out into individual issues that are mostly addressed now.