quantopian / pyfolio

Portfolio and risk analytics in Python
https://quantopian.github.io/pyfolio
Apache License 2.0
5.62k stars 1.76k forks

API Questions/Issues Meta-Issue #75

Closed: ssanderson closed this issue 9 years ago

ssanderson commented 9 years ago

Rather than open a billion issues at once, I'm going to open a single issue here. We can break out more substantive issues into separate pieces later if we want.

High Level Thoughts

Docstrings

We have lots of docstrings that read like this:

def get_portfolio_alloc(positions_vals):
    """
    Determines a portfolio's allocations.

    Parameters
    ----------
    positions_vals : pd.DataFrame
        Contains position values or amounts.

    Returns
    -------
    positions_alloc : pd.DataFrame
        Positions and their allocations.
    """

While it's awesome that we have docstrings (great job on enforcing this convention!), this isn't actually all that useful to either a user or a maintainer of this function, because it doesn't explain how inputs and outputs are to be represented. For example, we immediately have questions like: Is `positions_vals` indexed by date? Are its columns assets? Does it contain dollar values or share amounts?

This is uniquely a problem for PyFolio, because nearly all of its APIs are based around DataFrames, which allow for a great deal of flexibility in how one represents data. In many other libraries, identifying the types of inputs/outputs is enough. If I'm using a web framework and a function's docstring says "accepts a GET request object and returns a Response object", that's probably enough for me to understand how to use the function. This is decidedly not the case for us, and so we need to be careful to explain how we expect inputs and outputs to be structured.
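As a concrete illustration, here's what a more fully specified docstring for `get_portfolio_alloc` might look like. The index/column conventions and the normalization by gross exposure are my assumptions for the sketch, not settled pyfolio behavior:

```python
import pandas as pd

def get_portfolio_alloc(positions_vals):
    """
    Determine a portfolio's percent allocations.

    Parameters
    ----------
    positions_vals : pd.DataFrame
        Dollar value held in each position, one row per trading day.
        - index : pd.DatetimeIndex of trading days.
        - columns : one column per asset, plus a 'cash' column.
        - values : float net dollar exposure (negative for shorts).

    Returns
    -------
    positions_alloc : pd.DataFrame
        Same shape as `positions_vals`; each row is divided by that
        day's gross exposure, so absolute values in a row sum to 1.
    """
    # normalize each day's positions by that day's gross (absolute) exposure
    return positions_vals.divide(positions_vals.abs().sum(axis=1), axis=0)
```

Whatever the exact conventions end up being, spelling them out at this level of detail answers the "how is the DataFrame structured?" questions without the reader having to read the implementation.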

Live Trading vs Backtesting

Many of the top-level functions in tears.py expect a notion of a "live trading start date". In some cases there are multiple ways to specify this date, and often if the user doesn't specify the input then PyFolio chooses a date for the user. This suggests that PyFolio is only/primarily useful for analyzing simulations which have both in-sample and out-of-sample data. This happens to be precisely the data for which we originally developed these visualizations, but it's mostly not the data that I expect our users to have. It's certainly not data that's currently easy to retrieve on the Quantopian Research Platform. There are a few questions that follow from this:

  1. Is it possible/reasonable to use PyFolio just for analyzing backtests, or just for analyzing live algorithms? I think that the answer to this is yes, but I'm genuinely curious to hear about this.
  2. If the answer to (1) is "Yes", then how can we restructure these APIs to support pure in-sample or pure out-of-sample returns streams?
  3. If the answer to (1) is "No", then I wonder if we should re-evaluate whether it makes sense to distribute the library as widely as we've been planning. The set of people who have lots of backtest data that transitions into live trading data is more or less just our data science team, as far as I can tell. In particular, it seems like this wouldn't be as useful as originally thought on the research platform.
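To make (2) concrete, one option is to make the live trading start date an optional argument and treat its absence as "the whole stream is in-sample". A minimal sketch; the function name and signature here are hypothetical, not existing pyfolio API:

```python
import pandas as pd

def split_returns(returns, live_start_date=None):
    """Split a returns series into in-sample and out-of-sample pieces.

    If live_start_date is None (a pure backtest, or a pure live stream),
    the whole series is treated as in-sample and the OOS piece is empty,
    so downstream plots can simply skip their out-of-sample panels.
    """
    if live_start_date is None:
        return returns, returns.iloc[0:0]  # empty OOS slice, same dtype
    live_start_date = pd.Timestamp(live_start_date)
    in_sample = returns[returns.index < live_start_date]
    out_of_sample = returns[returns.index >= live_start_date]
    return in_sample, out_of_sample
```

With this shape, the tear-sheet functions never have to guess a split date on the user's behalf; they just render fewer panels when the OOS piece is empty.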

Relationship to Quantopian Research Platform APIs

There are a few "data extraction" functions that take data in the format stored in the BacktestResult object on Quantopian and convert it into a format expected by PyFolio for data analysis. I think these functions belong in the Quantopian codebase, not in PyFolio. This is for a few reasons:

  1. As an open source project, PyFolio shouldn't depend/rely on implementations of non-open-source software.
  2. I think it's more likely that Quantopian will change its internal data format than it is that PyFolio's analysis routines will change their expected input format. Since it's easier to synchronize changes within a single codebase than it is to synchronize across codebases, I'd rather make it Quantopian's responsibility to update the compatibility layer when changes are made to their internal data structures.
  3. Most importantly, if a Quantopian user wants to use PyFolio to make a standard visualization from their BacktestResult, the simplest possible API is for them to just do something like result = get_backtest(...); result.show_tearsheet(). This is only possible if the code that knows how to convert to PyFolio's representation lives inside the BacktestResult object.

Having the data conversion happen inside an internal API also means that PyFolio developers will have more freedom to change tearsheet APIs without breaking existing notebooks that have been shared.
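A sketch of what (3) could look like on the Quantopian side. The `BacktestResult` internals shown here are invented for illustration, and the tear-sheet entry point (`pyfolio.create_returns_tear_sheet`) is assumed; the point is only where the conversion layer lives:

```python
import pandas as pd

class BacktestResult(object):
    """Hypothetical wrapper around Quantopian's internal backtest format.

    The conversion layer lives here, in the Quantopian codebase, so
    pyfolio itself never depends on the internal representation.
    """

    def __init__(self, daily_performance):
        # internal format: assumed here to be a DataFrame with a
        # 'returns' column indexed by trading day (invented for the sketch)
        self._perf = daily_performance

    def to_pyfolio_returns(self):
        # adapt the internal format into the pd.Series pyfolio expects
        return self._perf['returns'].copy()

    def show_tearsheet(self):
        # imported lazily so the wrapper loads without pyfolio installed
        import pyfolio
        pyfolio.create_returns_tear_sheet(self.to_pyfolio_returns())
```

A user then writes `result = get_backtest(...); result.show_tearsheet()`, and Quantopian can change `_perf`'s layout without coordinating a pyfolio release.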

External Data Acquisition

I feel pretty strongly that if you have a returns stream on your local machine, then it should be possible to run the default tearsheets without an internet connection. In addition to improving the user experience, this has testing and reproducibility benefits. This probably means that we should package the most commonly used benchmarks in the data folder (possibly compressed if they're large). The challenge here is that the benchmark data needs to be updated to allow visualizations of backtests/live results with recent data. I don't have a great idea for solving this except for adding some sort of caching layer, or just doing frequent releases with updated benchmarks. Would be curious to hear thoughts from @twiecki and @ehebert on this.
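One shape the caching layer could take: check a local on-disk cache first, and only hit the network on a miss. Everything here (the cache path, function name, CSV layout) is a hypothetical sketch; a real version could additionally fall back to benchmarks packaged in the data folder when both cache and network are unavailable:

```python
import os
import pandas as pd

# hypothetical cache location; a real version would make this configurable
DEFAULT_CACHE_DIR = os.path.expanduser('~/.pyfolio/data')

def load_benchmark(symbol, fetch=None, cache_dir=DEFAULT_CACHE_DIR):
    """Load benchmark returns from a local cache, fetching only on a miss.

    `fetch` is an optional callable that hits the network; once the cache
    is warm, no connection is needed at all.
    """
    path = os.path.join(cache_dir, symbol + '.csv')
    if os.path.exists(path):
        # cache hit: works fully offline
        return pd.read_csv(path, index_col=0, parse_dates=True).iloc[:, 0]
    if fetch is not None:
        series = fetch(symbol)
        os.makedirs(cache_dir, exist_ok=True)
        series.to_frame('returns').to_csv(path)
        return series
    raise IOError('no cached data for %r and no fetcher provided' % symbol)
```

This doesn't solve the staleness problem by itself, but it narrows it: frequent releases (or a periodic refresh of the cache) only matter for users plotting very recent dates.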

Specific Implementation Notes

Some of these are redundant with the prose above. I wrote these notes as I was reading and then went back and summarized. I also read tears.py and pos.py the most thoroughly, since I think tears.py is the most important public API, and pos.py is the next file I read before I realized I was spending too much time in the weeds. I'd like to read the rest of the codebase more thoroughly at some point, but I think @fawce and @KarenRubin will both have aneurysms from the delay that would cause to shipping https://github.com/quantopian/zipline/pull/630...

tears.py

ssanderson commented 9 years ago

Still working on this, just didn't want the browser to crash.

ssanderson commented 9 years ago

@twiecki @gusgordon @justinlent these are my main comments from reading today and yesterday.

TL/DR:

Most of these comments are negative b/c I'm focusing on the stuff that I think should change and I'm trying to go as fast as possible so that Zipline projects don't fall behind, so I do want to say that I think this is awesome work and I'm super excited to see it in the hands of users.

justinlent commented 9 years ago

I love these ideas @ssanderson . Definitely agree with exposing more informative docstrings, as well as making it more user friendly for people just analyzing backtest data (without live/out-of-sample data) since the tearsheet is still extraordinarily useful for this. Also totally agree that we should make it work in "offline mode" by removing the internet dependency (which is really only used to pull SPY data to compute beta, and the risk factors from the Fama-French site, I believe).

gusgordon commented 9 years ago

Thanks @ssanderson. A few comments/questions:

twiecki commented 9 years ago

I think we factored this out into individual issues that are mostly addressed now.