jmrichardson / tuneta

Intelligently optimizes technical indicators and optionally selects the least intercorrelated for use in machine learning models
MIT License

Support for Custom features & Multiple tickers #3

Closed by QAQOAO 3 years ago

QAQOAO commented 3 years ago

Hi, awesome job! This repo really fits my needs.

Based on my initial understanding, I would like to ask two questions.

  1. From the given example, I think the dataframe format must be DATE + OHLCV.

However, the OHLCV columns are just the inputs to the technical indicators. If users can compute custom features themselves (also generated from OHLCV, but not library built-in technical indicators) and just want to see how those custom features correlate with a target variable, is this possible? Please give a short code example if it is already supported.

  2. From the given example, I think tuneta optimizes correlation one ticker at a time.

Thus, if users want to calculate correlations for multiple tickers, they can only run the example repeatedly with different tickers' data, and the feature sets suggested by tuneta may differ between runs.

As a result, it would be nice if tuneta could accept a dataframe in the format below, so a feature's importance can be evaluated not only across time periods but also across multiple tickers. The suggested features would also be identical across tickers, so users could use the data immediately after running tuneta once.

Finally, I'm not sure whether the idea is viable. Please correct me if my understanding is wrong.

  date        ticker  feature_a  feature_b  feature_c
  2019-01-05  A       ...        ...        ...
  2019-01-05  AAL     ...        ...        ...
  2019-01-05  AAPL    ...        ...        ...
  2019-01-06  A       ...        ...        ...
  2019-01-06  AAL     ...        ...        ...
  2019-01-06  AAPL    ...        ...        ...
  2019-01-07  A       ...        ...        ...
  2019-01-07  AAL     ...        ...        ...
  2019-01-07  AAPL    ...        ...        ...
  ...         ...     ...        ...        ...
  2020-01-07  A       ...        ...        ...
  2020-01-07  AAL     ...        ...        ...
  2020-01-07  AAPL    ...        ...        ...
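For reference, a minimal pandas sketch of the long-format layout proposed above (one row per date/ticker pair; the feature names are taken from the example and the values are placeholders):

```python
import pandas as pd

# Build the (date, ticker) long format with feature columns alongside.
dates = pd.to_datetime(["2019-01-05", "2019-01-06", "2019-01-07"])
tickers = ["A", "AAL", "AAPL"]
idx = pd.MultiIndex.from_product([dates, tickers], names=["date", "ticker"])

# Placeholder values; real data would hold the computed features.
panel = pd.DataFrame(0.0, index=idx,
                     columns=["feature_a", "feature_b", "feature_c"])
print(panel.shape)  # 9 rows (3 dates x 3 tickers), 3 feature columns
```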

Thanks a lot!

jmrichardson commented 3 years ago

Currently, it does not support custom functions, but that is a good suggestion. My time is limited, but I will have a look soon and put together an example. Do you have a sample custom feature that I could use?

I am not sure I follow you on question 2. Do you mean applying indicators to each ticker, but then pruning based on the correlation of all the indicators to a single target variable? For example, if each ticker is a subset of the S&P and the target is based on the S&P, you would like to apply indicators to the entire set of tickers and then select/report based on all of them?

QAQOAO commented 3 years ago

For question 1, I think we should let users calculate custom features on their own, so we don't have to deal with custom functions; there are too many to implement.

All we need is to take their precomputed custom features as input, just like OHLCV, and optimize them directly without computing technical indicators as before.

Take LPPLS as a sample custom feature, though it could be anything:

https://github.com/Boulder-Investment-Technologies/lppls

This LPPLS model takes the close price as input and outputs positive and negative bubble probabilities, so two feature columns are added.
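A wrapper along these lines could expose the two bubble-probability columns. Note that the internals below are a placeholder, not the real lppls package's API; only the two-column output shape mirrors what is described above:

```python
import numpy as np
import pandas as pd

def lppls_features(close: pd.Series) -> pd.DataFrame:
    """Hypothetical wrapper: map close prices to positive/negative
    bubble-probability columns, mirroring the lppls outputs described
    above. The body is a stand-in, not the real LPPLS model fit."""
    trend = close.pct_change().rolling(20).mean().fillna(0.0)
    scale = trend.abs().max() or 1.0     # avoid divide-by-zero
    pos = trend.clip(lower=0.0) / scale  # pseudo positive-bubble score
    neg = (-trend).clip(lower=0.0) / scale
    return pd.DataFrame({"pos_bubble_prob": pos, "neg_bubble_prob": neg})

close = pd.Series(np.linspace(100, 120, 60))
feats = lppls_features(close)
print(feats.columns.tolist())
```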

For question 2, what I want is just to expand the current version into a multiple tickers version.

The common situation is that I have a universe of 500 stocks and their OHLCV data, and I want all of the tickers to share the same recommended feature set from training tuneta once.

Then, I calculate each ticker's "percent_return" as its target variable, for instance.

I prepare my dataframe as described above, meaning that tuneta should be able to identify the "ticker" column and optimize across multiple stocks simultaneously.

Finally, I run your tuneta example to obtain important feature sets for all 500 stocks.
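The workflow described above could be sketched as a per-ticker loop followed by stacking into the long date/ticker layout. The `run_tuneta` function here is a hypothetical stand-in for tuneta's fit/transform step, not its real API:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

def run_tuneta(ohlcv: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for fitting tuneta on one ticker and
    returning its optimized feature columns."""
    out = pd.DataFrame(index=ohlcv.index)
    out["feat_rsi_like"] = ohlcv["close"].pct_change().rolling(14).mean()
    return out

# One OHLCV frame per ticker (3 here standing in for 500), looped.
frames = {}
for ticker in ["A", "AAL", "AAPL"]:
    ohlcv = pd.DataFrame({"close": 100 + rng.normal(0, 1, 50).cumsum()})
    feats = run_tuneta(ohlcv)
    feats["target"] = ohlcv["close"].pct_change().shift(-1)  # percent return
    frames[ticker] = feats

# Stack into the long ticker/date layout proposed earlier.
panel = pd.concat(frames, names=["ticker"])
print(panel.shape)  # 150 rows (3 tickers x 50 bars), 2 columns
```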

Thanks for the quick response!

jmrichardson commented 3 years ago

Unfortunately, tuneta is not designed that way: it tries to optimize the parameters of the indicators/functions according to correlation, and those optimized functions are then persisted for future data sets. So, in your example, if you provide a dataframe with already existing features, tuneta could pass those through to be pruned via correlation; however, adding those same features to test and live production data will not work, as it is external to tuneta. This breaks the persistence feature, which is important for deploying tuneta to production.

Tuneta could be updated to handle custom features as part of the optimization process. For example, you could put a wrapper around lppls, essentially a function with or without arguments, and it would be passed like any of the other packages tuneta supports. However, since lppls is a fitted model, it would need to maintain state in its definition. That is, the fit of lppls would need to be available to transform test and production data. If this sounds interesting, I could put it on the todo list. I was thinking about this as a way to add other packages that don't have functional definitions, such as stockstats.
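The fit/transform state-keeping described above could look something like the following. This is a hypothetical interface, not part of tuneta; the class name, window parameter, and rolling-mean fit are all illustrative:

```python
import pandas as pd

class StatefulFeature:
    """Sketch of a custom feature that keeps fitted state so the same
    transform can be applied to test/production data, the way a fitted
    model like lppls would need to."""

    def __init__(self, window: int = 20):
        self.window = window  # tuneable parameter
        self.mean_ = None     # fitted state, set by fit()

    def fit(self, close: pd.Series) -> "StatefulFeature":
        # Persist a statistic from the training data.
        self.mean_ = close.rolling(self.window).mean().iloc[-1]
        return self

    def transform(self, close: pd.Series) -> pd.Series:
        # Reuse the training-time state on new (test/live) data.
        return close / self.mean_ - 1.0

train = pd.Series(range(1, 41), dtype=float)  # 1.0 .. 40.0
live = pd.Series([45.0, 50.0])
feat = StatefulFeature(window=20).fit(train)
print(feat.transform(live).tolist())
```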

Clarifying questions for #2,

Of the 500 stocks, would tuneta apply the indicators and optimize each ticker individually for its respective target variable? How would correlation pruning work: individually or across the whole set? Currently, tuneta computes a correlation/fitness for each feature with respect to its target for a single stock; iteratively, strongly intercorrelated features are removed. Would this just happen for each stock in the universe? Sorry, I'm having a hard time visualizing how this differs from a parallelized loop over the 500 stocks, then using the 500 returned fitted objects to create the combined dataset.

QAQOAO commented 3 years ago

Got it! So in order to make tuneta work for custom features, we need to provide it with a respective function to optimize, don't we?

Yeah, it definitely sounds interesting. Currently, lppls can be handled by the solution you described. But in general, I think tuneta might need to offer something like torch's Dataset class (in our case, a function) that users can inherit from and customize. Is this idea useful?

Alright, I think doing a loop for each of the 500 stocks is OK, but what if the suggested feature sets differ, so that some feature columns would be NaN after concatenating the 500 dataframes into one large frame with date and ticker as the index and the feature set as columns? For example, MACD is in the suggested feature set of tickers A and C, but not of tickers B and D. As a result, we could only train a model for each ticker one by one. If this situation can be avoided, then question 2 is solved.

jmrichardson commented 3 years ago

Yes, the custom feature would need to have a function definition so tuneta can understand the parameters required to tune it. I think the simplest approach would be to create a module with function(s) for the custom features. For example, tuneta could try to import a "custom" module if it exists (keeping the name generic so it only looks for this module) and include any functions in that module as tuneable indicators.
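The discovery mechanism described above could be sketched as follows. The module name "custom" and the whole approach are assumptions from this discussion, not tuneta's actual implementation; a fake in-memory module stands in for a user-supplied file:

```python
import importlib
import inspect
import sys
import types

def load_custom_indicators(module_name: str = "custom") -> dict:
    """Try to import a user-supplied module and collect every function
    it defines as a candidate tuneable indicator (hypothetical)."""
    try:
        mod = importlib.import_module(module_name)
    except ModuleNotFoundError:
        return {}
    return {
        name: fn
        for name, fn in inspect.getmembers(mod, inspect.isfunction)
        if fn.__module__ == mod.__name__  # skip re-exported imports
    }

# Simulate a user's custom.py without touching the filesystem.
fake = types.ModuleType("custom")
exec("def my_feature(close, window=10):\n    return close", fake.__dict__)
sys.modules["custom"] = fake

indicators = load_custom_indicators()
print(sorted(indicators))  # ['my_feature']
```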

I think I understand what you are looking for in the second one: you want to create indicator(s) that optimize on all of the tickers as a whole. For example, if we are looking for only RSI, you want to tune it such that the average correlation across all tickers is maximized. In the end, you would have a single RSI feature? If that is what you are looking for, I haven't thought about it this way before, as a technical indicator is designed to run and be tuned on a single series. Tuning the parameter(s) across multiple series is interesting...

If you did the loop method, the result would be an RSI feature for each of the tickers in a sparse dataframe (NaNs for all except one ticker).
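The sparsity problem from per-ticker tuning can be seen in a tiny example: when each ticker's selected feature set differs, an outer-join concatenation leaves NaN blocks (the tickers and values here are made up):

```python
import pandas as pd

# Per-ticker results with different selected feature sets.
a = pd.DataFrame({"MACD": [0.1, 0.2], "RSI": [50.0, 55.0]})
b = pd.DataFrame({"RSI": [48.0, 52.0]})  # MACD was not selected here

# Concatenation outer-joins the columns, so ticker B's MACD is NaN.
panel = pd.concat({"A": a, "B": b}, names=["ticker"])
print(int(panel["MACD"].isna().sum()))  # 2: ticker B has no MACD values
```

Restricting every ticker to a shared feature set (e.g. the intersection of the selected columns) before concatenating would avoid these gaps, which is essentially what the multi-ticker request asks tuneta to do internally.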

QAQOAO commented 3 years ago

Yeah, if I were looking for only RSI, the returned feature set could be identical. But actually, what I am looking for most is described as follows.

First, I prepare a dataframe with OHLCV data for each ticker and do the loop method. Since I am not sure which technical indicators in TA-Lib work, I set indicators=['tta'] to let tuneta choose for me. Then, I concatenate the 500 dataframes (only the added feature columns are used) to prepare for training a universe market-prediction model.

In sum, I want tuneta to try all indicators for all tickers and select only the best features, while keeping the suggested features the same across all 500 tickers, so there would not be many NaNs when concatenating them (except for NaNs like the first 14 rows of RSI). It seems the idea cannot be implemented right now, but I hope you will take it into consideration, because I think this could be a really strong feature if supported.

Thank you for your patience.

jmrichardson commented 3 years ago

Let me look into this. I have limited time but will see if it is possible.

QAQOAO commented 3 years ago

OK, take your time.

I have another awesome technical analysis library that is relatively easy to use as well.

It has some indicators that other libraries do not have, so it may be a good complement to other libraries that are supported currently.

You can check it out if you have time.

jmrichardson commented 3 years ago

@QAQOAO

Sorry for the late response on this one; I really didn't fully understand the ask until now (especially since I am running into the exact same issues myself for both of your asks).

  1. I will add an external function in tuneta and provide some feedback on how to use it for both target correlation and intercorrelation.
  2. I also need the same functionality (I have been using qlib and noticed it has a very similar dataset to what you described. Actually, I am not going to use qlib going forward, as I think it adds too many layers of complexity where they aren't needed, but it is a great package). I have another repo I created that builds the dataset you described, but when I tried to use TuneTA on it, I ran into the exact same issues (e.g., different column names and others).

So, I need to spend some time determining how to architect the solution into tuneta. At the moment, I think I need to be able to process the entire dataset as a whole, accounting for how TA indicators need leading data to calculate. I may need to create another repo to handle just this use case, as it may introduce some issues, but I will try to keep it as one package.

I just recently made some major changes to tuneta, such as improving the parameter-selection methodology and adding a new correlation metric. Especially since I need both 1 and 2 of your feature request myself, I will definitely be working to add them :)

Also, have you been able to use any other package to accomplish this goal?

jmrichardson commented 3 years ago

The main branch now supports both existing features and multiple tickers. See the readme.