microsoft / qlib

Qlib is an AI-oriented quantitative investment platform that aims to realize the potential, empower research, and create value using AI technologies in quantitative investment, from exploring ideas to implementing productions. Qlib supports diverse machine learning modeling paradigms, including supervised learning, market dynamics modeling, and reinforcement learning (RL).
https://qlib.readthedocs.io/en/latest/
MIT License
15.45k stars 2.64k forks

[Underlying Operator] How do you call a method with underlying arguments? #1346

Open xiangqingcs opened 1 year ago

xiangqingcs commented 1 year ago

🌟 Feature Description

Project version: 0.8.6
Python version: 3.8
pandas version: 1.3.4
numpy version: 1.21.4

Code Location: qlib\data\ops.py L1439 -- L1470 L1458

class Corr(PairRolling):
    """Rolling Correlation

    Parameters
    ----------
    feature_left : Expression
        feature instance
    feature_right : Expression
        feature instance
    N : int
        rolling window size

    Returns
    ----------
    Expression
        a feature instance with rolling correlation of two input features
    """

    def __init__(self, feature_left, feature_right, N):
        super(Corr, self).__init__(feature_left, feature_right, N, "corr")

    def _load_internal(self, instrument, start_index, end_index, *args):
        res: pd.Series = super(Corr, self)._load_internal(instrument, start_index, end_index, *args)

        # NOTE: Load uses MemCache, so calling load again will not cause performance degradation
        series_left = self.feature_left.load(instrument, start_index, end_index, *args)
        series_right = self.feature_right.load(instrument, start_index, end_index, *args)
        res.loc[
            np.isclose(series_left.rolling(self.N, min_periods=1).std(), 0, atol=2e-05)
            | np.isclose(series_right.rolling(self.N, min_periods=1).std(), 0, atol=2e-05)
        ] = np.nan
        return res

Code Location: qlib\data\ops.py L1387 -- L1405 L1404

    def _load_internal(self, instrument, start_index, end_index, *args):
        assert any(
            [isinstance(self.feature_left, Expression), self.feature_right, Expression]
        ), "at least one of two inputs is Expression instance"

        if isinstance(self.feature_left, Expression):
            series_left = self.feature_left.load(instrument, start_index, end_index, *args)
        else:
            series_left = self.feature_left  # numeric value
        if isinstance(self.feature_right, Expression):
            series_right = self.feature_right.load(instrument, start_index, end_index, *args)
        else:
            series_right = self.feature_right

        if self.N == 0:
            series = getattr(series_left.expanding(min_periods=1), self.func)(series_right)
        else:
            series = getattr(series_left.rolling(self.N, min_periods=1), self.func)(series_right)
        return series

The core code is as follows: series = getattr(series_left.rolling(self.N, min_periods=1), self.func)(series_right)

I wrote my own validation code:

import pandas as pd

lst_1 = [1, 3, 5, 6, 10, 23]
lst_2 = [10, 31, 15, 7, 9, 3]
series_left = pd.Series(lst_1, index=["A", "B", "C", "D", "E", "F"])  # Set the explicit index
series_right = pd.Series(lst_2, index=["A", "B", "C", "D", "E", "F"])  # Set the explicit index

# Solution One: call the getattr() result through a wrapper function
def run(series_left, func, *args):
    series_2 = getattr(series_left.rolling(6, min_periods=1), func)(*args)
    print("[#######] args [{}], spearman [{}]".format(args, series_2))

run(series_left, "corr", series_right)
run(series_left, "corr", series_right, "spearman")

# Solution Two: call the getattr() result directly
series = getattr(series_left.rolling(6, min_periods=1), "corr")(series_right)
print("[1] pearson [{}]".format(series))

series = getattr(series_left.rolling(6, min_periods=1), "corr")(series_right, "spearman")
print("[2] spearman [{}]".format(series))

# Solution Three: call the underlying pandas Series.corr method directly
print("[3] pearson [{}]".format(series_left.corr(series_right)))
print("[4] spearman [{}]".format(series_left.corr(series_right, method="spearman")))
print("[5] spearman [{}]".format(series_left.corr(series_right, "spearman")))

Computational Results:

[#######] args [(A    10
B    31
C    15
D     7
E     9
F     3
dtype: int64,)], spearman [A         NaN
B    1.000000
C    0.227901
D   -0.228543
E   -0.378579
F   -0.563069
dtype: float64]

[#######] args [(A    10
B    31
C    15
D     7
E     9
F     3
dtype: int64, 'spearman')], spearman [A         NaN
B    1.000000
C    0.227901
D   -0.228543
E   -0.378579
F   -0.563069
dtype: float64]

[1] pearson [A         NaN
B    1.000000
C    0.227901
D   -0.228543
E   -0.378579
F   -0.563069
dtype: float64]

[2] spearman [A         NaN
B    1.000000
C    0.227901
D   -0.228543
E   -0.378579
F   -0.563069
dtype: float64]

[3] pearson [-0.5630687466927193]
[4] spearman [-0.7714285714285715]
[5] spearman [-0.7714285714285715]

Analysis Result:

| Code Location | Result Output Mark | Result Output |
| --- | --- | --- |
| L28/L32/L40 | []/[1]/[3] | -0.563069/-0.563069/-0.5630687466927193 |
| L29/L36/L41/L42 | []/[2]/[4]/[5] | -0.563069/-0.563069/-0.7714285714285715/-0.7714285714285715 |

Conclusion:

### In getattr(series_left.rolling(6, min_periods=1), "corr")(series_right, "spearman"), the 'spearman' argument has no effect: the rolling output is identical to the Pearson result. How do I fix this?
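
One possible workaround, offered as a sketch under assumptions and not as Qlib code: as far as I can tell, pandas' rolling corr computes only the Pearson coefficient, so a rolling Spearman value has to be computed window by window. The helper name rolling_spearman is hypothetical. It assumes both series share the same unique index and relies on the fact that the Spearman coefficient is the Pearson coefficient of the ranks.

```python
import numpy as np
import pandas as pd


def rolling_spearman(left: pd.Series, right: pd.Series, window: int) -> pd.Series:
    """Hypothetical helper: Spearman correlation over a trailing window.

    Assumes `left` and `right` share the same unique index. Windows with
    fewer than two observations yield NaN.
    """
    values = []
    for end in range(1, len(left) + 1):
        start = max(0, end - window)
        l_win = left.iloc[start:end]
        r_win = right.iloc[start:end]
        if len(l_win) < 2:
            values.append(np.nan)
            continue
        # Spearman = Pearson correlation of the within-window ranks.
        values.append(l_win.rank().corr(r_win.rank()))
    return pd.Series(values, index=left.index)


idx = ["A", "B", "C", "D", "E", "F"]
series_left = pd.Series([1, 3, 5, 6, 10, 23], index=idx)
series_right = pd.Series([10, 31, 15, 7, 9, 3], index=idx)

rolling_rho = rolling_spearman(series_left, series_right, window=6)
print(rolling_rho)
```

With a window of 6 the last value covers all six points, so it should agree with result [4] above (about -0.7714). The explicit Python loop is slow on long series; it is meant to show the idea, not to be an optimized implementation.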

xiangqingcs commented 1 year ago

The pandas Series.corr method:

    def corr(self, other, method="pearson", min_periods=None) -> float:
        """
        Compute correlation with `other` Series, excluding missing values.

        Parameters
        ----------
        other : Series
            Series with which to compute the correlation.
        method : {'pearson', 'kendall', 'spearman'} or callable
            Method used to compute correlation:

            - pearson : Standard correlation coefficient
            - kendall : Kendall Tau correlation coefficient
            - spearman : Spearman rank correlation
            - callable: Callable with input two 1d ndarrays and returning a float.

            .. warning::
                Note that the returned matrix from corr will have 1 along the
                diagonals and will be symmetric regardless of the callable's
                behavior.
        min_periods : int, optional
            Minimum number of observations needed to have a valid result.

        Returns
        -------
        float
            Correlation with other.

        See Also
        --------
        DataFrame.corr : Compute pairwise correlation between columns.
        DataFrame.corrwith : Compute pairwise correlation with another
            DataFrame or Series.

        Examples
        --------
        >>> def histogram_intersection(a, b):
        ...     v = np.minimum(a, b).sum().round(decimals=1)
        ...     return v
        >>> s1 = pd.Series([.2, .0, .6, .2])
        >>> s2 = pd.Series([.3, .6, .0, .1])
        >>> s1.corr(s2, method=histogram_intersection)
        0.3
        """
        this, other = self.align(other, join="inner", copy=False)
        if len(this) == 0:
            return np.nan

        if method in ["pearson", "spearman", "kendall"] or callable(method):
            return nanops.nancorr(
                this.values, other.values, method=method, min_periods=min_periods
            )

        raise ValueError(
            "method must be either 'pearson', "
            "'spearman', 'kendall', or a callable, "
            f"'{method}' was supplied"
        )
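
For comparison, the method reached through getattr(series_left.rolling(...), "corr") is the rolling-window corr, not the Series.corr quoted above. The check below is a small sketch, assuming a pandas version comparable to the one in this report: it prints the parameter names of both methods. In the versions I know of, the rolling method takes other, pairwise, and ddof and has no method parameter, so a second positional argument such as "spearman" is bound to pairwise and the result stays Pearson.

```python
import inspect

import pandas as pd

idx = ["A", "B", "C", "D", "E", "F"]
series_left = pd.Series([1, 3, 5, 6, 10, 23], index=idx)
series_right = pd.Series([10, 31, 15, 7, 9, 3], index=idx)

roller = series_left.rolling(6, min_periods=1)

# Parameter names of the two different methods that share the name "corr".
rolling_params = list(inspect.signature(roller.corr).parameters)
series_params = list(inspect.signature(series_left.corr).parameters)
print("Rolling.corr:", rolling_params)
print("Series.corr: ", series_params)

# The second positional slot of the rolling method is `pairwise`, so
# "spearman" never reaches a correlation-method switch.
pearson = roller.corr(series_right)
with_extra_arg = roller.corr(series_right, "spearman")
print("identical results:", pearson.equals(with_extra_arg))
```

This matches the observation that outputs [1] and [2] are the same series.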

How do you pass additional arguments to an operator in Qlib ops, such as the method="pearson" and min_periods=None parameters of corr?
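
One way to picture the request is an operator that stores extra keyword arguments and forwards them to the underlying pandas call. The class below is a stand-alone illustration in plain pandas, not Qlib code: PairRollingSketch and its func_kwargs parameter are hypothetical names, and the sketch leaves out everything Qlib's real PairRolling does around Expression loading. Because the rolling corr has no method parameter, the example forwards ddof to the rolling cov instead.

```python
import pandas as pd


class PairRollingSketch:
    """Hypothetical stand-in that mimics the getattr-based dispatch quoted earlier."""

    def __init__(self, left, right, N, func, **func_kwargs):
        self.left = left
        self.right = right
        self.N = N
        self.func = func
        # Extra keyword arguments to forward to the pandas method.
        self.func_kwargs = func_kwargs

    def compute(self) -> pd.Series:
        # N == 0 means an expanding window, as in the quoted _load_internal.
        if self.N == 0:
            roller = self.left.expanding(min_periods=1)
        else:
            roller = self.left.rolling(self.N, min_periods=1)
        return getattr(roller, self.func)(self.right, **self.func_kwargs)


idx = ["A", "B", "C", "D", "E", "F"]
series_left = pd.Series([1, 3, 5, 6, 10, 23], index=idx)
series_right = pd.Series([10, 31, 15, 7, 9, 3], index=idx)

# Default ddof (sample covariance) versus a forwarded ddof=0 (population covariance).
cov_sample = PairRollingSketch(series_left, series_right, 6, "cov").compute()
cov_population = PairRollingSketch(series_left, series_right, 6, "cov", ddof=0).compute()
print(cov_sample.iloc[-1], cov_population.iloc[-1])
```

Forwarding by keyword avoids the positional mix-up seen with "spearman", since an unsupported name would be passed explicitly instead of landing in an unrelated positional slot.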