Add the integral and rate-of-change of sentiment scores to the feature set

shobrook commented 6 years ago

@alichtman Are these even the right transforms for this kinda data? Did you tell your Dad that sentiment was a nominal value?

alichtman commented 6 years ago

Derivative Analysis -- Change in sentiment from day to day. More days == more smoothing.

Integral Analysis -- Area under the curve. Shows accumulated sentiment. Higher number means more general positive sentiment over a period of time.

shobrook commented 6 years ago

I know what derivatives and integrals are haha. I'm saying that sentiment is a categorical variable, not a continuous one, so the derivative will always be zero.

alichtman commented 6 years ago

We're looking at it over a period of time (we also need to use some regression testing to pick the best length), so the derivative won't be zero, right?

shobrook commented 6 years ago

No, it'll be zero. When you plot sentiment over time, it's going to be a series of horizontal lines at different y-values.

shobrook commented 6 years ago

But I just realized this isn't a problem if we use the weighted average sentiment values (since they're continuous).

shobrook commented 6 years ago

So disregard :^)

alichtman commented 6 years ago

So, first step is going to be estimating a function for the average sentiment over each x day period, and then either:

Estimate definite integral from t -> t+x, where t+x is the most recent day in the analysis window. I'm going to use Simpson's Method.
Take some average of the derivatives in the analysis window

For 1, this SO post pretty much covers it. Not 100% sure what to do about 2 yet.

alichtman commented 6 years ago

Just committed the integral analysis. Opening PR, but trying to get derivative done tonight too.

Current thinking on 2. -> We have the data in the form of a list of average daily sentiment scores. I wrote a function to split them into a list of lists that would represent a sliding window. I could write a function to estimate a function for each list and then take the derivative at each day and average them from that.

I haven't seen any libraries that have this capability yet, but I'm going to keep poking around for a little bit.

Edit

https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html

This seems like a decent option. This is my current working version, based mostly on the scipy examples. I don't understand where the ydata is supposed to come from. Ideas?

def func(x, a, b, c):
    """https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html"""
    return a * np.exp(-b * x) + c

def best_fit_curve(xdata):
    """Takes list of sentiment scores as input and returns a function of best fit.
    https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html"""

    # Fit for the parameters a, b, c of the function func
    popt, pcov = curve_fit(func, xdata, ydata)

    # Constrain the optimization to the region of 0 <= a <= 3, 0 <= b <= 1 and 0 <= c <= 0.5:
    popt, pcov = curve_fit(func, xdata, ydata, bounds=(0, [3., 1., 0.5]))

def sliding_window_avg_derivative(avg_daily_sentiment, interval):
    """Takes list of average daily sentiment scores and returns a list of average derivatives."""

    # Split into sliding window list of lists
    sentiment_windows = sliding_window(avg_daily_sentiment, interval)

    # Fit a function to each of the lists in the list.
    best_fit_funcs = []
    for window in sentiment_windows:
        best_fit_funcs.append(best_fit_curve(window))

    # Take derivative at each day in the interval and average them.
    # -> Transform each function into a list of days-in-interval-many derivatives.
    # -> Then take one average for each list in sentiment_windows.

    # Return list of avg_derivatives

shobrook commented 6 years ago

The xdata is time, the ydata is sentiment.

shobrook commented 6 years ago

Okay, so... let's ditch the derivative analysis and just do rate of change :^)

shobrook commented 6 years ago

Actually, since we're adding lag variables (i.e. we'll have sentiment at t, t-1, t-2, and t-3 in our feature set), is the derivative analysis really necessary? It's meant to show how sentiment is changing over time, but the lag variables already do that.

I'm gonna close this for now.

shobrook / BitVision

Add the integral and rate-of-change of sentiment scores to the feature set #50