Closed: shobrook closed this issue 6 years ago
Derivative Analysis -- Change in sentiment from day to day. More days == more smoothing.
Integral Analysis -- Area under the curve. Shows accumulated sentiment. A higher value means more net positive sentiment over the period.
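To make the integral concrete, here's a quick sketch using the trapezoidal rule for the area under the curve. The daily scores are made up, and it assumes sentiment arrives as one averaged score per day:

```python
# Hypothetical daily average sentiment scores over one week
daily_sentiment = [0.2, 0.5, 0.4, -0.1, 0.3, 0.6, 0.2]

# Trapezoidal rule: area under the sentiment curve, treating days as unit-spaced x values
accumulated = sum((y0 + y1) / 2 for y0, y1 in zip(daily_sentiment, daily_sentiment[1:]))
# accumulated ≈ 1.9 -> net positive sentiment over the week
```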
I know what derivatives and integrals are haha. I'm saying that sentiment is a categorical variable, not a continuous one, so the derivative will always be zero.
We're looking at it over a period of time (we also need to use some regression testing to pick the best length), so the derivative won't be zero, right?
No, it'll be zero. When you plot sentiment over time, it's going to be a series of horizontal lines at different y-values.
But I just realized this isn't a problem if we use the weighted average sentiment values (since they're continuous).
So disregard :^)
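For what it's worth, a sketch of what I mean by continuous weighted averages. It assumes the classifier gives a (polarity, confidence) pair per comment, which is a guess at the actual output format:

```python
# Hypothetical classifier output: (polarity, confidence) per comment on a given day
comments = [(1, 0.9), (-1, 0.4), (1, 0.6), (0, 0.8)]

# Confidence-weighted average polarity -> a continuous score in [-1, 1]
weighted_avg = sum(p * c for p, c in comments) / sum(c for _, c in comments)
```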
So, the first step is going to be estimating a function for the average sentiment over each x-day period, and then either:
For 1, this SO post pretty much covers it. Not 100% sure what to do about 2 yet.
Just committed the integral analysis. Opening PR, but trying to get derivative done tonight too.
Current thinking on 2: we have the data as a list of average daily sentiment scores. I wrote a function to split it into a list of lists representing a sliding window. I could then estimate a best-fit function for each window, take its derivative at each day, and average those derivatives.
I haven't seen any libraries that have this capability yet, but I'm going to keep poking around for a little bit.
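(The sliding_window function I mentioned above isn't pasted in this thread; a minimal version looks something like this:)

```python
def sliding_window(scores, interval):
    """Split a list of scores into overlapping windows of length `interval`."""
    return [scores[i:i + interval] for i in range(len(scores) - interval + 1)]

sliding_window([1, 2, 3, 4], 2)  # → [[1, 2], [2, 3], [3, 4]]
```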
Edit
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
This seems like a decent option. This is my current working version, based mostly on the scipy examples. I don't understand where the ydata is supposed to come from. Ideas?
```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    """https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html"""
    return a * np.exp(-b * x) + c

def best_fit_curve(ydata):
    """Takes a list of sentiment scores (the ydata) and returns the fitted
    parameters a, b, c of func.
    https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html"""
    xdata = np.arange(len(ydata))  # time, measured in days
    # Fit for the parameters a, b, c of the function func, constraining the
    # optimization to the region 0 <= a <= 3, 0 <= b <= 1 and 0 <= c <= 0.5
    popt, pcov = curve_fit(func, xdata, ydata, bounds=(0, [3., 1., 0.5]))
    return popt
```
```python
def sliding_window_avg_derivative(avg_daily_sentiment, interval):
    """Takes a list of average daily sentiment scores and returns a list of
    average derivatives, one per window."""
    # Split into a sliding window list of lists
    sentiment_windows = sliding_window(avg_daily_sentiment, interval)
    avg_derivatives = []
    for window in sentiment_windows:
        # Fit a function to each of the lists in the list
        a, b, c = best_fit_curve(window)
        days = np.arange(len(window))
        # Take the derivative at each day in the interval:
        # d/dx [a * exp(-b * x) + c] = -a * b * exp(-b * x)
        derivatives = -a * b * np.exp(-b * days)
        # ...and average them
        avg_derivatives.append(np.mean(derivatives))
    return avg_derivatives
```
The xdata is time, the ydata is sentiment.
Okay, so... let's ditch the derivative analysis and just do rate of change :^)
Actually, since we're adding lag variables (i.e. we'll have sentiment at t, t-1, t-2, and t-3 in our feature set), is the derivative analysis really necessary? It's meant to show how sentiment is changing over time, but the lag variables already do that.
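For reference, building those lag variables is trivial; a sketch in plain Python, with made-up scores:

```python
def lag_features(scores, max_lag=3):
    """For each day t with enough history, return the row
    [score_t, score_{t-1}, ..., score_{t-max_lag}]."""
    return [[scores[t - k] for k in range(max_lag + 1)]
            for t in range(max_lag, len(scores))]

lag_features([0.2, 0.5, 0.4, -0.1, 0.3])
# → [[-0.1, 0.4, 0.5, 0.2], [0.3, -0.1, 0.4, 0.5]]
```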
I'm gonna close this for now.
@alichtman Are these even the right transforms for this kinda data? Did you tell your Dad that sentiment was a nominal value?