mjockers / syuzhet

An R package for the extraction of sentiment and sentiment-based plot arcs from text

Add padding to pre-fft sentiment and preserve the structure of the fft results in get_transformed_values. #10

Closed tmmcguire closed 9 years ago

tmmcguire commented 9 years ago

This patch should address two problems:

  1. The current get_transformed_values produces a time-aliasing artefact because the incoming sentiment data is not padded to be long enough to hold the entire result of the fft/filter/inverse fft. (This is the reason the two ends of the resulting curve are forced to have the same y-value.) This patch adds a padding factor, with padding based on the length of the sentiment data. A factor of 1.25 seems a reasonable default.
  2. This patch also modifies the low-pass filter step to preserve the structure of the filtered frequency data, that is, the conjugate symmetry that the spectrum of real-valued input has. Without this, the inverse_values are not guaranteed to have insignificant imaginary components, although that does not seem to cause problems in practical use.
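
The two points above can be illustrated with a minimal sketch in base R. This is not the code from the patch: `padded_low_pass` is a hypothetical helper, zero-padding at the end of the series is an assumption, and `padding_factor` is taken here to mean the ratio of padded length to original length.

```r
# Hypothetical helper for illustration only; not the package's implementation.
padded_low_pass <- function(values, low_pass_size = 3, padding_factor = 2) {
  n <- length(values)
  padded_n <- ceiling(n * padding_factor)
  stopifnot(padding_factor >= 1, 2 * low_pass_size + 1 <= padded_n)

  # Assumption: pad with zeros at the end so the filtered result has room
  # to spread without wrapping around onto the start of the series.
  padded <- c(values, rep(0, padded_n - n))
  spectrum <- fft(padded)

  # Keep the DC term and the first low_pass_size positive-frequency bins,
  # plus the mirrored negative-frequency bins at the end of the vector.
  # Keeping both halves preserves the conjugate symmetry of the spectrum.
  keep <- rep(FALSE, padded_n)
  keep[seq_len(low_pass_size + 1)] <- TRUE
  keep[padded_n - seq_len(low_pass_size) + 1] <- TRUE
  spectrum[!keep] <- 0

  # R's inverse fft is unnormalized, so divide by the length.
  inverse <- fft(spectrum, inverse = TRUE) / padded_n

  # Strip the padding; report the largest imaginary part as a sanity check.
  list(values = Re(inverse)[seq_len(n)],
       max_imaginary = max(abs(Im(inverse))))
}

# Usage on synthetic data (a random walk standing in for sentiment values).
set.seed(1)
sentiment <- cumsum(rnorm(200))
result <- padded_low_pass(sentiment, low_pass_size = 3, padding_factor = 2)
result$max_imaginary  # should be at the level of floating-point noise
```

Zeroing only the high-index bins while keeping the low-index ones would break the symmetry, and the inverse transform would then generally carry non-negligible imaginary parts.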

tmmcguire commented 9 years ago

See http://www.crsr.net/files/Exploring_Syuzhet.html for the long version of the story.

Note: I haven't been able to fully test the actual patch: I'm getting errors like "Error in FUN(X[[1L]], ...) : object 'bing' not found" in RStudio when I try to call get_sentiment.

Also, I have no idea how or if I need to update all the associated files in the project.

tmmcguire commented 9 years ago

I have updated the pull request to increase the default padding size to 2 (more seems better, to avoid time aliasing when running the inverse FFT) and to change the low_pass_size to match the amount of padding. See the post "Syuzhet: Prodding the Frequency Domain" (or the R notebook attached to it) for details.
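
One way to read "match the amount of padding" is that zero-padding makes the frequency bins finer, so a cutoff expressed as a bin count has to grow by the same factor to keep the same physical cutoff. The sketch below illustrates that general FFT property on a synthetic sine wave; `peak_bin` and `scale_low_pass_size` are hypothetical names, `padding_factor` is assumed to be the ratio of padded length to original length, and none of this is taken from the patch itself.

```r
# Zero-based index of the strongest bin in the first half of the spectrum.
peak_bin <- function(x) {
  half <- floor(length(x) / 2)
  which.max(Mod(fft(x))[seq_len(half)]) - 1
}

n <- 100
cycles <- 5  # the sine completes 5 full cycles over the original n samples
signal <- sin(2 * pi * cycles * (0:(n - 1)) / n)

padding_factor <- 2
padded <- c(signal, rep(0, n * (padding_factor - 1)))

unpadded_bin <- peak_bin(signal)  # bin index equals the number of cycles
padded_bin <- peak_bin(padded)    # same frequency, bin index scaled by padding

# Hypothetical helper: scale a bin-count cutoff so it covers the same
# frequencies after padding.
scale_low_pass_size <- function(low_pass_size, padding_factor) {
  ceiling(low_pass_size * padding_factor)
}
```

With these inputs the peak moves from bin 5 to bin 10, so a filter that kept 3 components on the unpadded series would need about 6 on the padded one to pass the same frequencies.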

mjockers commented 9 years ago

Thanks Tommy. Merged your changes into the master.

tmmcguire commented 9 years ago

Thanks! Please let me know if there's any feedback on it; I'd like to learn more.