jwzimmer-zz / tv-tropening


Label the darn axes, NO BAD IDEAS #12

Open jwzimmer-zz opened 3 years ago

jwzimmer-zz commented 3 years ago

From trying to come up with what visuals I want in the paper, it has become clear I absolutely can't avoid labeling the axes anymore. I keep not doing it because I'm worried I'll do it wrong. So this is the No Bad Ideas version. If it's stupid I'm sure Dodds will let me know.

Basic idea: Which traits are most important to each "dimension"? Those are the traits with the most extreme weights in each ROW of V. Which characters best exemplify each "dimension"? Those are the characters with the most extreme weights in each COLUMN of U. How much more important is the first "dimension" than the second? That is given by the corresponding WEIGHT in Sigma.
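A minimal numpy sketch of that reading (hypothetical names: M is the characters-by-traits ratings matrix, character_names/trait_names its labels):

import numpy as np

# M, character_names, trait_names are hypothetical stand-ins for the data
U, S, Vt = np.linalg.svd(M, full_matrices=False)

dim = 0  # the first "dimension"

# traits with the most extreme weights in row `dim` of V^T
order = np.argsort(Vt[dim, :])
print("most negative traits:", [trait_names[i] for i in order[:5]])
print("most positive traits:", [trait_names[i] for i in order[-5:]])

# characters with the most extreme weights in column `dim` of U
order = np.argsort(U[:, dim])
print("most negative characters:", [character_names[i] for i in order[:5]])
print("most positive characters:", [character_names[i] for i in order[-5:]])

# relative importance of the first dimension vs the second
print("sigma ratio:", S[0] / S[1])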

jwzimmer-zz commented 3 years ago

Want to make: lists/word clouds based on the traits that have the most positive, most neutral, and most negative weights in each of the first 3 dimensions -- this should lead to a D&D-style alignment chart (3x3) which will hopefully show a clear pattern? Maybe also do the same with characters?

Pseudocode:

Using this tutorial: https://towardsdatascience.com/how-to-make-word-clouds-in-python-that-dont-suck-86518cdcb61f

Saving visualizations here as I make them so I can hopefully tell/ remember what they are in the future: https://docs.google.com/presentation/d/1_kc36iI6B2OmsZlbMxLB0xiT0ePaQQ2qh7NykKqsefI/edit?usp=sharing

I'm using this function in the file nextstep.py to make very basic word clouds. It gives the wordcloud Python package the scores for each trait in each row of V as if they were "frequencies" (even though they are not) and lets the built-in function generate_from_frequencies interpret them however it may.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

def simple_wordcloud(matrix_array,num_row,item_names):
    #assuming np array, and interested in rows (not columns)
    matrix_row = matrix_array[num_row,:]
    matrix_dict = dict(zip(item_names,matrix_row))

    # render every word in (nearly) black
    def black_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return "hsl(0,100%, 1%)"
    # white background, max_words=500, width/height 3000x2000 for higher quality
    wordcloud = WordCloud(background_color="white", width=3000, height=2000, max_words=500).generate_from_frequencies(matrix_dict)
    # recolor the words to black
    wordcloud.recolor(color_func = black_color_func)

    plt.imshow(wordcloud)
    return wordcloud

This seems to crash Spyder pretty often, so I made a lower-quality version that takes fewer words:

def simple_wordcloud(matrix_array,num_row,item_names):
    #assuming np array, and interested in rows (not columns)
    matrix_row = matrix_array[num_row,:]
    matrix_dict = dict(zip(item_names,matrix_row))

    # render every word in (nearly) black
    def black_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return "hsl(0,100%, 1%)"
    # white background, default size, max_words=300 to keep Spyder from crashing
    wordcloud = WordCloud(background_color="white", max_words=300).generate_from_frequencies(matrix_dict)
    # recolor the words to black
    wordcloud.recolor(color_func = black_color_func)

    plt.imshow(wordcloud)
    return wordcloud
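To render a grid of these (e.g. the first 6 rows at once), a loop like this should work (a sketch; V2 is V^T and col2 the trait labels, as used later in this thread):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(15, 7))
for i, ax in enumerate(axes.flat):
    plt.sca(ax)                    # direct the function's plt.imshow at this subplot
    simple_wordcloud(V2, i, col2)  # word cloud for the i-th row of V^T
    ax.set_title("Row %d of V" % (i + 1))
    ax.axis("off")
plt.tight_layout()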

For the first 6 rows of V, that results in:

[screenshot: word clouds for the first 6 rows of V]

For reference, the weights (singular values) in the relevant Sigma matrix are as follows; they get pretty small fairly quickly, e.g. by around the 15th value: [4571.60069027, 3977.77079978, 3148.95421275, 2330.72490479, 1863.71976093, 1422.81288847, 1389.55887554, 1311.97892059, 1024.52207029, 924.99578844, 890.04169256, 774.10750006, 728.56171378, 665.35637138, 607.12742436, 592.00578495, 567.349024, 517.6459256, 504.80145181, 496.07731617, 482.19264758, 476.40215009, 450.10001708, 430.57116746, 419.4340081, 409.40418421, 406.91557611, 394.68135566, 385.86736628, 377.25202325, 372.82985457, 356.41834577, 351.72495156, 347.74900228, 339.38741564, 333.79399487, 326.55896904, 324.55117824, 318.1295393, 315.10346038, 308.25490266, 299.26762091, 295.0152497, 289.96992691, 288.85032287, 281.9690584, 276.58233643, 272.43157464, 271.90675683, 266.18190642, 263.66959715, 259.37314041, 256.60410545, 254.70809149, 252.68163905, 247.4687916, 245.9929314, 244.85792913, 241.82939261, 239.93695879, 235.41983321, 231.65486434, 230.66525203, 226.50847219, 226.02510515, 224.1026829, 221.54346153, 218.66355964, 216.47694732, 215.91077504, 215.1921562, 213.26036446, 211.09757665, 208.67226747, 206.33412318, 203.46306598, 202.00129305, 198.56407551, 197.94191631, 196.92701143, 195.12457197, 192.86726501, 190.96810407, 189.87703712, 189.53880978, 189.06950455, 188.50868644, 185.43954795, 182.61832651, 181.50209107, 179.99456989, 178.44065033, 177.4464131, 176.71970982, 175.55113009, 174.82536567, 172.5053995, 171.26372319, 170.70552398, 168.26458816, 167.98707707, 165.43564178, 165.08935084, 164.83722953, 162.13498, 161.42803178, 160.30528848, 159.49239512, 159.21142423, 158.28515706, 157.13243679, 155.45907298, 153.86700243, 153.59045706, 152.35155954, 150.58777948, 149.58254526, 149.01649307, 148.12937946, 146.70195903, 145.70874362, 144.44385711, 143.42005057, 142.91038791, 141.60808627, 141.4631097, 140.21726391, 139.21397298, 137.97307267, 137.48926772, 136.25779283, 135.36367309, 134.63421905, 133.13706912, 132.23788945, 130.99681122, 130.49813038, 129.36471842, 129.21269304, 128.19432229, 126.64128126, 126.28955773, 125.6550039, 124.83269046, 124.14202611, 122.74294555, 120.90210089, 120.42513441, 119.67430339, 119.42790495, 119.24266103, 117.64425693, 117.36405301, 116.18928162, 115.39920124, 114.76582936, 113.78433957, 113.4101737, 112.08557423, 111.10900704, 110.64820963, 110.17308651, 109.86539458, 107.9222643, 107.68943644, 106.66859772, 105.97305812, 105.54842185, 104.91505923, 103.6099165, 102.85102213, 102.2707889, 101.34768192, 101.09396798, 100.60069694, 99.77079538, 98.96423342, 98.31576173, 97.81404952, 96.80348288, 96.30542631, 95.57328745, 95.13722686, 93.98035775, 93.2769668, 92.75871829, 92.42652245, 91.92542246, 91.05170611, 90.01036083, 89.7513737, 89.22258541, 88.78020526, 88.65292871, 87.40167041, 86.49717578, 85.02984127, 84.81686455, 84.40993647, 82.99396525, 82.35567233, 81.60991198, 81.36376152, 79.95434487, 79.39810207, 79.08318183, 77.83822367, 77.22776508, 76.30862441, 75.47880711, 74.9228648, 74.77301107, 73.84800751, 73.60236366, 72.87570326, 72.38778495, 71.67473456, 70.54334797, 69.59775162, 69.28375765, 68.05775428, 67.04052598, 65.98883931, 64.82344704, 64.49868912, 64.02335036, 63.06458598, 62.70576773, 62.01116387, 60.33286102, 59.29124147, 58.27457299, 56.17688565, 55.46391112, 53.41931643, 48.49385579]

Moving on to a more D&D-like chart...

def simple_wordcloud(matrix_dict):
    # render every word in (nearly) black
    def black_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return "hsl(0,100%, 1%)"
    # white background, max_words=300
    wordcloud = WordCloud(background_color="white", max_words=300).generate_from_frequencies(matrix_dict)
    # recolor the words to black
    wordcloud.recolor(color_func = black_color_func)
    plt.imshow(wordcloud)
    return wordcloud

def make_dd_wordcloud_dicts(matrix_array,num_row,item_names):
    #assuming np array, and interested in rows (not columns)
    matrix_row = matrix_array[num_row,:]
    matrix_dict = dict(zip(item_names,matrix_row))
    # sort traits by weight, most negative first (ascending)
    sorted_md = {k: v for k, v in sorted(matrix_dict.items(), key=lambda item: item[1])}
    traits_list = list(sorted_md.keys())
    scores_list = list(sorted_md.values())

    # split the sorted traits into ordered thirds (hard-coded split points)
    dict1 = dict(zip(traits_list[:89],scores_list[:89]))        # most negative third
    dict2 = dict(zip(traits_list[89:178],scores_list[89:178]))  # middle third
    dict3 = dict(zip(traits_list[178:],scores_list[178:]))      # most positive third
    return dict1,dict2,dict3
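If the hard-coded split points (89/178) become a problem, a variant could derive the thirds from the trait count instead (a sketch, untested):

def make_dd_wordcloud_dicts_dynamic(matrix_array, num_row, item_names):
    # same idea as above, but the thirds adapt to the number of traits
    matrix_row = matrix_array[num_row, :]
    sorted_pairs = sorted(zip(item_names, matrix_row), key=lambda pair: pair[1])
    third = len(sorted_pairs) // 3
    dict1 = dict(sorted_pairs[:third])           # most negative third
    dict2 = dict(sorted_pairs[third:2 * third])  # middle third
    dict3 = dict(sorted_pairs[2 * third:])       # most positive third
    return dict1, dict2, dict3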

Call like this for e.g. the 3rd row of V with means removed: d1_2,d2_2,d3_2 = make_dd_wordcloud_dicts(V2,2,col2), then do simple_wordcloud(d3_2) (etc.).

Produces this for the first 3 rows of V with means removed, dividing the traits into ordered thirds: the first 89 (most negative/least positive), then the middle 89, then the final 88 most positive words; those get put into 3 dicts by the above function. In the chart they are ordered the opposite way, with the MOST POSITIVE at the top, etc. Maybe that isn't a good way to do it, because the spread isn't the same for every row of V, so I should come up with a new strategy?

[screenshot: 3x3 alignment-style word-cloud chart for the first 3 rows of V]
jwzimmer-zz commented 3 years ago

Other visualization ideas

jwzimmer-zz commented 3 years ago

Meeting with Dodds notes

jwzimmer-zz commented 3 years ago

Looking at how the values in the rows of V2 (V^T) are distributed (per the above comment/conversation), using the version of V2 from running SVD with the overall mean (~49.65) removed, via e.g. plt.scatter(range(1,237),V2[21,:]).
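A small loop reproduces these, one figure per row (a sketch, assuming V2 as above):

import matplotlib.pyplot as plt

for i in range(3):  # first three rows of V^T; swap in any row indices
    plt.figure()
    plt.scatter(range(1, 237), V2[i, :])
    plt.title("Row %d of V" % (i + 1))
    plt.xlabel("trait index")
    plt.ylabel("weight in this dimension")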

First row of V: [screenshot]
Second row of V: [screenshot]
Third row of V: [screenshot]

Actually I think this might be easier to see as a scatter plot?

First row of V: [screenshot]
Second row of V: [screenshot]
Third row of V: [screenshot]
Fourth row of V: [screenshot]
Fifth row of V: [screenshot]
Sixth row of V: [screenshot]
Seventh row of V: [screenshot]
Eighth row of V: [screenshot]
Ninth row of V: [screenshot]
Tenth row of V: [screenshot]
11th row of V: [screenshot]
17th row of V: [screenshot]
22nd row of V: [screenshot]
27th row of V: [screenshot]
52nd row of V: [screenshot]
77th row of V: [screenshot]
102nd row of V: [screenshot]
202nd row of V: [screenshot]
236th row of V (last row): [screenshot]

My Interpretation

jwzimmer-zz commented 3 years ago

Looking at trait magnitude in the rows of V (V2), with the overall mean removed --> https://github.com/jwzimmer/tv-tropening/commit/1c5699de65d0b12960c514ece5e827d5706a5e4c

Using this code to render the bar charts:

import pandas as pd
import seaborn as sns

def vector_barchart(vector_names,vector,n,style="by_mag",ascending=False):
    """ vector_names should be the labels for the values in the vector
        vector should be the vector (ndarray)
        n should be the number of values you want displayed in the chart
        style should be the format of the chart (only "by_mag" so far)
        ascending=False will be most relevant traits by magnitude,
        ascending=True will be least relevant traits by magnitude"""
    n = min(n,len(vector_names))
    vectordf = pd.DataFrame()
    vectordf["Trait"] = vector_names  # was hard-coded to the global col2
    vectordf["Values"] = vector

    if style=="by_mag":
        vectordf["Magnitude"] = vectordf["Values"].abs()
        sorteddf = vectordf.sort_values(by="Magnitude",ascending=ascending)
        # keep the 2*n most (or least) extreme traits by magnitude
        plotguy = sorteddf.iloc[0:2*n]
    sns.barplot(x=plotguy["Values"], y=plotguy["Trait"])
    return vectordf, plotguy
[screenshots: bar charts of the largest-magnitude traits for several rows of V]

Interpretation: there do seem to be some patterns in the lower dimensions with "outlying" traits, e.g. a sort of leadership style component (captain<->first-mate seems to come up a lot) and a physical component (thick<->thin and tall<->short). And, in the lower dimensions, gender, sexuality, and procreation seem to come up a lot. To make a chart for a specific row of V, use vector_barchart(col2,V2[26,:],10,style="by_mag",ascending=False) (that's the 27th row with 10 traits shown).

jwzimmer-zz commented 3 years ago

Meeting with Dodds

jwzimmer-zz commented 3 years ago

Subtracting the mean of 50, rather than the overall mean.

[screenshots: bar charts with the theoretical mean of 50 removed]
jwzimmer-zz commented 3 years ago

Only the trait from the indicated side (positive on the left, negative on the right) with the largest magnitude, for each of the first 3 dimensions (theoretical mean removed)

[image: dim3]
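That selection is essentially just the argmax/argmin of each row (a sketch, assuming V2 and col2 as above):

import numpy as np

for i in range(3):
    row = V2[i, :]
    pos_trait = col2[int(np.argmax(row))]  # largest positive weight
    neg_trait = col2[int(np.argmin(row))]  # most negative weight
    print("dim %d: positive side = %s, negative side = %s" % (i + 1, pos_trait, neg_trait))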

jwzimmer-zz commented 3 years ago

To do:

jwzimmer-zz commented 3 years ago

Relative size of dimensions (sigma values, theoretical mean removed, first 20 dimensions)

[screenshot: scatter plot of the first 20 sigma values]
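The chart is essentially just (a sketch; S is a hypothetical name for the 1-D array of singular values from the theoretical-mean-removed SVD):

import matplotlib.pyplot as plt

plt.scatter(range(1, 21), S[:20])  # S: singular values (name is an assumption)
plt.xlabel("dimension")
plt.ylabel("sigma value")
plt.title("Relative size of the first 20 dimensions")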
jwzimmer-zz commented 3 years ago

Notes from talking with Dodds Oct 12

jwzimmer-zz commented 2 years ago

What can columns of V^T and rows of U tell us (as opposed to rows of V^T and columns of U)?

The columns of V^T tell us how a single trait contributes to each dimension of traits. We can look through the values of the columns to see which dimension that trait is most important to. The rows of U tell us how well a single character is described by each dimension. We can look through the values in the row to see which dimension best describes that character.
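As a sketch of how to read those off by magnitude (V2 = V^T and col2 as above; char_names is a hypothetical list of character labels):

import numpy as np

trait_idx = 134  # e.g. the diligent<->lazy column (see the indices below)
print(col2[trait_idx], "contributes most to dimension",
      int(np.argmax(np.abs(V2[:, trait_idx]))) + 1)

char_idx = 0  # char_names is hypothetical; any character index works
print(char_names[char_idx], "is best described by dimension",
      int(np.argmax(np.abs(U[char_idx, :]))) + 1)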

Plotting the first 3 traits -- diligent<->lazy, competent<->incompetent, disorganized<->self-disciplined -- with the highest magnitudes from dimension 1 (row 1 of V^T) in the first 15 dimensions:

plt.plot(V2[:,134][:15]); plt.plot(V2[:,31][:15]); plt.plot(V2[:,74][:15])

[image]

"Reversing" the last trait that is backwards from the other two:

plt.plot(V2[:,134][:15]); plt.plot(V2[:,31][:15]); plt.plot(-1*V2[:,74][:15])

[image]

We would not expect similar traits to track each other perfectly unless they had identical meanings. But seeing where they converge and diverge can perhaps help us pinpoint how some dimensions differ from each other.

Finding the order of the traits: bap_map[bap_map["low/left anchor"]=="hard"]

"hard<->soft" and "hard<->soft 2" track each other for the first 15 dimensions (good -- sanity check) plt.plot(V2[:,97][:15]); plt.plot(V2[:,182][:15]) image

It looks like they track each other more poorly as the dimensions progress, maybe indicating that the dimensions get less meaningful as they go -- at some point they represent quirks of our specific dataset rather than underlying structure, and replicating our exact initial data isn't meaningful or important.

[image]

Where do they diverge?

Dimensions 15-30: [image]

Dimensions 0-30: [image]

Dimensions 5-10: [image]

Dimensions 0-10: [image]
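One way to pin down where they diverge (a sketch): plot the absolute difference between the two columns, dimension by dimension.

import numpy as np
import matplotlib.pyplot as plt

# absolute difference between the two hard<->soft columns, per dimension
diff = np.abs(V2[:, 97] - V2[:, 182])
plt.plot(range(1, 31), diff[:30])
plt.xlabel("dimension")
plt.ylabel("absolute difference")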

So it looks like they diverge after the 7th or 8th dimension -- maybe evidence to focus on the first few?