jwzimmer-zz / tv-tropening


Label the darn axes, NO BAD IDEAS #12

Open jwzimmer-zz opened 3 years ago

jwzimmer-zz commented 3 years ago

From trying to come up with what visuals I want in the paper, it has become clear I absolutely can't avoid labeling the axes anymore. I keep not doing it because I'm worried I'll do it wrong. So this is the No Bad Ideas version. If it's stupid I'm sure Dodds will let me know.

Basic idea: Which traits are most important to each "dimension"? Those are the traits with the most extreme weights in each ROW of V. Which characters best exemplify each "dimension"? Those are the characters with the most extreme weights in each COLUMN of U. How much more important is the first "dimension" than the second? That is given by the corresponding WEIGHT in Sigma.
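A minimal numpy sketch of that reading (hypothetical names: M is the characters-by-traits ratings matrix, character_names/trait_names its labels):

import numpy as np

# M, character_names, trait_names are hypothetical stand-ins for the data
U, S, Vt = np.linalg.svd(M, full_matrices=False)

dim = 0  # the first "dimension"

# traits with the most extreme weights in row `dim` of V^T
order = np.argsort(Vt[dim, :])
print("most negative traits:", [trait_names[i] for i in order[:5]])
print("most positive traits:", [trait_names[i] for i in order[-5:]])

# characters with the most extreme weights in column `dim` of U
order = np.argsort(U[:, dim])
print("most negative characters:", [character_names[i] for i in order[:5]])
print("most positive characters:", [character_names[i] for i in order[-5:]])

# relative importance of the first dimension vs the second
print("sigma ratio:", S[0] / S[1])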

jwzimmer-zz commented 3 years ago

Want to make: lists/word clouds based on the traits that have the most positive, most neutral, and most negative weights in each of the first 3 dimensions -- this should lead to a D&D-style alignment chart (3x3) which will hopefully show a clear pattern? Maybe also do the same with characters?

Pseudocode:

Using this tutorial: https://towardsdatascience.com/how-to-make-word-clouds-in-python-that-dont-suck-86518cdcb61f

Saving visualizations here as I make them so I can hopefully tell/ remember what they are in the future: https://docs.google.com/presentation/d/1_kc36iI6B2OmsZlbMxLB0xiT0ePaQQ2qh7NykKqsefI/edit?usp=sharing

I'm using this function in the file nextstep.py to make very basic word clouds. It gives the wordcloud Python package the scores for each trait in each row of V as if they were "frequencies" (even though they are not) and lets the built-in function generate_from_frequencies interpret them however it may.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

def simple_wordcloud(matrix_array,num_row,item_names):
    #assuming np array, and interested in rows (not columns)
    matrix_row = matrix_array[num_row,:]
    matrix_dict = dict(zip(item_names,matrix_row))

    # render every word in (nearly) black
    def black_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return "hsl(0,100%, 1%)"
    # white background, max_words=500, width/height 3000x2000 for higher quality
    wordcloud = WordCloud(background_color="white", width=3000, height=2000, max_words=500).generate_from_frequencies(matrix_dict)
    # recolor the words to black
    wordcloud.recolor(color_func = black_color_func)

    plt.imshow(wordcloud)
    return wordcloud

This seems to crash Spyder pretty often, so I made a lower-quality version that takes fewer words:

def simple_wordcloud(matrix_array,num_row,item_names):
    #assuming np array, and interested in rows (not columns)
    matrix_row = matrix_array[num_row,:]
    matrix_dict = dict(zip(item_names,matrix_row))

    # render every word in (nearly) black
    def black_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return "hsl(0,100%, 1%)"
    # white background, default size, max_words=300 to keep Spyder from crashing
    wordcloud = WordCloud(background_color="white", max_words=300).generate_from_frequencies(matrix_dict)
    # recolor the words to black
    wordcloud.recolor(color_func = black_color_func)

    plt.imshow(wordcloud)
    return wordcloud
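To render a grid of these (e.g. the first 6 rows at once), a loop like this should work (a sketch; V2 is V^T and col2 the trait labels, as used later in this thread):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(15, 7))
for i, ax in enumerate(axes.flat):
    plt.sca(ax)                    # direct the function's plt.imshow at this subplot
    simple_wordcloud(V2, i, col2)  # word cloud for the i-th row of V^T
    ax.set_title("Row %d of V" % (i + 1))
    ax.axis("off")
plt.tight_layout()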

For the first 6 rows of V, that results in:

[screenshot: word clouds for the first 6 rows of V]

For reference, the weights (singular values) in the relevant Sigma matrix are as follows; they get pretty small fairly quickly, e.g. by around the 15th value: [4571.60069027, 3977.77079978, 3148.95421275, 2330.72490479, 1863.71976093, 1422.81288847, 1389.55887554, 1311.97892059, 1024.52207029, 924.99578844, 890.04169256, 774.10750006, 728.56171378, 665.35637138, 607.12742436, 592.00578495, 567.349024, 517.6459256, 504.80145181, 496.07731617, 482.19264758, 476.40215009, 450.10001708, 430.57116746, 419.4340081, 409.40418421, 406.91557611, 394.68135566, 385.86736628, 377.25202325, 372.82985457, 356.41834577, 351.72495156, 347.74900228, 339.38741564, 333.79399487, 326.55896904, 324.55117824, 318.1295393, 315.10346038, 308.25490266, 299.26762091, 295.0152497, 289.96992691, 288.85032287, 281.9690584, 276.58233643, 272.43157464, 271.90675683, 266.18190642, 263.66959715, 259.37314041, 256.60410545, 254.70809149, 252.68163905, 247.4687916, 245.9929314, 244.85792913, 241.82939261, 239.93695879, 235.41983321, 231.65486434, 230.66525203, 226.50847219, 226.02510515, 224.1026829, 221.54346153, 218.66355964, 216.47694732, 215.91077504, 215.1921562, 213.26036446, 211.09757665, 208.67226747, 206.33412318, 203.46306598, 202.00129305, 198.56407551, 197.94191631, 196.92701143, 195.12457197, 192.86726501, 190.96810407, 189.87703712, 189.53880978, 189.06950455, 188.50868644, 185.43954795, 182.61832651, 181.50209107, 179.99456989, 178.44065033, 177.4464131, 176.71970982, 175.55113009, 174.82536567, 172.5053995, 171.26372319, 170.70552398, 168.26458816, 167.98707707, 165.43564178, 165.08935084, 164.83722953, 162.13498, 161.42803178, 160.30528848, 159.49239512, 159.21142423, 158.28515706, 157.13243679, 155.45907298, 153.86700243, 153.59045706, 152.35155954, 150.58777948, 149.58254526, 149.01649307, 148.12937946, 146.70195903, 145.70874362, 144.44385711, 143.42005057, 142.91038791, 141.60808627, 141.4631097, 140.21726391, 139.21397298, 137.97307267, 137.48926772, 136.25779283, 135.36367309, 134.63421905, 133.13706912, 132.23788945, 130.99681122, 130.49813038, 129.36471842, 129.21269304, 128.19432229, 126.64128126, 126.28955773, 125.6550039, 124.83269046, 124.14202611, 122.74294555, 120.90210089, 120.42513441, 119.67430339, 119.42790495, 119.24266103, 117.64425693, 117.36405301, 116.18928162, 115.39920124, 114.76582936, 113.78433957, 113.4101737, 112.08557423, 111.10900704, 110.64820963, 110.17308651, 109.86539458, 107.9222643, 107.68943644, 106.66859772, 105.97305812, 105.54842185, 104.91505923, 103.6099165, 102.85102213, 102.2707889, 101.34768192, 101.09396798, 100.60069694, 99.77079538, 98.96423342, 98.31576173, 97.81404952, 96.80348288, 96.30542631, 95.57328745, 95.13722686, 93.98035775, 93.2769668, 92.75871829, 92.42652245, 91.92542246, 91.05170611, 90.01036083, 89.7513737, 89.22258541, 88.78020526, 88.65292871, 87.40167041, 86.49717578, 85.02984127, 84.81686455, 84.40993647, 82.99396525, 82.35567233, 81.60991198, 81.36376152, 79.95434487, 79.39810207, 79.08318183, 77.83822367, 77.22776508, 76.30862441, 75.47880711, 74.9228648, 74.77301107, 73.84800751, 73.60236366, 72.87570326, 72.38778495, 71.67473456, 70.54334797, 69.59775162, 69.28375765, 68.05775428, 67.04052598, 65.98883931, 64.82344704, 64.49868912, 64.02335036, 63.06458598, 62.70576773, 62.01116387, 60.33286102, 59.29124147, 58.27457299, 56.17688565, 55.46391112, 53.41931643, 48.49385579]

Moving on to a more D&D-like chart...

def simple_wordcloud(matrix_dict):
    # render every word in (nearly) black
    def black_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return "hsl(0,100%, 1%)"
    # white background, max_words=300
    wordcloud = WordCloud(background_color="white", max_words=300).generate_from_frequencies(matrix_dict)
    # recolor the words to black
    wordcloud.recolor(color_func = black_color_func)
    plt.imshow(wordcloud)
    return wordcloud

def make_dd_wordcloud_dicts(matrix_array,num_row,item_names):
    #assuming np array, and interested in rows (not columns)
    matrix_row = matrix_array[num_row,:]
    matrix_dict = dict(zip(item_names,matrix_row))
    # sort traits by weight, most negative first (ascending)
    sorted_md = {k: v for k, v in sorted(matrix_dict.items(), key=lambda item: item[1])}
    traits_list = list(sorted_md.keys())
    scores_list = list(sorted_md.values())

    # split the sorted traits into ordered thirds (hard-coded split points)
    dict1 = dict(zip(traits_list[:89],scores_list[:89]))        # most negative third
    dict2 = dict(zip(traits_list[89:178],scores_list[89:178]))  # middle third
    dict3 = dict(zip(traits_list[178:],scores_list[178:]))      # most positive third
    return dict1,dict2,dict3
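If the hard-coded split points (89/178) become a problem, a variant could derive the thirds from the trait count instead (a sketch, untested):

def make_dd_wordcloud_dicts_dynamic(matrix_array, num_row, item_names):
    # same idea as above, but the thirds adapt to the number of traits
    matrix_row = matrix_array[num_row, :]
    sorted_pairs = sorted(zip(item_names, matrix_row), key=lambda pair: pair[1])
    third = len(sorted_pairs) // 3
    dict1 = dict(sorted_pairs[:third])           # most negative third
    dict2 = dict(sorted_pairs[third:2 * third])  # middle third
    dict3 = dict(sorted_pairs[2 * third:])       # most positive third
    return dict1, dict2, dict3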

Call like this for e.g. the 3rd row of V with means removed: d1_2,d2_2,d3_2 = make_dd_wordcloud_dicts(V2,2,col2), then do simple_wordcloud(d3_2) (etc.).

Produces this for the first 3 rows of V with means removed, dividing the traits into ordered thirds: the first 89 (most negative/least positive), then the middle 89, then the final 88 most positive words; those get put into 3 dicts by the above function. In the chart they are ordered the opposite way, with the MOST POSITIVE at the top, etc. Maybe that isn't a good way to do it, because the spread isn't the same for every row of V, so I should come up with a new strategy?

[screenshot: 3x3 alignment-style word-cloud chart for the first 3 rows of V]
jwzimmer-zz commented 3 years ago

Other visualization ideas

jwzimmer-zz commented 3 years ago

Meeting with Dodds notes

jwzimmer-zz commented 3 years ago

Looking at how the values in the rows of V2 (V^T) are distributed (per the above comment/conversation), using the version of V2 from running SVD with the overall mean (~49.65) removed, via e.g. plt.scatter(range(1,237),V2[21,:]).
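A small loop reproduces these, one figure per row (a sketch, assuming V2 as above):

import matplotlib.pyplot as plt

for i in range(3):  # first three rows of V^T; swap in any row indices
    plt.figure()
    plt.scatter(range(1, 237), V2[i, :])
    plt.title("Row %d of V" % (i + 1))
    plt.xlabel("trait index")
    plt.ylabel("weight in this dimension")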

First row of V: [screenshot]
Second row of V: [screenshot]
Third row of V: [screenshot]

Actually I think this might be easier to see as a scatter plot?

First row of V: [screenshot]
Second row of V: [screenshot]
Third row of V: [screenshot]
Fourth row of V: [screenshot]
Fifth row of V: [screenshot]
Sixth row of V: [screenshot]
Seventh row of V: [screenshot]
Eighth row of V: [screenshot]
Ninth row of V: [screenshot]
Tenth row of V: [screenshot]
11th row of V: [screenshot]
17th row of V: [screenshot]
22nd row of V: [screenshot]
27th row of V: [screenshot]
52nd row of V: [screenshot]
77th row of V: [screenshot]
102nd row of V: [screenshot]
202nd row of V: [screenshot]
236th row of V (last row): [screenshot]

My Interpretation

jwzimmer-zz commented 3 years ago

Looking at trait magnitude in the rows of V (V2), with the overall mean removed --> https://github.com/jwzimmer/tv-tropening/commit/1c5699de65d0b12960c514ece5e827d5706a5e4c

Using this code to render the bar charts:

import pandas as pd
import seaborn as sns

def vector_barchart(vector_names,vector,n,style="by_mag",ascending=False):
    """ vector_names should be the labels for the values in the vector
        vector should be the vector (ndarray)
        n should be the number of values you want displayed in the chart
        style should be the format of the chart (only "by_mag" so far)
        ascending=False will be most relevant traits by magnitude,
        ascending=True will be least relevant traits by magnitude"""
    n = min(n,len(vector_names))
    vectordf = pd.DataFrame()
    vectordf["Trait"] = vector_names  # was hard-coded to the global col2
    vectordf["Values"] = vector

    if style=="by_mag":
        vectordf["Magnitude"] = vectordf["Values"].abs()
        sorteddf = vectordf.sort_values(by="Magnitude",ascending=ascending)
        # keep the 2*n most (or least) extreme traits by magnitude
        plotguy = sorteddf.iloc[0:2*n]
    sns.barplot(x=plotguy["Values"], y=plotguy["Trait"])
    return vectordf, plotguy
[screenshots: bar charts of the largest-magnitude traits for several rows of V]

Interpretation: there do seem to be some patterns in the lower dimensions with "outlying" traits, e.g. a sort of leadership style component (captain<->first-mate seems to come up a lot) and a physical component (thick<->thin and tall<->short). And, in the lower dimensions, gender, sexuality, and procreation seem to come up a lot. To make a chart for a specific row of V, use vector_barchart(col2,V2[26,:],10,style="by_mag",ascending=False) (that's the 27th row with 10 traits shown).

jwzimmer-zz commented 3 years ago

Meeting with Dodds

jwzimmer-zz commented 3 years ago

Subtracting the mean of 50, rather than the overall mean.

[screenshots: bar charts with the theoretical mean of 50 removed]
jwzimmer-zz commented 3 years ago

Only the trait from the indicated side (positive on the left, negative on the right) with the largest magnitude, for each of the first 3 dimensions (theoretical mean removed)

[image: dim3]
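That selection is essentially just the argmax/argmin of each row (a sketch, assuming V2 and col2 as above):

import numpy as np

for i in range(3):
    row = V2[i, :]
    pos_trait = col2[int(np.argmax(row))]  # largest positive weight
    neg_trait = col2[int(np.argmin(row))]  # most negative weight
    print("dim %d: positive side = %s, negative side = %s" % (i + 1, pos_trait, neg_trait))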

jwzimmer-zz commented 3 years ago

To do:

jwzimmer-zz commented 3 years ago

Relative size of dimensions (sigma values, theoretical mean removed, first 20 dimensions)

[screenshot: scatter plot of the first 20 sigma values]
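The chart is essentially just (a sketch; S is a hypothetical name for the 1-D array of singular values from the theoretical-mean-removed SVD):

import matplotlib.pyplot as plt

plt.scatter(range(1, 21), S[:20])  # S: singular values (name is an assumption)
plt.xlabel("dimension")
plt.ylabel("sigma value")
plt.title("Relative size of the first 20 dimensions")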
jwzimmer-zz commented 3 years ago

Notes from talking with Dodds Oct 12

jwzimmer-zz commented 2 years ago

What can columns of V^T and rows of U tell us (as opposed to rows of V^T and columns of U)?

The columns of V^T tell us how a single trait contributes to each dimension of traits. We can look through the values of the columns to see which dimension that trait is most important to. The rows of U tell us how well a single character is described by each dimension. We can look through the values in the row to see which dimension best describes that character.
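As a sketch of how to read those off by magnitude (V2 = V^T and col2 as above; char_names is a hypothetical list of character labels):

import numpy as np

trait_idx = 134  # e.g. the diligent<->lazy column (see the indices below)
print(col2[trait_idx], "contributes most to dimension",
      int(np.argmax(np.abs(V2[:, trait_idx]))) + 1)

char_idx = 0  # char_names is hypothetical; any character index works
print(char_names[char_idx], "is best described by dimension",
      int(np.argmax(np.abs(U[char_idx, :]))) + 1)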

Plotting the first 3 traits -- diligent<->lazy, competent<->incompetent, disorganized<->self-disciplined -- with the highest magnitudes from dimension 1 (row 1 of V^T) in the first 15 dimensions:

plt.plot(V2[:,134][:15]); plt.plot(V2[:,31][:15]); plt.plot(V2[:,74][:15])

[image]

"Reversing" the last trait that is backwards from the other two:

plt.plot(V2[:,134][:15]); plt.plot(V2[:,31][:15]); plt.plot(-1*V2[:,74][:15])

[image]

We would not expect similar traits to track each other perfectly unless they had identical meanings. But seeing where they converge and diverge can perhaps help us pinpoint how some dimensions differ from each other.

Finding the order of the traits: bap_map[bap_map["low/left anchor"]=="hard"]

"hard<->soft" and "hard<->soft 2" track each other for the first 15 dimensions (good -- sanity check) plt.plot(V2[:,97][:15]); plt.plot(V2[:,182][:15]) image

It looks like they track each other more poorly as the dimensions progress, maybe indicating that the dimensions get less meaningful as they go -- at some point they represent quirks of our specific dataset rather than underlying structure, and replicating our exact initial data isn't meaningful or important.

[image]

Where do they diverge?

Dimensions 15-30: [image]

Dimensions 0-30: [image]

Dimensions 5-10: [image]

Dimensions 0-10: [image]
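One way to pin down where they diverge (a sketch): plot the absolute difference between the two columns, dimension by dimension.

import numpy as np
import matplotlib.pyplot as plt

# absolute difference between the two hard<->soft columns, per dimension
diff = np.abs(V2[:, 97] - V2[:, 182])
plt.plot(range(1, 31), diff[:30])
plt.xlabel("dimension")
plt.ylabel("absolute difference")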

So it looks like they diverge after the 7th or 8th dimension -- maybe evidence to focus on the first few?