jwzimmer-zz opened 3 years ago
Want to make: lists/word clouds based on the traits which have the most positive, most neutral, and most negative weights in each of the first 3 dimensions. This should lead to a D&D-style alignment chart (3x3) which will hopefully show a clear pattern? Maybe also do this with characters?
Pseudocode:
Using this tutorial: https://towardsdatascience.com/how-to-make-word-clouds-in-python-that-dont-suck-86518cdcb61f
Saving visualizations here as I make them so I can hopefully tell/remember what they are in the future: https://docs.google.com/presentation/d/1_kc36iI6B2OmsZlbMxLB0xiT0ePaQQ2qh7NykKqsefI/edit?usp=sharing
I'm using the function below (in the file nextstep.py) to make very basic word clouds: I give the wordcloud Python package the scores for each trait in a given row of V as if they were "frequencies", even though they are not, and let the built-in function generate_from_frequencies interpret them however it may.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def simple_wordcloud(matrix_array, num_row, item_names):
    # assuming np array, and interested in rows (not columns)
    matrix_row = matrix_array[num_row, :]
    matrix_dict = dict(zip(item_names, matrix_row))
    # color function that makes every word black
    def black_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return "hsl(0, 100%, 1%)"
    # white background, max_words=500, width/height 3000x2000 for higher quality
    wordcloud = WordCloud(background_color="white", width=3000, height=2000,
                          max_words=500).generate_from_frequencies(matrix_dict)
    # set the word color to black
    wordcloud.recolor(color_func=black_color_func)
    plt.imshow(wordcloud)
    return wordcloud
This seems to cause Spyder to crash pretty often, so I made a lower-quality version that takes fewer words:
def simple_wordcloud(matrix_array, num_row, item_names):
    # assuming np array, and interested in rows (not columns)
    matrix_row = matrix_array[num_row, :]
    matrix_dict = dict(zip(item_names, matrix_row))
    # change the word color to black
    def black_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return "hsl(0, 100%, 1%)"
    # set the wordcloud background color to white
    wordcloud = WordCloud(background_color="white", max_words=300).generate_from_frequencies(matrix_dict)
    # set the word color to black
    wordcloud.recolor(color_func=black_color_func)
    plt.imshow(wordcloud)
    return wordcloud
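For reference, a call looks like this (assuming V2 is the mean-removed V^T and col2 is the list of trait labels, as used further down in this issue):

    simple_wordcloud(V2, 0, col2)  # word cloud for the 1st row of V^T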
For the first 6 rows of V, that results in: [word-cloud images omitted; saved in the slides linked above]
For reference, the weights (singular values) in the relevant Sigma matrix are as follows; they get pretty small by around e.g. the 15th entry: [4571.60069027, 3977.77079978, 3148.95421275, 2330.72490479, 1863.71976093, 1422.81288847, 1389.55887554, 1311.97892059, 1024.52207029, 924.99578844, 890.04169256, 774.10750006, 728.56171378, 665.35637138, 607.12742436, 592.00578495, 567.349024, 517.6459256, 504.80145181, 496.07731617, 482.19264758, 476.40215009, 450.10001708, 430.57116746, 419.4340081, 409.40418421, 406.91557611, 394.68135566, 385.86736628, 377.25202325, 372.82985457, 356.41834577, 351.72495156, 347.74900228, 339.38741564, 333.79399487, 326.55896904, 324.55117824, 318.1295393, 315.10346038, 308.25490266, 299.26762091, 295.0152497, 289.96992691, 288.85032287, 281.9690584, 276.58233643, 272.43157464, 271.90675683, 266.18190642, 263.66959715, 259.37314041, 256.60410545, 254.70809149, 252.68163905, 247.4687916, 245.9929314, 244.85792913, 241.82939261, 239.93695879, 235.41983321, 231.65486434, 230.66525203, 226.50847219, 226.02510515, 224.1026829, 221.54346153, 218.66355964, 216.47694732, 215.91077504, 215.1921562, 213.26036446, 211.09757665, 208.67226747, 206.33412318, 203.46306598, 202.00129305, 198.56407551, 197.94191631, 196.92701143, 195.12457197, 192.86726501, 190.96810407, 189.87703712, 189.53880978, 189.06950455, 188.50868644, 185.43954795, 182.61832651, 181.50209107, 179.99456989, 178.44065033, 177.4464131, 176.71970982, 175.55113009, 174.82536567, 172.5053995, 171.26372319, 170.70552398, 168.26458816, 167.98707707, 165.43564178, 165.08935084, 164.83722953, 162.13498, 161.42803178, 160.30528848, 159.49239512, 159.21142423, 158.28515706, 157.13243679, 155.45907298, 153.86700243, 153.59045706, 152.35155954, 150.58777948, 149.58254526, 149.01649307, 148.12937946, 146.70195903, 145.70874362, 144.44385711, 143.42005057, 142.91038791, 141.60808627, 141.4631097, 140.21726391, 139.21397298, 137.97307267, 137.48926772, 136.25779283, 135.36367309, 134.63421905, 133.13706912, 132.23788945, 130.99681122, 130.49813038, 129.36471842, 129.21269304, 128.19432229, 126.64128126, 126.28955773, 125.6550039, 124.83269046, 124.14202611, 122.74294555, 120.90210089, 120.42513441, 119.67430339, 119.42790495, 119.24266103, 117.64425693, 117.36405301, 116.18928162, 115.39920124, 114.76582936, 113.78433957, 113.4101737, 112.08557423, 111.10900704, 110.64820963, 110.17308651, 109.86539458, 107.9222643, 107.68943644, 106.66859772, 105.97305812, 105.54842185, 104.91505923, 103.6099165, 102.85102213, 102.2707889, 101.34768192, 101.09396798, 100.60069694, 99.77079538, 98.96423342, 98.31576173, 97.81404952, 96.80348288, 96.30542631, 95.57328745, 95.13722686, 93.98035775, 93.2769668, 92.75871829, 92.42652245, 91.92542246, 91.05170611, 90.01036083, 89.7513737, 89.22258541, 88.78020526, 88.65292871, 87.40167041, 86.49717578, 85.02984127, 84.81686455, 84.40993647, 82.99396525, 82.35567233, 81.60991198, 81.36376152, 79.95434487, 79.39810207, 79.08318183, 77.83822367, 77.22776508, 76.30862441, 75.47880711, 74.9228648, 74.77301107, 73.84800751, 73.60236366, 72.87570326, 72.38778495, 71.67473456, 70.54334797, 69.59775162, 69.28375765, 68.05775428, 67.04052598, 65.98883931, 64.82344704, 64.49868912, 64.02335036, 63.06458598, 62.70576773, 62.01116387, 60.33286102, 59.29124147, 58.27457299, 56.17688565, 55.46391112, 53.41931643, 48.49385579]
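To quantify how quickly these fall off, here is a minimal sketch (the sigma name and the energy calculation are my additions, not code from the repo); paste in the full list of values from above:

    import numpy as np

    sigma = np.array([4571.60069027, 3977.77079978, 3148.95421275,
                      2330.72490479, 1863.71976093, 1422.81288847])  # ...full list from above
    # cumulative fraction of total "energy" (sum of squared singular values)
    energy = np.cumsum(sigma ** 2) / np.sum(sigma ** 2)
    print(energy)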
Moving on to a more D&D-like chart...
def simple_wordcloud(matrix_dict):
    # change the word color to black
    def black_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return "hsl(0, 100%, 1%)"
    # set the wordcloud background color to white
    wordcloud = WordCloud(background_color="white", max_words=300).generate_from_frequencies(matrix_dict)
    # set the word color to black
    wordcloud.recolor(color_func=black_color_func)
    plt.imshow(wordcloud)
    return wordcloud
def make_dd_wordcloud_dicts(matrix_array, num_row, item_names):
    # assuming np array, and interested in rows (not columns)
    matrix_row = matrix_array[num_row, :]
    matrix_dict = dict(zip(item_names, matrix_row))
    # sorted from most negative to least negative, ascending
    sorted_md = {k: v for k, v in sorted(matrix_dict.items(), key=lambda item: item[1])}
    traits_list = list(sorted_md.keys())
    scores_list = list(sorted_md.values())
    # split the sorted traits into ordered thirds
    dict1 = dict(zip(traits_list[:89], scores_list[:89]))
    dict2 = dict(zip(traits_list[89:178], scores_list[89:178]))
    dict3 = dict(zip(traits_list[178:], scores_list[178:]))
    return dict1, dict2, dict3
Call it like this for e.g. the 3rd row of V with means removed: d1_2, d2_2, d3_2 = make_dd_wordcloud_dicts(V2, 2, col2), then do simple_wordcloud(d3_2), etc.
That produces the following for the first 3 rows of V with means removed [word-cloud images omitted], dividing the traits into ordered thirds: the first 89 with the most negative/least positive scores, then the next 89 middle traits, then the final 88 most positive traits; those get put into 3 dicts by the above function. The way they are ordered in the chart is the opposite, with the MOST POSITIVE at the top, etc. Maybe that isn't a good way to do it, because the spread isn't the same for every row of V, so I should come up with a new strategy; one possibility is sketched below.
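One possible new strategy (my sketch, not existing repo code): split each row at its own tertiles instead of at fixed counts, so the thirds adapt to that row's spread:

    import numpy as np

    def make_tertile_wordcloud_dicts(matrix_array, num_row, item_names):
        # split a row of V at its own tertiles rather than at fixed counts
        row = matrix_array[num_row, :]
        low_cut, high_cut = np.quantile(row, [1 / 3, 2 / 3])
        neg = {t: v for t, v in zip(item_names, row) if v <= low_cut}
        mid = {t: v for t, v in zip(item_names, row) if low_cut < v <= high_cut}
        pos = {t: v for t, v in zip(item_names, row) if v > high_cut}
        return neg, mid, pos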
Other visualization ideas
Meeting with Dodds notes
Looking at how the values in the rows of V2 (V^T) are distributed (per the above comment/conversation), using the version of V2 from running SVD with the overall mean (~49.65) removed, via e.g. plt.scatter(range(1,237), V2[21,:]).
[Plots of the value distributions for the first, second, and third rows of V]
Actually I think this might be easier to see as a scatter plot?
[Scatter plots for rows 1 through 10, 11, 17, 22, 27, 52, 77, 102, 202, and 236 (the last row) of V]
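For reproducibility, a minimal sketch of how those scatter plots can be generated in a loop (assuming V2 is the mean-removed V^T with 236 columns, as above; the 0-based indices mirror the 1-based row numbers listed):

    import matplotlib.pyplot as plt

    rows_to_plot = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 16, 21, 26, 51, 76, 101, 201, 235]
    for i in rows_to_plot:
        plt.figure()
        plt.scatter(range(1, 237), V2[i, :])
        plt.title(f"Row {i + 1} of V^T")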
My Interpretation
Looking at trait magnitude in the rows of V (V2), with the overall mean removed --> https://github.com/jwzimmer/tv-tropening/commit/1c5699de65d0b12960c514ece5e827d5706a5e4c
Using this code to render the bar charts:
import pandas as pd
import seaborn as sns

def vector_barchart(vector_names, vector, n, style="by_mag", ascending=False):
    """vector_names should be the labels for the values in the vector
    vector should be the vector (ndarray)
    n should be the number of values you want displayed in the chart
    style should be the format of the chart
    ascending=False will be most relevant traits by magnitude,
    ascending=True will be least relevant traits by magnitude"""
    n = min(n, len(vector_names))
    plotguy = None
    vectordf = pd.DataFrame()
    vectordf["Trait"] = vector_names  # was hard-coded to the global col2
    vectordf["Values"] = vector
    if style == "by_mag":
        vectordf["Magnitude"] = vectordf["Values"].abs()
        sorteddf = vectordf.sort_values(by="Magnitude", ascending=ascending)
        # plotguy = sorteddf.iloc[-2*n:].iloc[::-1]
        plotguy = sorteddf.iloc[0:2 * n]
        sns.barplot(x=plotguy["Values"], y=plotguy["Trait"])
    return vectordf, plotguy
Interpretation: there do seem to be some patterns in the lower dimensions with "outlying" traits, e.g. a sort of leadership-style component (captain<->first-mate seems to come up a lot) and a physical component (thick<->thin and tall<->short). And, in the lower dimensions, gender, sexuality, and procreation seem to come up a lot. To make a chart for a specific row of V, use vector_barchart(col2, V2[26,:], 10, style="by_mag", ascending=False) (that's the 27th row with 10 traits shown).
Meeting with Dodds
Subtracting the theoretical mean of 50, rather than the overall empirical mean.
Only the trait with the largest magnitude from each indicated side (positive left, negative right), for the first 3 dimensions (theoretical mean removed)
To do:
Relative size of dimensions (sigma values, theoretical mean removed, first 20 dimensions)
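A one-line sketch for that to-do (assuming matplotlib.pyplot as plt and the sigma array from the snippet above):

    plt.bar(range(1, 21), sigma[:20])  # relative size of the first 20 dimensions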
Notes from talking with Dodds Oct 12
What can columns of V^T and rows of U tell us (as opposed to rows of V^T and columns of U)?
The columns of V^T tell us how a single trait contributes to each dimension of traits. We can look through the values of the columns to see which dimension that trait is most important to. The rows of U tell us how well a single character is described by each dimension. We can look through the values in the row to see which dimension best describes that character.
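A minimal sketch of both lookups (assuming U and V2 (= V^T) come from the same SVD as above; trait_idx and char_idx are hypothetical indices chosen for illustration):

    import numpy as np

    trait_idx = 134  # e.g. diligent<->lazy, per the column indices used below
    char_idx = 0     # some character
    # dimension to which this trait contributes most strongly (a column of V^T)
    best_dim_for_trait = np.argmax(np.abs(V2[:, trait_idx]))
    # dimension that best describes this character (a row of U)
    best_dim_for_char = np.argmax(np.abs(U[char_idx, :]))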
Plotting the first 3 traits -- diligent<->lazy, competent<->incompetent, disorganized<->self-disciplined -- with the highest magnitudes from dimension 1 (row 1 of V^T) in the first 15 dimensions:
plt.plot(V2[:,134][:15]); plt.plot(V2[:,31][:15]); plt.plot(V2[:,74][:15])
"Reversing" the last trait that is backwards from the other two:
plt.plot(V2[:,134][:15]); plt.plot(V2[:,31][:15]); plt.plot(-1*V2[:,74][:15])
We would not expect similar traits to track each other perfectly unless they had identical meanings. But seeing where they converge and diverge can perhaps help us pinpoint how some dimensions differ from each other.
Finding the order of the traits:
bap_map[bap_map["low/left anchor"]=="hard"]
"hard<->soft" and "hard<->soft 2" track each other for the first 15 dimensions (good -- sanity check)
plt.plot(V2[:,97][:15]); plt.plot(V2[:,182][:15])
It looks like, as the dimensions progress, they track each other more poorly, maybe indicating that the dimensions get less meaningful as they progress: at some point they represent quirks of our specific dataset rather than underlying structure, and replicating our exact initial data isn't meaningful or important.
Where do they diverge? [Plots of the two traits over dimensions 15-30, 0-30, 5-10, and 0-10]
So they look like they diverge after the 7th or 8th dimension -- maybe evidence to focus on the first few?
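One way to put a rough number on where they diverge (my sketch; uses the two hard<->soft column indices from above, and an arbitrary threshold):

    import numpy as np

    a, b = V2[:, 97], V2[:, 182]  # "hard<->soft" and "hard<->soft 2"
    gap = np.abs(a - b)  # per-dimension gap between the two nominally identical traits
    # flag dimensions where the gap exceeds half the traits' typical early-dimension scale
    threshold = 0.5 * np.mean(np.abs(np.concatenate([a[:15], b[:15]])))
    print(np.nonzero(gap > threshold)[0][:5])  # first few divergent dimensions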
From trying to come up with what visuals I want in the paper, it has become clear I absolutely can't avoid labeling the axes anymore. I keep not doing it because I'm worried I'll do it wrong. So this is the No Bad Ideas version. If it's stupid I'm sure Dodds will let me know.
Basic idea: Which traits are most important to each "dimension"? Those are the traits with the most extreme weights in each ROW of V. Which characters best exemplify each "dimension"? Those are the characters with the most extreme weights in each COLUMN of U. How much more important is the first "dimension" compared to the second? That is given by the relevant WEIGHT in Sigma. A minimal sketch of all three is below.
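Sketch for a dimension k (assuming U, V2 (= V^T), a sigma array of weights, and the col2 trait labels used above; the top-10 cutoff is arbitrary):

    import numpy as np

    k = 0  # first dimension
    # traits with the most extreme weights in row k of V^T
    top_trait_idx = np.argsort(np.abs(V2[k, :]))[::-1][:10]
    print([col2[i] for i in top_trait_idx])
    # characters with the most extreme weights in column k of U (as row indices)
    top_char_idx = np.argsort(np.abs(U[:, k]))[::-1][:10]
    # relative importance of dimension k versus dimension k+1
    print(sigma[k] / sigma[k + 1])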