jwzimmer-zz / tv-tropening


SVD/ PCA on "character space" #6

Open jwzimmer-zz opened 3 years ago

jwzimmer-zz commented 3 years ago

https://openpsychometrics.org/_rawdata/

From the tropes meeting with Peter & Phil.

jwzimmer-zz commented 3 years ago

Making a plan:

jwzimmer-zz commented 3 years ago

Codebook/ documenting what things are

Original data artifacts

jwzimmer-zz commented 3 years ago

Cleaned data artifacts

jwzimmer-zz commented 3 years ago

Rerunning the SVD without removing any means

Artifacts: all the artifacts created as part of the SVD process are saved to files whose names match the variables they were assigned to as output of runSVD, in this commit: https://github.com/jwzimmer/tv-tropening/commit/28845f7f4966d40a586aaac36199242bc0c716ad

This is how I reran the SVD:


```
import numpy as np
import pandas as pd
import scipy as sp
import scipy.linalg

df_traits = pd.read_json("July2021_df_traits.json")

clean_column_dict = get_json("July2021_cleaned_column_dict.json")

def runSVD(df1, dropcols=['unnamed.1', 'name', 'work'], n=None):
    # Drop the non-numeric metadata columns before decomposing
    for x in dropcols:
        df1 = df1.drop(x, axis=1)
    if n is None:
        n = df1.shape[1] - 1  # (currently unused)
    X = df1.to_numpy()
    # Decompose; note np.linalg.svd returns V already transposed (V^T)
    U, D, V = np.linalg.svd(X)
    # Get dimensions of X
    M, N = X.shape
    # Construct the Sigma matrix of the SVD (it simply pads the diagonal of
    # singular values with null row vectors to match the dimensions of X)
    Sig = sp.linalg.diagsvd(D, M, N)
    # Now you can get X back:
    remakeX = np.dot(U, np.dot(Sig, V))
    assert np.abs(remakeX - X).sum() < 1e-5  # abs() so signed errors can't cancel
    return df1, U, D, V, Sig, X, remakeX

df1, U, D, V, Sig, X, remakeX = runSVD(df_traits)

df1.to_json("July2021_SVD_df.json")
write_json(U.tolist(), "July2021_SVD_U.json")
write_json(D.tolist(), "July2021_SVD_D.json")
write_json(V.tolist(), "July2021_SVD_V.json")
write_json(Sig.tolist(), "July2021_SVD_Sig.json")
write_json(X.tolist(), "July2021_SVD_X.json")
write_json(remakeX.tolist(), "July2021_SVD_remakeX.json")
```
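Note: get_json and write_json are helpers defined elsewhere in the repo and not shown in this comment. A minimal sketch of what they presumably do (an assumption, not the repo's actual code):

```
import json

def get_json(path):
    # Assumed helper: read a JSON file into a Python object
    with open(path) as f:
        return json.load(f)

def write_json(obj, path):
    # Assumed helper: write a Python object out as JSON
    with open(path, "w") as f:
        json.dump(obj, f)
```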
jwzimmer-zz commented 3 years ago

Rerunning the SVD with the mean removed from each trait

Artifacts: all the artifacts created as part of the SVD process are saved to files whose names match the variables they were assigned to as output of runSVD, in this commit: https://github.com/jwzimmer/tv-tropening/commit/0ed3f652a3105a89cdd77d28cf28b1ee5365a44b

The code for rerunning it:


```
import numpy as np
import pandas as pd
import scipy as sp
import scipy.linalg

df_bap = pd.read_json("July2021_df_bap.json")
df_traits = pd.read_json("July2021_df_traits.json")

clean_column_dict = get_json("July2021_cleaned_column_dict.json")

def runSVD(df1, dropcols=['unnamed.1', 'name', 'work'], n=None):
    # Same function as in the previous comment
    for x in dropcols:
        df1 = df1.drop(x, axis=1)
    if n is None:
        n = df1.shape[1] - 1  # (currently unused)
    X = df1.to_numpy()
    # Decompose; np.linalg.svd returns V already transposed (V^T)
    U, D, V = np.linalg.svd(X)
    M, N = X.shape
    # Pad the diagonal of singular values to match the dimensions of X
    Sig = sp.linalg.diagsvd(D, M, N)
    # Now you can get X back:
    remakeX = np.dot(U, np.dot(Sig, V))
    assert np.abs(remakeX - X).sum() < 1e-5
    return df1, U, D, V, Sig, X, remakeX

# Output from SVD without removing means
df1, U, D, V, Sig, X, remakeX = runSVD(df_traits)

# Remove the average of each trait (column-wise means)
df1_means = df1.mean()
df1_normed = df1 - df1_means

# Output from SVD WITH removing means (metadata columns already dropped)
df2, U2, D2, V2, Sig2, X2, remakeX2 = runSVD(df1_normed, dropcols=[])

df2.to_json("July2021_normed_trait_df.json")
write_json(U2.tolist(), "July2021_SVD_normed_U.json")
write_json(D2.tolist(), "July2021_SVD_normed_D.json")
write_json(V2.tolist(), "July2021_SVD_normed_V.json")
write_json(Sig2.tolist(), "July2021_SVD_normed_Sig.json")
write_json(X2.tolist(), "July2021_SVD_normed_X.json")
write_json(remakeX2.tolist(), "July2021_SVD_normed_remakeX.json")
```
jwzimmer-zz commented 3 years ago

Sanity check

Since there are two dataframes that should differ only by their column headers (July2021_df_bap.json with the original BAP trait labels and July2021_df_traits.json with the anchor words), I can sanity-check that the SVD output is the same for each of them (SVD never sees the column headers).

When the code below is run, the assert statements pass (which is good). To rerun, use the script saved here: https://github.com/jwzimmer/tv-tropening/commit/d8d7c4c2c169ac911113e031f303c14386ca2d8a#diff-bb5be8dc2521f069449811f33a63824ae9dd7b3b0391c62d8fbdd7ab495809f8

How I sanity-checked:


```
import numpy as np
import pandas as pd
import scipy as sp
import scipy.linalg

df_bap = pd.read_json("July2021_df_bap.json")
df_traits = pd.read_json("July2021_df_traits.json")

clean_column_dict = get_json("July2021_cleaned_column_dict.json")

def runSVD(df1, dropcols=['unnamed.1', 'name', 'work'], n=None):
    # Same function as in the previous comments
    for x in dropcols:
        df1 = df1.drop(x, axis=1)
    if n is None:
        n = df1.shape[1] - 1  # (currently unused)
    X = df1.to_numpy()
    # Decompose; np.linalg.svd returns V already transposed (V^T)
    U, D, V = np.linalg.svd(X)
    M, N = X.shape
    # Pad the diagonal of singular values to match the dimensions of X
    Sig = sp.linalg.diagsvd(D, M, N)
    # Now you can get X back:
    remakeX = np.dot(U, np.dot(Sig, V))
    assert np.abs(remakeX - X).sum() < 1e-5
    return df1, U, D, V, Sig, X, remakeX

# Output from SVD without removing means
df1, U, D, V, Sig, X, remakeX = runSVD(df_traits)

# Remove the average of each trait
df1_means = df1.mean()
df1_normed = df1 - df1_means

# Output from SVD WITH removing means
df2, U2, D2, V2, Sig2, X2, remakeX2 = runSVD(df1_normed, dropcols=[])

# Repeat the whole process on the BAP df as a sanity check
df3, U3, D3, V3, Sig3, X3, remakeX3 = runSVD(df_bap)

# Remove the average of each trait
df3_means = df3.mean()
df3_normed = df3 - df3_means
df4, U4, D4, V4, Sig4, X4, remakeX4 = runSVD(df3_normed, dropcols=[])

# The two runs should agree (abs() so signed differences can't cancel)
assert np.abs(U4 - U2).sum() < 0.0001
assert np.abs(D4 - D2).sum() < 0.0001
assert np.abs(V4 - V2).sum() < 0.0001
assert np.abs(remakeX2 - remakeX4).sum() < 0.0001
```
jwzimmer-zz commented 3 years ago

Transposing the matrix

We've been using the matrix with the characters as rows and the traits as columns. While I'm redoing the SVD anyway, I made a version with the matrix transposed, so that the characters are the columns and the traits are the rows, just in case we want to do something with that.

The artifacts (outputs from SVD and transposed dataframes) and the script are saved in this commit: https://github.com/jwzimmer/tv-tropening/commit/2560f8741a91e990f711dc9bf767b2a2b35fe364
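A minimal sketch of what that transposed run looks like (illustrative variable names, not necessarily the ones in the commit; assumes df2 and runSVD from the comments above):

```
# Transpose so traits are rows and characters are columns
df2_T = df2.T
dfT, U_T, D_T, V_T, Sig_T, X_T, remakeX_T = runSVD(df2_T, dropcols=[])
```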

jwzimmer-zz commented 3 years ago

(In nextstep.py)

To get just the characters from a certain work, e.g. Pride and Prejudice (rows 259–268 in df_traits): df_traits.loc[df_traits["work"]=="Pride and Prejudice"]

The means for each trait have been removed from df2, as well as the "extra" columns (name, work, etc.), so we can select the matching rows by index: df2.iloc[259:269]

Then we can run SVD on a single work (not for any results per se, but to get a smaller artifact that's easier to understand as an intermediate step):

```
df2pp = df2.iloc[259:269]
df2pp, U2pp, D2pp, V2pp, Sig2pp, X2pp, remakeX2pp = runSVD(df2pp, dropcols=[])
```

To get a really small matrix to use as a toy model

Using just the rows corresponding to characters from Pride and Prejudice, we can see which traits contribute most by taking the absolute value of all scores and then summing per column: df2pp.abs().sum()

Then we can get the top n=15 traits with the largest sums using df2pp.abs().sum().nlargest(15). This gives us:

```
gossiping<->confidential     309.5000
judgemental<->accepting      289.6970
independent<->codependent    280.2340
scandalous<->proper          279.9235
selfish<->altruistic         279.9000
cunning<->honorable          277.7000
trash<->treasure             272.2000
young<->old                  271.2345
arrogant<->humble            270.3045
wholesome<->salacious        268.9000
sheltered<->street-smart     267.9560
rich<->poor                  267.3715
quarrelsome<->warm           263.9230
masculine<->feminine         263.3865
rude<->respectful            261.8985
```

We can put those traits into a list:

```
pplist = ['gossiping<->confidential',
 'judgemental<->accepting',
 'independent<->codependent',
 'scandalous<->proper',
 'selfish<->altruistic',
 'cunning<->honorable',
 'trash<->treasure',
 'young<->old',
 'arrogant<->humble',
 'wholesome<->salacious',
 'sheltered<->street-smart',
 'rich<->poor',
 'quarrelsome<->warm',
 'masculine<->feminine',
 'rude<->respectful']
```

Then df2pp[pplist] gives us a 10 character x 15 trait matrix with the means already removed (the means per trait over all 800 characters, not over these 10 characters specifically). That df is saved in this commit: https://github.com/jwzimmer/tv-tropening/commit/46383460b1728939c1163bc1afa664be7685c151

We can run SVD on this toy matrix: DF, u, d, v, sig, x, remake_x = runSVD(df2pp[pplist],dropcols=[])

This yields:

```
u=array([[-0.08026365,  0.28342991, -0.45526078, -0.0664045 ,  0.50874406,
        -0.33335   ,  0.18430594, -0.0057889 ,  0.44852539, -0.31053718],
       [-0.08396192,  0.15645385,  0.14053663, -0.70715547,  0.2271066 ,
        -0.2511955 , -0.02011789, -0.39023386, -0.36354705,  0.22163698],
       [ 0.25954173, -0.31034307, -0.35522931,  0.15628852, -0.04000741,
        -0.60984357, -0.04676285,  0.18937151, -0.12989473,  0.50722832],
       [ 0.40650185, -0.11318425,  0.36149316,  0.35004533,  0.29471143,
        -0.15955976, -0.32223942, -0.55967292,  0.18397752, -0.07417983],
       [-0.33411562, -0.286026  , -0.16265535,  0.38160353,  0.27866994,
         0.00105568,  0.28555704, -0.23232173, -0.59800975, -0.25236469],
       [-0.3715037 , -0.43635952, -0.10259882, -0.05714874,  0.22120278,
         0.34615965,  0.06656464, -0.232405  ,  0.42110127,  0.50324595],
       [ 0.34982716, -0.45739007,  0.21871478, -0.28809451,  0.48994575,
         0.15302199,  0.07844722,  0.48020423, -0.063472  , -0.18259064],
       [-0.41318364, -0.42683061, -0.00921618, -0.1974537 , -0.20780707,
        -0.29966454, -0.54353286,  0.03654642,  0.11218081, -0.4070519 ],
       [ 0.30147215, -0.34947551, -0.08825894, -0.23787739, -0.43740969,
        -0.079282  ,  0.55174215, -0.34925464,  0.17587574, -0.26012067],
       [ 0.34679823, -0.01795822, -0.65071337, -0.14527374,  0.00109131,
         0.43483593, -0.40942224, -0.1915533 , -0.18289531, -0.08956964]])
d=array([273.45005938, 157.11930121, 131.97187477,  81.04634391,
        70.76989585,  52.32877291,  32.73332522,  21.88512639,
        13.30087534,   7.50968549])
v=array([[-0.33985495, -0.29612671,  0.04303773, -0.19430756, -0.33997662,
        -0.32236686, -0.31278163,  0.1514476 , -0.32808836,  0.3097413 ,
         0.06290999,  0.12897509, -0.30557215, -0.04094033, -0.31257376],
       [ 0.30564872, -0.14496602, -0.52284552, -0.07679449,  0.09119113,
         0.02914019,  0.18982058,  0.19440876, -0.10905841, -0.00706552,
         0.48225663, -0.09168561, -0.16293022, -0.49084418, -0.05395979],
       [-0.02769439,  0.24078822,  0.03745725, -0.57672513,  0.02246244,
        -0.13091967,  0.07329089, -0.40214836,  0.10072198,  0.01570634,
         0.34182379,  0.52299393,  0.1187455 , -0.00601891,  0.08720821],
       [-0.08025258,  0.29517503,  0.37569212,  0.01795614, -0.05874584,
        -0.07298374, -0.30536557,  0.03718126,  0.07454569,  0.1948677 ,
         0.05683033, -0.24349984,  0.24578564, -0.67877057,  0.18537411],
       [ 0.1150476 , -0.10383944, -0.15196138, -0.0683828 , -0.08962658,
         0.12041794, -0.09952601, -0.78580216, -0.21228544, -0.0574279 ,
        -0.27112323, -0.29399761, -0.08990117, -0.19494395, -0.20951912],
       [-0.19131476,  0.38035801, -0.2932324 , -0.23653336, -0.07076555,
        -0.34197986,  0.10189438,  0.01179161,  0.09773657,  0.14230296,
         0.12327661, -0.62524381,  0.06220088,  0.32312917, -0.01013698],
       [-0.26004231, -0.22143183,  0.41447416, -0.47112382,  0.05220788,
         0.22821447,  0.45998814,  0.13790597, -0.03164187, -0.28475731,
        -0.04056089, -0.28285624, -0.13686034, -0.13176831, -0.06595748],
       [ 0.17358445,  0.27223799, -0.10482308, -0.2015069 ,  0.04509023,
         0.0990094 ,  0.29456974,  0.13610467, -0.03820593,  0.54587192,
        -0.54264616,  0.14601149, -0.28873467, -0.123614  ,  0.11472584],
       [ 0.62152663,  0.1666971 ,  0.36819711, -0.19534219, -0.01392521,
         0.19094972, -0.29921406,  0.0971171 ,  0.02161291,  0.06792114,
         0.21542289, -0.16411461, -0.24248052,  0.25467636, -0.27293712],
       [ 0.07840771, -0.42320563,  0.01940182, -0.16340421,  0.13880765,
        -0.06200609, -0.23698661, -0.1033672 ,  0.0482833 ,  0.13047996,
         0.05377752, -0.16884177, -0.23921517,  0.13632374,  0.75574347],
       [-0.0297133 ,  0.11938014,  0.33357639,  0.41703516,  0.06961158,
        -0.08548349,  0.41994792, -0.24354082, -0.44763044,  0.26838829,
         0.36398154, -0.03359123, -0.15364751,  0.09755973,  0.11274173],
       [ 0.21963318,  0.11255698,  0.07419373, -0.09983337,  0.42250793,
        -0.58157332, -0.06552255,  0.11094783, -0.41730944, -0.38047638,
        -0.25779172,  0.03189939,  0.00139478, -0.06049719,  0.0152392 ],
       [ 0.32877213, -0.37094867,  0.0552218 , -0.15893589, -0.18810026,
        -0.07471512,  0.18716202,  0.06698879, -0.21039446,  0.28156536,
        -0.06489546, -0.0591665 ,  0.71045387,  0.09594379, -0.01754533],
       [-0.15203352,  0.28918159, -0.18128653, -0.14865689, -0.12596379,
         0.49125909, -0.22555514,  0.15525241, -0.62053863, -0.1523707 ,
         0.03275004,  0.01742755,  0.13756812,  0.11159349,  0.25509985],
       [ 0.25201017,  0.11200719,  0.04172258,  0.06420225, -0.77477492,
        -0.21603036,  0.19136847,  0.02583857,  0.03211352, -0.34964351,
        -0.07113749,  0.0454031 , -0.16798456, -0.0573666 ,  0.26820543]])
sig=array([[273.45005938,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        , 157.11930121,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        , 131.97187477,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,  81.04634391,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
         70.76989585,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,  52.32877291,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,  32.73332522,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,  21.88512639,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
         13.30087534,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   7.50968549,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ]])
```

Note the dimensions: u is 10 x 10 (the number of characters/rows), v is 15 x 15 (the number of traits/columns), and sig is 10 x 15 with 10 non-zero diagonal values.
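A quick way to confirm those dimensions in code (assuming the outputs of the toy run above):

```
assert u.shape == (10, 10)    # one column per character
assert v.shape == (15, 15)    # one row per trait (v is really V^T)
assert sig.shape == (10, 15)  # 10 non-zero weights on the diagonal
assert len(d) == 10 and (d >= 0).all()
```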

jwzimmer-zz commented 3 years ago

An even smaller toy matrix

dfsmall = df2pp[pplist[:5]] gives us the 10 characters from Pride and Prejudice x the top 5 traits from the list above:

|     | gossiping<->confidential | judgemental<->accepting | independent<->codependent | scandalous<->proper | selfish<->altruistic |
|-----|-----|-----|-----|-----|-----|
| 259 | 32.5799 | -25.7742 | -24.1915 | 33.4618 | 8.39938 |
| 260 | 19.5799 | -19.5742 | -34.3915 | -4.43825 | 13.1994 |
| 261 | -32.6201 | -33.7742 | 39.7085 | 24.5618 | -27.2006 |
| 262 | -40.7201 | -16.1742 | 23.6085 | -39.6382 | -42.5006 |
| 263 | 9.27988 | 31.5258 | 28.8085 | 31.2618 | 22.9994 |
| 264 | 14.9799 | 38.1258 | 25.1085 | 25.6618 | 25.8994 |
| 265 | -50.4201 | -15.6742 | 26.0085 | -37.3382 | -40.3006 |
| 266 | 25.9799 | 39.4258 | 24.0085 | 40.6618 | 34.2994 |
| 267 | -50.4201 | -28.1742 | 39.6085 | -9.43825 | -28.8006 |
| 268 | -32.9201 | -41.4742 | -14.7915 | 33.4618 | -36.3006 |

We can then run SVD: dfs, u, d, v, sig, x, rex = runSVD(dfsmall, dropcols=[]).

To make the matrices easier to read, we can make dataframes from the arrays returned by runSVD, e.g. for the matrix U, dfu = pd.DataFrame.from_records(u). Then we can get a GitHub markdown table with print(dfu.to_markdown()):

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.174965 | 0.326296 | 0.429562 | 0.401283 | 0.442693 | 0.0303303 | 0.361543 | -0.103124 | 0.355525 | 0.229617 |
| 1 | -0.095925 | 0.408267 | 0.000175068 | 0.3096 | -0.458768 | 0.343455 | 0.0980603 | 0.353716 | -0.419526 | 0.2985 |
| 2 | 0.289264 | -0.254787 | 0.51681 | 0.402896 | 0.130014 | 0.0265614 | -0.150021 | -0.0534204 | -0.485991 | -0.377936 |
| 3 | 0.433256 | -0.107442 | -0.227372 | 0.0490339 | 0.522595 | 0.195355 | -0.316602 | 0.271603 | -0.0631081 | 0.507333 |
| 4 | -0.217048 | -0.425549 | 0.14686 | -0.0537437 | -0.104094 | -0.308657 | 0.15884 | -0.330346 | -0.33884 | 0.626152 |
| 5 | -0.256956 | -0.40636 | 0.0442409 | -0.0868463 | 0.0124899 | 0.840803 | 0.0552487 | -0.192599 | 0.107295 | -0.0240949 |
| 6 | 0.458136 | -0.157342 | -0.200635 | -0.0711835 | 0.0267243 | 0.0436022 | 0.830354 | 0.093688 | -0.0994286 | -0.0962616 |
| 7 | -0.358611 | -0.409782 | 0.174044 | -0.0122618 | 0.0631441 | -0.160897 | 0.0860923 | 0.786754 | 0.116035 | -0.0574674 |
| 8 | 0.412107 | -0.263198 | 0.149876 | 0.321032 | -0.530391 | -0.0257513 | -0.107634 | 0.00337257 | 0.554674 | 0.184753 |
| 9 | 0.259505 | 0.205468 | 0.616565 | -0.677998 | -0.0808574 | 0.107604 | -0.0126528 | 0.116771 | 0.00958342 | 0.136392 |

The matrix Sigma:

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 167.959 | 0 | 0 | 0 | 0 |
| 1 | 0 | 103.103 | 0 | 0 | 0 |
| 2 | 0 | 0 | 85.5536 | 0 | 0 |
| 3 | 0 | 0 | 0 | 26.6988 | 0 |
| 4 | 0 | 0 | 0 | 0 | 14.3684 |
| 5 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 |

The matrix V:

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -0.608822 | -0.421068 | 0.192499 | -0.332048 | -0.55202 |
| 1 | 0.243136 | -0.482655 | -0.81904 | -0.17509 | -0.0802911 |
| 2 | -0.0560139 | -0.447946 | 0.0686642 | 0.883551 | -0.104065 |
| 3 | 0.434501 | -0.603099 | 0.512022 | -0.279439 | 0.327457 |
| 4 | 0.615055 | 0.159256 | 0.158861 | 0.0184046 | -0.755493 |
jwzimmer-zz commented 3 years ago

Continuing with the above toy model, just trying to understand the SVD...

If you dot U with Sigma, you get a 10 x 5 matrix, which is the first 5 columns of U each multiplied by the corresponding weight from Sigma: column 1 of U is multiplied by 167.959 (weight 1 from Sigma), column 2 of U by 103.103 (weight 2 from Sigma), etc.

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -29.387 | 33.642 | 36.7505 | 10.7138 | 6.36082 |
| 1 | -16.1115 | 42.0935 | 0.0149777 | 8.26593 | -6.59179 |
| 2 | 48.5845 | -26.2692 | 44.215 | 10.7568 | 1.86811 |
| 3 | 72.7694 | -11.0776 | -19.4525 | 1.30915 | 7.50888 |
| 4 | -36.4552 | -43.8753 | 12.5644 | -1.43489 | -1.49566 |
| 5 | -43.1581 | -41.8969 | 3.78496 | -2.31869 | 0.179461 |
| 6 | 76.9483 | -16.2224 | -17.1651 | -1.90051 | 0.383987 |
| 7 | -60.232 | -42.2497 | 14.8901 | -0.327374 | 0.907282 |
| 8 | 69.2172 | -27.1365 | 12.8224 | 8.57115 | -7.62089 |
| 9 | 43.5864 | 21.1843 | 52.7493 | -18.1017 | -1.1618 |
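That column-scaling claim can be checked numerically (using u, d, and sig from the toy run above):

```
u_sig = np.dot(u, sig)  # 10 x 5
# Column j of the product is column j of u scaled by weight d[j]
assert np.allclose(u_sig, u[:, :5] * d)
```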

If you dot Sigma with V, you get a 10 x 5 matrix in which the first row is the first row of V multiplied by the first weight in Sigma, the second row is the second row of V multiplied by the second weight, etc.; the remaining rows are all zero.

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -102.257 | -70.7223 | 32.332 | -55.7706 | -92.7169 |
| 1 | 25.068 | -49.7631 | -84.4453 | -18.0523 | -8.27824 |
| 2 | -4.79219 | -38.3234 | 5.87447 | 75.591 | -8.90313 |
| 3 | 11.6006 | -16.102 | 13.6704 | -7.46067 | 8.74271 |
| 4 | 8.83739 | 2.28826 | 2.28258 | 0.264445 | -10.8553 |
| 5 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 |
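And the matching check for the row-scaling claim (again with d, v, and sig from the toy run):

```
sig_v = np.dot(sig, v)  # 10 x 5; rows 5-9 are all zero
# Row i of the product (for i < 5) is row i of v scaled by weight d[i]
assert np.allclose(sig_v[:5], v * d[:, None])
```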

Note: V is actually V^T; it has already been transposed when it is returned by np.linalg.svd. We know this is the case because you do not need to transpose it in order to get back your original data matrix (remakeX = np.dot(U, np.dot(Sig, V)) in the runSVD function).

The matrix product of Sigma dot V is what we will dot with U in order to get back our original data matrix. So the weighted rows of V and the columns of U are what describe our original matrix.

To approximate our original matrix:

We can tune how good an approximation we want by choosing how many non-zero weights to keep in Sigma. Since the weights are in descending order of importance, let's say we don't want to use all 5 rows of V in reconstructing our matrix; let's use just the first three. We can keep only the first 3 weights of Sigma, and therefore the first 3 rows of V, like this:

(Screenshot: constructing newsig, a version of Sigma with only the first three weights kept.)
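A plausible reconstruction of the step in that screenshot (an assumption; newsig matches the name used in the next line):

```
# Keep only the first 3 singular values of Sigma; zero out the rest
newsig = sig.copy()
newsig[3:, :] = 0
```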

We can then approximate our original data matrix with newx = np.dot(u, np.dot(newsig, v)):

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 24.0125 | -20.3258 | -30.6877 | 36.3385 | 9.69663 |
| 1 | 20.0426 | -13.5393 | -37.5767 | -2.00711 | 5.51259 |
| 2 | -38.443 | -27.5843 | 33.904 | 27.5332 | -29.3117 |
| 3 | -45.9073 | -16.5805 | 21.7453 | -39.4106 | -37.2564 |
| 4 | 10.8233 | 30.8986 | 29.7808 | 30.8883 | 22.3393 |
| 5 | 15.877 | 36.6988 | 26.2672 | 25.0105 | 26.7942 |
| 6 | -49.8305 | -16.8816 | 26.9206 | -37.8764 | -39.3882 |
| 7 | 25.5641 | 39.0838 | 24.032 | 40.5536 | 35.092 |
| 8 | -49.457 | -21.7913 | 36.4305 | -6.90288 | -37.3648 |
| 9 | -24.3404 | -52.2064 | -5.33847 | 28.4248 | -31.2508 |

This isn't a great approximation of our original data, but it isn't totally insane looking... As a sanity check, the approximation should improve if we use 4 weights instead of 3, so let's check that that actually happens:

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 28.6676 | -26.7872 | -25.202 | 33.3447 | 13.2049 |
| 1 | 23.6342 | -18.5245 | -33.3443 | -4.31693 | 8.21933 |
| 2 | -33.7691 | -34.0718 | 39.4117 | 24.5274 | -25.7893 |
| 3 | -45.3385 | -17.3701 | 22.4156 | -39.7764 | -36.8277 |
| 4 | 10.1998 | 31.7639 | 29.0461 | 31.2893 | 21.8694 |
| 5 | 14.8695 | 38.0972 | 25.08 | 25.6584 | 26.035 |
| 6 | -50.6563 | -15.7354 | 25.9475 | -37.3453 | -40.0105 |
| 7 | 25.4218 | 39.2813 | 23.8644 | 40.6451 | 34.9848 |
| 8 | -45.7329 | -26.9606 | 40.8192 | -9.29799 | -34.5582 |
| 9 | -32.2056 | -41.2892 | -14.6069 | 33.4831 | -37.1784 |

And this approximation is indeed closer to the original matrix (dfsmall) that we started out with earlier. Great!
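One way to quantify "closer" (an illustrative check; newx3 and newx4 are hypothetical names for the rank-3 and rank-4 reconstructions above):

```
x = dfsmall.to_numpy()
err3 = np.abs(newx3 - x).sum()  # total absolute error of the rank-3 approximation
err4 = np.abs(newx4 - x).sum()  # total absolute error of the rank-4 approximation
assert err4 < err3              # more weights kept => smaller error
```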

So now we should get into the details of what happens when we take the dot product of U with the matrix we get from dotting Sigma with V. The number of non-zero weights we keep in Sigma determines how many rows of V are used to approximate our original matrix. When you take the dot product of U with this other matrix, SigmadotV, each row of U is combined with each column of SigmadotV; since those columns only have as many non-zero values as we've chosen to keep, the last few entries of EVERY row of U are multiplied by 0 (and disregarded). Therefore, the last few COLUMNS of U have no impact on the values in our final approximation. So when we approximate our original matrix using U, Sigma, and V, we combine the first N weights of Sigma, the first N rows of V, and the first N columns of U into the final result.

The product of our new Sigma (only 3 weights) dotted with V gives us a 10 x 5 matrix containing the first 3 rows of V weighted by the corresponding weight in Sigma:

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -102.257 | -70.7223 | 32.332 | -55.7706 | -92.7169 |
| 1 | 25.068 | -49.7631 | -84.4453 | -18.0523 | -8.27824 |
| 2 | -4.79219 | -38.3234 | 5.87447 | 75.591 | -8.90313 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 |

So, when we approximate with N dimensions, we use the first N columns of U, the first N weights of Sigma, and the first N rows of what I've been calling V (which is really V^T, so these are the first N columns of the true V).
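All of which can be packaged into a small helper (a sketch, not code from the repo; it assumes U, D, V as returned by np.linalg.svd, with V already transposed):

```
def approx_rank_n(U, D, V, n):
    # First n columns of U, scaled by the first n weights,
    # dotted with the first n rows of V^T
    return np.dot(U[:, :n] * D[:n], V[:n, :])
```

For the toy matrix, approx_rank_n(u, d, v, 3) reproduces the rank-3 approximation above.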

jwzimmer-zz commented 3 years ago

Interpretation

The columns of U must be "eigencharacters" in terms of the fictional characters; the rows of V must be "eigentraits" in terms of fictional traits. That is the only way I can understand the dimensions of the relevant objects.

Therefore, what I want to look at is the characters that comprise the first few columns of U as linear combinations, and the traits that comprise the first few rows of V as linear combinations.

Which traits are most important to each "dimension"? Those will be the traits with the most extreme weights in each ROW of V. Which characters best exemplify each "dimension"? Those will be the characters with the most extreme weights in each COLUMN of U. How much more important is the first "dimension" compared to the second? That is given by the corresponding WEIGHT in Sigma.
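A sketch of that lookup (assuming v is V^T and u is U from the runs above; trait_names and character_names are hypothetical lists matching the columns and rows of the data matrix):

```
def most_extreme(weights, names, k=5):
    # Indices of the k entries with the largest absolute weight
    idx = np.argsort(np.abs(weights))[::-1][:k]
    return [(names[i], float(weights[i])) for i in idx]

# Traits most important to the first "dimension":
print(most_extreme(v[0], trait_names))
# Characters that best exemplify the first "dimension":
print(most_extreme(u[:, 0], character_names))
```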