tapilab / ctrosset


Compute cross-validation mean-squared error for gender prediction #7

Open aronwc opened 10 years ago

aronwc commented 10 years ago

in progress. see loadMatrices.py

cyril94440 commented 10 years ago

I tried to compute cross-validation with X_train.shape of (79, 75050) and Y_train.shape of (79, 2), but it seems that Y should not have shape (79, 2): "ValueError: bad input shape (79, 2)". I don't understand why, since I have to predict 2 criteria (Male and Female in this case).

aronwc commented 10 years ago

If you put all the pkl files on the server, I may be able to debug this more easily (also the latest version of the sql DB).

cyril94440 commented 10 years ago

The pkl files are in my home directory on tapi. It seems that a library is missing on the server.

aronwc commented 10 years ago

Change SVC (which is a classifier, not a regression algorithm) to Ridge, imported with: from sklearn.linear_model import Ridge
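As a rough illustration of the difference, the sketch below fits Ridge on random data shaped like the arrays in the error report (the feature count is reduced to keep it fast). The data and the alpha value are made up; only the shapes mirror the thread. Ridge accepts a two-column target and predicts both columns at once, whereas SVC expects a one-dimensional vector of class labels.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X_train = rng.rand(79, 500)    # the thread used (79, 75050); reduced here for speed
Y_train = rng.rand(79, 2)      # two continuous targets, e.g. %male and %female

model = Ridge(alpha=1.0)       # alpha is an arbitrary illustrative value
model.fit(X_train, Y_train)    # a 2-D target is accepted by Ridge
preds = model.predict(X_train)
print(preds.shape)             # (79, 2)
```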

aronwc commented 10 years ago

Trying to compute the average mean-squared error of 10-fold cross validation:

import numpy as np
from sklearn.cross_validation import cross_val_score as cv
from sklearn.cross_validation import KFold
# m is the fitted model type (Ridge); Y is sparse, so take column 0 as a flat dense array
yy = np.asarray(Y[:,0].todense()).ravel()
np.mean(cv(m, X, yy, cv=KFold(n=X.shape[0], n_folds=10), scoring='mean_squared_error'))

Why is this negative?

cyril94440 commented 10 years ago

I found this: https://github.com/scikit-learn/scikit-learn/issues/2439

which is an open issue in the scikit-learn repository.

It seems that it is just because the regression is performing poorly.
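On the sign itself: scikit-learn scorers follow a higher-is-better convention, so error metrics such as MSE are reported negated by cross_val_score. A negative value is therefore expected and does not by itself say that the model is poor. The sketch below uses the current API (sklearn.model_selection and the 'neg_mean_squared_error' scoring name), which differs from the sklearn.cross_validation module used in this thread; the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(80, 20)                          # synthetic features
y = X @ rng.rand(20) + 0.1 * rng.randn(80)    # synthetic continuous target

scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         cv=KFold(n_splits=10),
                         scoring='neg_mean_squared_error')
mse = -np.mean(scores)    # flip the sign to recover the usual, non-negative MSE
print(scores)
print(mse)
```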

aronwc commented 10 years ago

Write the cross validation for loop using k-fold:

>>> X = np.array([[1,2,3], [4,5,6], [7,8,9], [10,11,12]])
>>> y = np.array([.1, .2, .3, .4])
>>> cv = cross_validation.KFold(len(y), n_folds=2, random_state=1234)
>>> for train, test in cv:
...     print 'train indices:', train
...     print 'test indices:', test
...     # m.fit(X[train], y[train])
...     # preds = m.predict(X[test])
...
train indices: [2 3]
test indices: [0 1]
train indices: [0 1]
test indices: [2 3]
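For readers on a recent scikit-learn, the same loop can be sketched with sklearn.model_selection.KFold, where the fold count is n_splits and the indices come from split(). This is an assumed modern equivalent, not the code that was run in the thread; without shuffling, the folds are contiguous blocks.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
y = np.array([.1, .2, .3, .4])

folds = []
for train, test in KFold(n_splits=2).split(X):
    print('train indices:', train)
    print('test indices:', test)
    folds.append((train.tolist(), test.tolist()))
    # m.fit(X[train], y[train])
    # preds = m.predict(X[test])
```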

cyril94440 commented 10 years ago

I've done this, but I am not sure what I should output next.

Should it be a scatter plot of true %Male against predicted %Male, using m.predict(X) on the full data?

Also, n_folds=2 seems a bit low. Should I try 10 or 100?

Thank you

aronwc commented 10 years ago

Each point on the scatter plot should have x=%male predicted, y=true %male for one company.

Try n_folds=10.

cyril94440 commented 10 years ago

This is what I get:

[Screenshot 2014-04-17 at 16.46.18]

I never use the preds variable in the for loop, so maybe I am missing something.

aronwc commented 10 years ago

Code isn't quite right. Predictions should be stored after each .fit call.

Try something like this:

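from sklearn import cross_validation
from matplotlib.pyplot import plot

cv = cross_validation.KFold(len(Yd), n_folds=10, random_state=1234)
predicted_values = []
true_values = []
for train, test in cv:
    print 'train indices:', train
    print 'test indices:', test
    clf.fit(Xd[train], Yd[train])     # fit on the training folds only
    preds = clf.predict(Xd[test])     # predict the held-out fold
    predicted_values.extend(preds)
    true_values.extend(Yd[test])

# x = predicted value, y = true value, one point per company
plot(predicted_values, true_values, '.')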
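
In recent scikit-learn versions, the bookkeeping in this loop can also be delegated to cross_val_predict, which returns one out-of-fold prediction per sample in the original row order. The sketch below is an alternative under that assumption, with synthetic stand-ins for Xd and Yd and with Ridge standing in for clf.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.RandomState(1234)
Xd = rng.rand(100, 30)                           # synthetic stand-in for the features
Yd = Xd @ rng.rand(30) + 0.1 * rng.randn(100)    # synthetic stand-in for the target

cv = KFold(n_splits=10, shuffle=True, random_state=1234)
# predicted_values[i] comes from a model that never saw sample i during fitting
predicted_values = cross_val_predict(Ridge(alpha=1.0), Xd, Yd, cv=cv)
true_values = Yd
print(predicted_values.shape)
```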

cyril94440 commented 10 years ago

Here are the wonderful results ...

[Screenshot 2014-04-17 at 17.38.01]

aronwc commented 10 years ago

Much more like it!

This makes it pretty clear we should remove four outliers: the two with the highest % true male and the two with the lowest % true male.
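
One way to drop those four points is to sort by the true value and slice off both ends. This is a sketch with made-up numbers; true_values and predicted_values are assumed to be aligned one-to-one per company.

```python
import numpy as np

true_values = np.array([52., 48., 95., 50., 3., 55., 98., 45., 5., 60.])
predicted_values = np.array([50., 49., 60., 51., 40., 53., 62., 47., 42., 56.])

order = np.argsort(true_values)    # indices from lowest to highest true value
keep = np.sort(order[2:-2])        # drop the 2 lowest and 2 highest, keep original order
true_kept = true_values[keep]
pred_kept = predicted_values[keep]
print(true_kept)
print(pred_kept)
```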

Please also compute the correlation of predicted/truth, which you can do like so:

import scipy.stats as scistat
corr = scistat.pearsonr(predicted_values, true_values)

corr is a tuple; the first element is the correlation coefficient (close to 1 is good) and the second element is the p-value (lower is better).
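
A small check on made-up, nearly linear data shows how the two elements are read when the result is unpacked.

```python
import numpy as np
import scipy.stats as scistat

predicted_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up predictions
true_values = np.array([1.1, 1.9, 3.2, 3.8, 5.1])        # made-up ground truth

# Unpack the result: correlation coefficient first, p-value second
r, p_value = scistat.pearsonr(predicted_values, true_values)
print(r, p_value)
```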

cyril94440 commented 10 years ago

(0.31346599890683069, 0.00025233074927076831) without removing the 4 companies.

cyril94440 commented 10 years ago

After removing the 4 companies:

[Screenshot 2014-04-23 at 15.43.23]

corr = (0.47207203378278723, 1.8513629914893548e-08)

cyril94440 commented 10 years ago

Age prediction: [Screenshot 2014-04-23 at 19.16.11]

cyril94440 commented 10 years ago

With the legend:

[Screenshot 2014-04-23 at 19.30.25]

aronwc commented 10 years ago

Repeat for age/income and include in the report.

cyril94440 commented 10 years ago

Age prediction:

[Screenshot 2014-05-02 at 14.11.04]

Corr score for each range:

Column 0 Corr score : (0.84041720755541804, 1.4647775535410573e-07)
Column 1 Corr score : (0.78171257894513435, 3.9472794342958712e-06)
Column 2 Corr score : (0.82364140352887472, 4.237192662956088e-07)
Column 3 Corr score : (0.77557280152381003, 5.2555027984410807e-06)
Column 4 Corr score : (0.74903241695616929, 1.6477979211241716e-05)
Column 5 Corr score : (0.84991297279603095, 7.5982162212348953e-08)
Column 6 Corr score : (0.79632055128906887, 1.9232464816732786e-06)
Column 7 Corr score : (0.91987964792819499, 7.9471738426185753e-11)
Column 8 Corr score : (0.78411409856952641, 3.520453176836077e-06)

Here are the top 10 weights:

Column 0 TOP 10 twitter IDs : [u'AmericanDadFOX', u'justinbieber', u'LilTunechi', u'TheSimpsons', u'SethMacFarlane', u'ArianaGrande', u'VictoriasSecret', u'CHANEL', u'katyperry', u'onedirection']
Column 1 TOP 10 twitter IDs : [u'LilTunechi', u'KimKardashian', u'ConanOBrien', u'justinbieber', u'Eminem', u'Drake', u'blakeshelton', u'ActuallyNPH', u'ArianaGrande', u'MonsterEnergy']
Column 2 TOP 10 twitter IDs : [u'danawhite', u'ufc', u'prattprattpratt', u'Nick_Offerman', u'azizansari', u'evilhag', u'espn', u'SportsCenter', u'TheRock', u'mradamscott']
Column 3 TOP 10 twitter IDs : [u'Photoshop', u'WIRED', u'TED_TALKS', u'Illustrator', u'NatGeo', u'danawhite', u'DalaiLama', u'nytimes', u'BBCBreaking', u'NASA']
Column 4 TOP 10 twitter IDs : [u'WIRED', u'BillGates', u'TechCrunch', u'YourAnonNews', u'mashable', u'wikileaks', u'google', u'ggreenwald', u'lifehacker', u'ericschmidt']
Column 5 TOP 10 twitter IDs : [u'BillGates', u'WIRED', u'TechCrunch', u'TED_TALKS', u'mashable', u'Forbes', u'google', u'nytimes', u'lifehacker', u'TheEconomist']
Column 6 TOP 10 twitter IDs : [u'nranews', u'NatGeo', u'SmithWessonCorp', u'NASA', u'GLOCKInc', u'FoxNews', u'NRA_Rifleman', u'TED_TALKS', u'BillGates', u'Forbes']
Column 7 TOP 10 twitter IDs : [u'nranews', u'SmithWessonCorp', u'NRA_Rifleman', u'GLOCKInc', u'RemingtonArms', u'ColtFirearms', u'NRAblog', u'Beretta_USA', u'GunOwners', u'FoxNews']
Column 8 TOP 10 twitter IDs : [u'nranews', u'SmithWessonCorp', u'GLOCKInc', u'FoxNews', u'NRA_Rifleman', u'RemingtonArms', u'ColtFirearms', u'Beretta_USA', u'NRAblog', u'Mountsplus']
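
The thread does not show how these lists were produced. One plausible way, sketched here as an assumption, is to rank the Ridge coefficients of each output column and look up the matching feature names. The feature_names array and the random data are hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
feature_names = np.array(['id_%d' % i for i in range(50)])   # hypothetical Twitter IDs
X = rng.rand(40, 50)                                         # synthetic features
Y = rng.rand(40, 3)                                          # 3 synthetic target columns

model = Ridge(alpha=1.0).fit(X, Y)
top_k = 10
top_ids = []
for col, weights in enumerate(model.coef_):      # coef_ has shape (n_targets, n_features)
    top = np.argsort(weights)[::-1][:top_k]      # indices of the largest weights first
    top_ids.append(feature_names[top].tolist())
    print('Column', col, 'TOP 10 twitter IDs :', top_ids[-1])
```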

cyril94440 commented 10 years ago

Income prediction:

[Screenshot 2014-05-02 at 14.35.14]

Corr score for each range:

Column 0 Corr score : (0.84316668000893014, 1.2166322949965415e-07)
Column 1 Corr score : (0.77445396564303948, 5.5316536748780668e-06)
Column 2 Corr score : (0.56913064642971378, 0.0029865286058942259)
Column 3 Corr score : (0.77736220447765658, 4.8392983863695777e-06)
Column 4 Corr score : (0.83469924417414731, 2.1315818969727176e-07)
Column 5 Corr score : (0.8197818572683927, 5.3264225012538284e-07)
Column 6 Corr score : (0.66026862761626615, 0.0003284530062277595)
Column 7 Corr score : (0.69176385952564012, 0.00012797928403042552)

Top 10 weights:

Column 0 TOP 10 twitter IDs : [u'justinbieber', u'AmericanDadFOX', u'katyperry', u'ArianaGrande', u'YouTube', u'CHANEL', u'VictoriasSecret', u'LilTunechi', u'ComedyCentral', u'selenagomez']
Column 1 TOP 10 twitter IDs : [u'KimKardashian', u'LilTunechi', u'ConanOBrien', u'justinbieber', u'Drake', u'Eminem', u'ArianaGrande', u'blakeshelton', u'KDTrey5', u'ActuallyNPH']
Column 2 TOP 10 twitter IDs : [u'prattprattpratt', u'danawhite', u'Nick_Offerman', u'azizansari', u'49ers', u'evilhag', u'ufc', u'mradamscott', u'iamrashidajones', u'SportsCenter']
Column 3 TOP 10 twitter IDs : [u'danawhite', u'ufc', u'espn', u'SportsCenter', u'TheRock', u'AdamSchefter', u'cnnbrk', u'prattprattpratt', u'joerogan', u'JonnyBones']
Column 4 TOP 10 twitter IDs : [u'BillGates', u'WIRED', u'TechCrunch', u'mashable', u'nytimes', u'Forbes', u'richardbranson', u'WSJ', u'BBCBreaking', u'cnnbrk']
Column 5 TOP 10 twitter IDs : [u'nranews', u'BillGates', u'SmithWessonCorp', u'WIRED', u'GLOCKInc', u'TechCrunch', u'NRA_Rifleman', u'FoxNews', u'RemingtonArms', u'ColtFirearms']
Column 6 TOP 10 twitter IDs : [u'BillGates', u'TechCrunch', u'ericschmidt', u'WIRED', u'mashable', u'elonmusk', u'google', u'jeffweiner', u'jack', u'dickc']
Column 7 TOP 10 twitter IDs : [u'BillGates', u'TechCrunch', u'ericschmidt', u'WIRED', u'mashable', u'elonmusk', u'google', u'jeffweiner', u'dickc', u'jack']

cyril94440 commented 10 years ago

I get an MSE of 157.565244494 for gender with SET 2 (demographics pro). Is this good? I have no idea.
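
An MSE is easier to judge next to a reference point. Taking the square root gives an error in the units of the target: sqrt(157.57) is about 12.55, which would be percentage points if the gender target is expressed as a percentage (an assumption; the thread does not state the units). It also helps to compare against a baseline that always predicts the mean of the true values, as in this sketch on made-up numbers.

```python
import numpy as np

true_values = np.array([40., 55., 60., 35., 70., 50.])        # made-up true % male
predicted_values = np.array([45., 50., 58., 42., 62., 49.])   # made-up predictions

mse = np.mean((predicted_values - true_values) ** 2)
rmse = np.sqrt(mse)                                            # error in target units
# Baseline: always predict the mean of the true values
baseline_mse = np.mean((true_values.mean() - true_values) ** 2)
r2 = 1.0 - mse / baseline_mse     # above 0 means the model beats the mean baseline
print(mse, rmse, baseline_mse, r2)
```

A model whose MSE is not clearly below the baseline MSE is doing little better than guessing the average.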