premgopalan / collabtm

Collaborative Topic Modeling. P. Gopalan, L. Charlin, D.M. Blei, Content-based recommendations with Poisson factorization, NIPS 2014.
GNU General Public License v3.0

Bad_alloc with even the smallest datasets #1

Open david-cortes opened 6 years ago

david-cortes commented 6 years ago

I've been trying to run this software on an artificially generated dataset, and it constantly runs out of memory (bad_alloc), even on small datasets.

As an example, I generated the following random data in a Python script:

import numpy as np, pandas as pd
from scipy.sparse import coo_matrix
from sklearn.model_selection import train_test_split

nusers = 200
nitems = 300
ntopics = 30
nwords = 250

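# Random shape/scale hyperparameters for the gamma draws below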
np.random.seed(1)
a=.3 + np.random.gamma(.1, .05)
b=.3 + np.random.gamma(.1, .05)
c=.3 + np.random.gamma(.1, .05)
d=.3 + np.random.gamma(.1, .05)
e=.3 + np.random.gamma(.1, .05)
f=.5 + np.random.gamma(.1, .05)
g=.3 + np.random.gamma(.1, .05)
h=.5 + np.random.gamma(.1, .05)

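# Simulate document-word counts W from a Poisson topic model (plus extra gamma noise)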
np.random.seed(1)
Beta = np.random.gamma(a, b, size=(nwords, ntopics))
Theta = np.random.gamma(c, d, size=(nitems, ntopics))
W = np.random.poisson(Theta.dot(Beta.T) + np.random.gamma(1, 1, size=(nitems, nwords)), size=(nitems, nwords))

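# Simulate user-item counts R from user preferences (Eta) and item factors (Theta + Epsilon)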
Eta = np.random.gamma(e, f, size=(nusers, ntopics))
Epsilon = np.random.gamma(g, h, size=(nitems, ntopics))
R = np.random.poisson(Eta.dot(Theta.T+Epsilon.T) + np.random.gamma(1, 1, size=(nusers, nitems)), size=(nusers, nitems))

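# Keep only the non-zero counts as (UserId, ItemId, Count) triplets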
Rcoo=coo_matrix(R)
df = pd.DataFrame({
    'UserId':Rcoo.row,
    'ItemId':Rcoo.col,
    'Count':Rcoo.data
})

df_train, df_test = train_test_split(df, test_size=0.3, random_state=1)
df_test, df_val = train_test_split(df_test, test_size=0.33, random_state=2)

df_train.sort_values(['UserId', 'ItemId'], inplace=True)
df_test.sort_values(['UserId', 'ItemId'], inplace=True)
df_val.sort_values(['UserId', 'ItemId'], inplace=True)

df_train['Count'] = df_train.Count.values.astype('float32')
df_test['Count'] = df_test.Count.values.astype('float32')
df_val['Count'] = df_val.Count.values.astype('float32')

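# Write the splits as headerless, tab-separated files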
df_train.to_csv("<dir>/train.tsv", sep='\t', index=False, header=False)
df_test.to_csv("<dir>/test.tsv", sep='\t', index=False, header=False)
df_val.to_csv("<dir>/validation.tsv", sep='\t', index=False, header=False)
pd.DataFrame({"UserId":list(set(list(df_test.UserId.values)))})\
.to_csv("<dir>/test_users.tsv", index=False, header=False)

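# Build mult.dat in LDA-C style: "n_entries wordid:count wordid:count ..." per item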
Wcoo = coo_matrix(W)
Wdf = pd.DataFrame({
    'ItemId':Wcoo.row,
    'WordId':Wcoo.col,
    'Count':Wcoo.data
})
def mix(a, b):
    nx = len(a)
    out=str(nx) + " "
    for i in range(nx):
        out += str(a[i]) + ":" + str(float(b[i])) + " "
    return out
Wdf.groupby('ItemId').agg(lambda x: tuple(x)).apply(lambda x: mix(x['WordId'], x['Count']), axis=1)\
.to_frame().to_csv("<dir>/mult.dat", index=False, header=False)

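# vocab.dat: one placeholder word id per line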
pd.DataFrame({'col1':np.arange(nwords)}).to_csv("<dir>/vocab.dat", index=False, header=False)

Generating files that look as follows:

  • train.tsv:

0 0 4.0
0 1 6.0
0 5 5.0
0 7 5.0
0 9 2.0
0 10 5.0

  • test.tsv:

0 2 1.0
0 4 4.0
0 12 4.0
0 14 3.0
0 16 4.0

  • validation.tsv:

0 23 5.0
0 30 3.0
0 32 1.0
0 33 2.0
0 46 3.0

  • test_users.tsv:

0
1
2
3
4

  • vocab.dat:

0
1
2
3
4
5

  • mult.dat:

141 0:2.0 1:4.0 2:1.0 3:2.0 5:1.0 6:2.0 9:2.0 11:1.0 15:2.0 16:3.0 17:4.0 19:3.0 21:1.0 22:4.0 23:1.0 24:3.0 26:1.0 27:1.0 29:1.0 32:3.0 33:2.0 34:1.0 35:2.0 36:1.0 39:2.0 41:1.0 42:6.0 44:1.0 45:2.0 47:1.0 48:1.0 53:5.0 54:2.0 57:1.0 63:6.0 65:1.0 66:2.0 67:1.0 68:1.0 69:1.0 72:1.0 73:1.0 76:5.0 78:1.0 79:5.0 80:1.0 83:2.0 84:3.0 86:1.0 88:5.0 89:1.0 90:4.0 92:1.0 93:2.0 94:1.0 96:2.0 98:1.0 100:4.0 107:2.0 108:1.0 109:2.0 112:2.0 113:4.0 116:1.0 119:1.0 120:2.0 124:3.0 125:7.0 129:2.0 130:1.0 132:3.0 136:1.0 137:1.0 138:3.0 139:2.0 140:1.0 143:4.0 144:2.0 145:2.0 146:10.0 148:2.0 149:2.0 150:1.0 152:4.0 155:6.0 156:2.0 157:3.0 159:2.0 161:4.0 162:1.0 163:2.0 170:1.0 171:1.0 173:3.0 174:4.0 175:3.0 176:1.0 177:1.0 180:2.0 183:1.0 185:1.0 186:2.0 187:4.0 189:1.0 190:2.0 194:1.0 196:2.0 197:2.0 198:2.0 199:4.0 200:3.0 202:2.0 204:1.0 205:1.0 206:1.0 208:1.0 209:1.0 210:3.0 212:2.0 214:1.0 217:1.0 218:1.0 219:2.0 220:1.0 221:1.0 223:2.0 226:1.0 227:1.0 228:1.0 231:1.0 232:4.0 233:4.0 235:1.0 236:2.0 238:3.0 239:1.0 242:1.0 243:1.0 246:4.0 248:2.0 249:2.0
156 1:1.0 2:1.0 3:3.0 5:2.0 7:1.0 8:1.0 9:1.0 10:1.0 13:1.0 15:2.0 17:1.0 19:2.0 21:3.0 22:3.0 23:2.0 24:1.0 26:1.0 27:1.0 28:1.0 31:1.0 33:1.0 34:5.0 36:2.0 38:1.0 39:4.0 40:1.0 41:1.0 42:1.0 43:4.0 44:2.0 46:2.0 47:3.0 50:1.0 52:1.0 53:3.0 54:2.0 56:2.0 57:1.0 58:4.0 59:2.0 60:3.0 63:1.0 66:1.0 67:2.0 69:2.0 74:2.0 75:2.0 77:1.0 78:3.0 79:1.0 81:3.0 82:2.0 83:1.0 84:3.0 85:2.0 86:3.0 88:2.0 89:3.0 92:1.0 94:1.0 96:1.0 97:2.0 98:1.0 99:3.0 100:1.0 101:2.0 103:1.0 104:1.0 106:3.0 110:1.0 113:1.0 115:1.0 118:2.0 120:4.0 121:3.0 122:1.0 123:3.0 128:1.0 133:3.0 135:1.0 137:1.0 138:2.0 139:2.0 141:1.0 143:2.0 147:1.0 148:2.0 149:1.0 151:1.0 154:1.0 155:4.0 157:1.0 158:1.0 160:4.0 161:2.0 162:5.0 163:1.0 164:5.0 165:1.0 166:1.0 167:4.0 168:3.0 170:1.0 172:1.0 175:1.0 177:1.0 180:4.0 181:1.0 183:1.0 184:1.0 186:1.0 187:1.0 189:1.0 190:5.0 193:2.0 194:3.0 195:7.0 197:2.0 198:2.0 200:1.0 201:1.0 202:2.0 207:2.0 208:2.0 209:1.0 210:3.0 212:8.0 213:2.0 214:2.0 216:1.0 217:1.0 218:1.0 220:4.0 222:1.0 223:1.0 224:2.0 225:4.0 226:1.0 227:1.0 228:6.0 229:3.0 230:1.0 231:1.0 232:1.0 236:2.0 237:1.0 238:2.0 240:2.0 242:1.0 243:2.0 244:2.0 245:2.0 246:3.0 247:6.0 248:2.0 249:2.0

(I tried varying between integers and decimals for the values in this last one, but it didn't make a difference.)

I think these fit the description of the files on the main page.

However, when I try to run the program on this data (with and without the last two arguments):

collabtm -dir ~/<dir> -nusers 200 -ndocs 300 -nvocab 250 -k 20 -fixeda -lda-init

It allocates more and more memory until it reaches around 8 GB, at which point it throws bad_alloc and terminates.

Am I missing something?

lcharlin commented 6 years ago

I'm mostly offline until next week but in the spirit of providing a quick response: have you tried integers for the preference observations (i.e., in train/validation/test.tsv)? If that's not it, I'll have a better look once I'm back.


david-cortes commented 6 years ago

Yes, I tried it that way too, but then I get the following error message:

collabtm: matrix.hh:1166: void D2Array<T>::load(std::__cxx11::string, uint32_t, bool) const [with T = double; std::__cxx11::string = std::__cxx11::basic_string<char>; uint32_t = unsigned int]: Assertion `f' failed.
Aborted
lcharlin commented 6 years ago

I took a better look. The problem is that the counts are written as floats; they need to be integers. Replace

df_train['Count'] = df_train.Count.values.astype('float32')
df_test['Count'] = df_test.Count.values.astype('float32')
df_val['Count'] = df_val.Count.values.astype('float32')

with

df_train['Count'] = df_train.Count.values.astype('int32')
df_test['Count'] = df_test.Count.values.astype('int32')
df_val['Count'] = df_val.Count.values.astype('int32')

I hope that helps.

Best,
Laurent

david-cortes commented 6 years ago

After trying with larger datasets, it runs the inference procedure and what I guess is the computation of precision metrics, but it still fails in the cold-start part at the end:

coldstart local inference and HOL
collabtm: matrix.hh:1166: void D2Array<T>::load(std::__cxx11::string, uint32_t, bool) const [with T = double; std::__cxx11::string = std::__cxx11::basic_string<char>; uint32_t = unsigned int]: Assertion `f' failed.
Aborted
lcharlin commented 6 years ago

By default the code needs access to the LDA fits for cold-start inference. The log file infer.log (in the experiment's folder) has details about exactly what it's looking for.

Alternatively, you could try commenting out line 2227 in collabtm.cc. Then it should simply re-use the topics learned during the run using the other documents.

Hope that helps!
