david-cortes opened this issue 6 years ago
I'm mostly offline until next week but in the spirit of providing a quick response: have you tried integers for the preference observations (i.e., in train/validation/test.tsv)? If that's not it, I'll have a better look once I'm back.
On Sun, Jul 29, 2018 at 2:23 PM david-cortes notifications@github.com wrote:
I've been trying to run this software on an artificially generated dataset, and I constantly run out of memory (`bad_alloc`), even with small datasets.
As an example, I generated the following random data in a Python script:
```python
import numpy as np, pandas as pd
from scipy.sparse import coo_matrix
from sklearn.model_selection import train_test_split

nusers = 200
nitems = 300
ntopics = 30
nwords = 250

np.random.seed(1)
a = .3 + np.random.gamma(.1, .05)
b = .3 + np.random.gamma(.1, .05)
c = .3 + np.random.gamma(.1, .05)
d = .3 + np.random.gamma(.1, .05)
e = .3 + np.random.gamma(.1, .05)
f = .5 + np.random.gamma(.1, .05)
g = .3 + np.random.gamma(.1, .05)
h = .5 + np.random.gamma(.1, .05)

np.random.seed(1)
Beta = np.random.gamma(a, b, size=(nwords, ntopics))
Theta = np.random.gamma(c, d, size=(nitems, ntopics))
W = np.random.poisson(Theta.dot(Beta.T) + np.random.gamma(1, 1, size=(nitems, nwords)),
                      size=(nitems, nwords))

Eta = np.random.gamma(e, f, size=(nusers, ntopics))
Epsilon = np.random.gamma(g, h, size=(nitems, ntopics))
R = np.random.poisson(Eta.dot(Theta.T + Epsilon.T) + np.random.gamma(1, 1, size=(nusers, nitems)),
                      size=(nusers, nitems))

Rcoo = coo_matrix(R)
df = pd.DataFrame({
    'UserId': Rcoo.row,
    'ItemId': Rcoo.col,
    'Count': Rcoo.data
})

df_train, df_test = train_test_split(df, test_size=0.3, random_state=1)
df_test, df_val = train_test_split(df_test, test_size=0.33, random_state=2)

df_train.sort_values(['UserId', 'ItemId'], inplace=True)
df_test.sort_values(['UserId', 'ItemId'], inplace=True)
df_val.sort_values(['UserId', 'ItemId'], inplace=True)

df_train['Count'] = df_train.Count.values.astype('float32')
df_test['Count'] = df_test.Count.values.astype('float32')
df_val['Count'] = df_val.Count.values.astype('float32')

# ".../" below stands for the output directory, which was lost in the original post.
df_train.to_csv(".../train.tsv", sep='\t', index=False, header=False)
df_test.to_csv(".../test.tsv", sep='\t', index=False, header=False)
df_val.to_csv(".../validation.tsv", sep='\t', index=False, header=False)
pd.DataFrame({"UserId": list(set(list(df_test.UserId.values)))})\
    .to_csv(".../test_users.tsv", index=False, header=False)

Wcoo = coo_matrix(W)
Wdf = pd.DataFrame({
    'ItemId': Wcoo.row,
    'WordId': Wcoo.col,
    'Count': Wcoo.data
})

def mix(a, b):
    nx = len(a)
    out = str(nx) + " "
    for i in range(nx):
        out += str(a[i]) + ":" + str(float(b[i])) + " "
    return out

Wdf.groupby('ItemId').agg(lambda x: tuple(x)).apply(lambda x: mix(x['WordId'], x['Count']), axis=1)\
    .to_frame().to_csv(".../mult.dat", index=False, header=False)
pd.DataFrame({'col1': np.arange(nwords)}).to_csv(".../vocab.dat", index=False, header=False)
```

Generating files that look as follows:
- train.tsv:

```
0 0 4.0
0 1 6.0
0 5 5.0
0 7 5.0
0 9 2.0
0 10 5.0
```

- test.tsv:

```
0 2 1.0
0 4 4.0
0 12 4.0
0 14 3.0
0 16 4.0
```

- validation.tsv:

```
0 23 5.0
0 30 3.0
0 32 1.0
0 33 2.0
0 46 3.0
```

- test_users.tsv:

```
0
1
2
3
4
```

- vocab.dat:

```
0
1
2
3
4
5
```

- mult.dat:

```
141 0:2.0 1:4.0 2:1.0 3:2.0 5:1.0 6:2.0 9:2.0 11:1.0 15:2.0 16:3.0 17:4.0 19:3.0 21:1.0 22:4.0 23:1.0 24:3.0 26:1.0 27:1.0 29:1.0 32:3.0 33:2.0 34:1.0 35:2.0 36:1.0 39:2.0 41:1.0 42:6.0 44:1.0 45:2.0 47:1.0 48:1.0 53:5.0 54:2.0 57:1.0 63:6.0 65:1.0 66:2.0 67:1.0 68:1.0 69:1.0 72:1.0 73:1.0 76:5.0 78:1.0 79:5.0 80:1.0 83:2.0 84:3.0 86:1.0 88:5.0 89:1.0 90:4.0 92:1.0 93:2.0 94:1.0 96:2.0 98:1.0 100:4.0 107:2.0 108:1.0 109:2.0 112:2.0 113:4.0 116:1.0 119:1.0 120:2.0 124:3.0 125:7.0 129:2.0 130:1.0 132:3.0 136:1.0 137:1.0 138:3.0 139:2.0 140:1.0 143:4.0 144:2.0 145:2.0 146:10.0 148:2.0 149:2.0 150:1.0 152:4.0 155:6.0 156:2.0 157:3.0 159:2.0 161:4.0 162:1.0 163:2.0 170:1.0 171:1.0 173:3.0 174:4.0 175:3.0 176:1.0 177:1.0 180:2.0 183:1.0 185:1.0 186:2.0 187:4.0 189:1.0 190:2.0 194:1.0 196:2.0 197:2.0 198:2.0 199:4.0 200:3.0 202:2.0 204:1.0 205:1.0 206:1.0 208:1.0 209:1.0 210:3.0 212:2.0 214:1.0 217:1.0 218:1.0 219:2.0 220:1.0 221:1.0 223:2.0 226:1.0 227:1.0 228:1.0 231:1.0 232:4.0 233:4.0 235:1.0 236:2.0 238:3.0 239:1.0 242:1.0 243:1.0 246:4.0 248:2.0 249:2.0
156 1:1.0 2:1.0 3:3.0 5:2.0 7:1.0 8:1.0 9:1.0 10:1.0 13:1.0 15:2.0 17:1.0 19:2.0 21:3.0 22:3.0 23:2.0 24:1.0 26:1.0 27:1.0 28:1.0 31:1.0 33:1.0 34:5.0 36:2.0 38:1.0 39:4.0 40:1.0 41:1.0 42:1.0 43:4.0 44:2.0 46:2.0 47:3.0 50:1.0 52:1.0 53:3.0 54:2.0 56:2.0 57:1.0 58:4.0 59:2.0 60:3.0 63:1.0 66:1.0 67:2.0 69:2.0 74:2.0 75:2.0 77:1.0 78:3.0 79:1.0 81:3.0 82:2.0 83:1.0 84:3.0 85:2.0 86:3.0 88:2.0 89:3.0 92:1.0 94:1.0 96:1.0 97:2.0 98:1.0 99:3.0 100:1.0 101:2.0 103:1.0 104:1.0 106:3.0 110:1.0 113:1.0 115:1.0 118:2.0 120:4.0 121:3.0 122:1.0 123:3.0 128:1.0 133:3.0 135:1.0 137:1.0 138:2.0 139:2.0 141:1.0 143:2.0 147:1.0 148:2.0 149:1.0 151:1.0 154:1.0 155:4.0 157:1.0 158:1.0 160:4.0 161:2.0 162:5.0 163:1.0 164:5.0 165:1.0 166:1.0 167:4.0 168:3.0 170:1.0 172:1.0 175:1.0 177:1.0 180:4.0 181:1.0 183:1.0 184:1.0 186:1.0 187:1.0 189:1.0 190:5.0 193:2.0 194:3.0 195:7.0 197:2.0 198:2.0 200:1.0 201:1.0 202:2.0 207:2.0 208:2.0 209:1.0 210:3.0 212:8.0 213:2.0 214:2.0 216:1.0 217:1.0 218:1.0 220:4.0 222:1.0 223:1.0 224:2.0 225:4.0 226:1.0 227:1.0 228:6.0 229:3.0 230:1.0 231:1.0 232:1.0 236:2.0 237:1.0 238:2.0 240:2.0 242:1.0 243:2.0 244:2.0 245:2.0 246:3.0 247:6.0 248:2.0 249:2.0
```
(I tried both integers and decimals for the values in this last file, but it didn't make a difference.)
As far as I can tell, these fit the description of the files on the main page.
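For reference, here is a quick sanity check of the generated files (a minimal sketch; it assumes the files sit in the current directory and uses the column names from the script above):

```python
import pandas as pd

# The rating files should be 3 tab-separated columns with no header.
ratings = pd.read_csv("train.tsv", sep="\t", header=None,
                      names=["UserId", "ItemId", "Count"])
assert ratings.shape[1] == 3
assert (ratings["Count"] > 0).all()

# Each mult.dat line should start with the number of word:count pairs that follow.
with open("mult.dat") as f:
    for line in f:
        fields = line.split()
        assert int(fields[0]) == len(fields) - 1
```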
However, after trying to run this program on this data (with and without the last two arguments):

```
collabtm -dir ~/ -nusers 200 -ndocs 300 -nvocab 250 -k 20 -fixeda -lda-init
```

it starts allocating a lot of memory, up to around 8GB, after which it throws `bad_alloc` and terminates.
Am I missing something?
Yes, I tried it that way too, but then I get the following error message:
```
collabtm: matrix.hh:1166: void D2Array<T>::load(std::__cxx11::string, uint32_t, bool) const [with T = double; std::__cxx11::string = std::__cxx11::basic_string<char>; uint32_t = unsigned int]: Assertion `f' failed.
Aborted
```
I took a better look.
The dataset you were generating was too small (it created problems when splitting off the cold-start documents). While this is a limitation of the code, I'm not sure it's worth fixing, since this is a toy case at best.
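If it helps, the simplest workaround is just to scale up the generator. The values below are purely illustrative (they are not from this thread); anything well above the original 200/300/30/250 should avoid the splitting problem:

```python
# Illustrative larger dimensions for the generation script above
# (these exact values are an assumption, not taken from the thread).
nusers = 2000
nitems = 3000
ntopics = 50
nwords = 1000
```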
The other change is to output integers instead of decimal values, i.e. replace this:

```python
df_train['Count'] = df_train.Count.values.astype('float32')
df_test['Count'] = df_test.Count.values.astype('float32')
df_val['Count'] = df_val.Count.values.astype('float32')
```

with this:

```python
df_train['Count'] = df_train.Count.values.astype('int32')
df_test['Count'] = df_test.Count.values.astype('int32')
df_val['Count'] = df_val.Count.values.astype('int32')
```
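With that change the Count column has an integer dtype, so the tab-separated files contain values like `4` rather than `4.0`. A quick way to confirm (a small sketch, reusing `df_train` from the script above):

```python
# After the cast, Count should be an integer column, so to_csv writes "4", not "4.0".
assert str(df_train['Count'].dtype) == 'int32'
print(df_train.head().to_string(index=False))
```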
I hope that helps.
Best,
Laurent
After trying with larger datasets, it runs the inference procedure and what I assume is the precision-metric computation, but it still fails in the cold-start part at the end:
```
coldstart local inference and HOL
collabtm: matrix.hh:1166: void D2Array<T>::load(std::__cxx11::string, uint32_t, bool) const [with T = double; std::__cxx11::string = std::__cxx11::basic_string<char>; uint32_t = unsigned int]: Assertion `f' failed.
Aborted
```
By default, the code needs access to the LDA fits for cold-start inference. The log file infer.log (in the experiment's folder) has details about exactly what it's looking for.
Alternatively, you could try commenting out line 2227 in collabtm.cc. It should then simply re-use the topics learned from the other documents during the run.
Hope that helps!