Closed dreamflasher closed 8 years ago
Do you know how many non-zero elements you have? It could be that you are overflowing the 32 bit index. On 6 Apr 2015 06:45, "DreamFlasher" notifications@github.com wrote:
For a cooc matrix with dimensionality 1.5 million * 1.5 million I get the following error:
File "build/bdist.linux-x86_64/egg/glove/corpus.py", line 70, in fit max_map_size) File "glove/corpus_cython.pyx", line 279, in glove.corpus_cython.construct_cooccurrence_matrix (glove/corpus_cython.cpp:3465) File "glove/corpus_cython.pyx", line 145, in glove.corpus_cython.Matrix.to_coo (glove/corpus_cython.cpp:2290)ValueError: negative dimensions are not allowed
— Reply to this email directly or view it on GitHub https://github.com/maciejkula/glove-python/issues/28.
47 million, so that should be alright?
Hmmm. ((15*10**6)**2) > 2**31
, so maybe it'll break if you are on a 32-bit architecture?
In particular, yo would get this error if no_collocations = self.size()
overflows. Can you check the C size of your ints?
I am on a 64 bit architecture. How do I check the C size of my ints?
Try putting print sizeof(int)
and print no_collocations
after no_collocations = self.size()
is called in the Cython file, then re-install the package. That should print it when running.
Ok this is interesting, for sizeof(int) I get "4". Why is this 4 and not 8 on a 64bit machine? Are there some compiler options for cython? (I use anaconda 64 bit) self.size() is 1281524832
Awesome we are getting there, I subsampled before when I increase the sampling rate I now get: 4 -565999539 So yeah I guess I have a nnz overflow. I'll have a look if I can easily change the code to longs or get cython for 64bit.
Changed it to long, long has 8 byte on this machine, but still overflow :/
Did you also change the function cdef int size(self):
(and the variables within it) to long
?
yes
no_collocations
on line 142 should then be a positive number?
it should but isn't.
Can you push your modifications to a branch somewhere so that I could have a look?
Thank you very much, for having a look! Let me post the changes here, I really only changed the two ints by longs:
cdef int size(self):
"""
Get number of nonzero entries.
"""
cdef int i
cdef long size = 0
for i in range(self.rows.size()):
size += self.rows[i].size()
size += self.row_indices[i].size()
if self.rows[i].size() < 0:
print self.rows[i].size()
if self.row_indices[i].size() < 0:
print self.row_indices[i].size()
return size
cpdef to_coo(self, int shape):
"""
Convert to a shape by shape COO matrix.
"""
cdef int i, j
cdef int row
cdef int col
cdef int rows = self.rows.size()
cdef long no_collocations
# Transform all row maps to row arrays.
for i in range(rows):
self.compactify_row(i)
no_collocations = self.size()
print sizeof(int)
print sizeof(long)
print no_collocations
That looks fine. Can you run this python -c 'import sys;print("%s" % sys.maxsize, sys.maxsize > 2**32)'
and paste the answer?
('9223372036854775807', True)
It's strange. I don't believe it, let me compile and run it again. There shouldn't be an overflow.
Ah. Can you change the function signature to cdef long size(self)
?
Ahh fail, thank you!
Let me know if it works!
yes, it will take a while, but I will let you know.
Okay this part works now (no_collocations=3067526914). but now I get zsh: segmentation fault
Which is most probably due to OOM. Any ideas?
At least we are getting somewhere.
I have this PR which (1) should be more memory-efficient and (2) can use memory-mapped arrays: https://github.com/maciejkula/glove-python/pull/27
Do you want to have a look? You'd have to modify it again to use long
(or long long
) for the sizes.
This is an awesome PR! I will check it out. I was also thinking about using a CSR matrix instead of COO as it should decrease memory usage (and it's similar to the data structure you already use anyways).
Actually, it still uses COO as that is much more convenient for model fitting. But it should be more memory efficient.
For a cooc matrix with dimensionality 1.5 million * 1.5 million I get the following error:
It's a bit weird cause the dimensions use integer and int.max >> (1.5 mio)^2.