maciejkula / glove-python

Toy Python implementation of http://www-nlp.stanford.edu/projects/glove/
Apache License 2.0
1.25k stars 319 forks source link

Index overflow with large matrix? (ValueError: negative dimensions are not allowed) #28

Closed dreamflasher closed 8 years ago

dreamflasher commented 9 years ago

For a cooc matrix with dimensionality 1.5 million * 1.5 million I get the following error:

  File "build/bdist.linux-x86_64/egg/glove/corpus.py", line 70, in fit
    max_map_size)
  File "glove/corpus_cython.pyx", line 279, in glove.corpus_cython.construct_cooccurrence_matrix (glove/corpus_cython.cpp:3465)
  File "glove/corpus_cython.pyx", line 145, in glove.corpus_cython.Matrix.to_coo (glove/corpus_cython.cpp:2290)
ValueError: negative dimensions are not allowed

It's a bit weird cause the dimensions use integer and int.max >> (1.5 mio)^2.

maciejkula commented 9 years ago

Do you know how many non-zero elements you have? It could be that you are overflowing the 32 bit index. On 6 Apr 2015 06:45, "DreamFlasher" notifications@github.com wrote:

For a cooc matrix with dimensionality 1.5 million * 1.5 million I get the following error:

File "build/bdist.linux-x86_64/egg/glove/corpus.py", line 70, in fit max_map_size) File "glove/corpus_cython.pyx", line 279, in glove.corpus_cython.construct_cooccurrence_matrix (glove/corpus_cython.cpp:3465) File "glove/corpus_cython.pyx", line 145, in glove.corpus_cython.Matrix.to_coo (glove/corpus_cython.cpp:2290)ValueError: negative dimensions are not allowed

— Reply to this email directly or view it on GitHub https://github.com/maciejkula/glove-python/issues/28.

dreamflasher commented 9 years ago

47 million, so that should be alright?

maciejkula commented 9 years ago

Hmmm. ((15*10**6)**2) > 2**31, so maybe it'll break if you are on a 32-bit architecture?

maciejkula commented 9 years ago

In particular, yo would get this error if no_collocations = self.size() overflows. Can you check the C size of your ints?

dreamflasher commented 9 years ago

I am on a 64 bit architecture. How do I check the C size of my ints?

maciejkula commented 9 years ago

Try putting print sizeof(int) and print no_collocations after no_collocations = self.size() is called in the Cython file, then re-install the package. That should print it when running.

dreamflasher commented 9 years ago

Ok this is interesting, for sizeof(int) I get "4". Why is this 4 and not 8 on a 64bit machine? Are there some compiler options for cython? (I use anaconda 64 bit) self.size() is 1281524832

dreamflasher commented 9 years ago

Awesome we are getting there, I subsampled before when I increase the sampling rate I now get: 4 -565999539 So yeah I guess I have a nnz overflow. I'll have a look if I can easily change the code to longs or get cython for 64bit.

dreamflasher commented 9 years ago

Changed it to long, long has 8 byte on this machine, but still overflow :/

maciejkula commented 9 years ago

Did you also change the function cdef int size(self): (and the variables within it) to long?

dreamflasher commented 9 years ago

yes

maciejkula commented 9 years ago

no_collocations on line 142 should then be a positive number?

dreamflasher commented 9 years ago

it should but isn't.

maciejkula commented 9 years ago

Can you push your modifications to a branch somewhere so that I could have a look?

dreamflasher commented 9 years ago

Thank you very much, for having a look! Let me post the changes here, I really only changed the two ints by longs:

cdef int size(self):
    """
    Get number of nonzero entries.
    """

    cdef int i
    cdef long size = 0

    for i in range(self.rows.size()):
        size += self.rows[i].size()
        size += self.row_indices[i].size()

        if self.rows[i].size() < 0:
            print self.rows[i].size()
        if self.row_indices[i].size() < 0:
            print self.row_indices[i].size()

    return size

cpdef to_coo(self, int shape):
    """
    Convert to a shape by shape COO matrix.
    """

    cdef int i, j
    cdef int row
    cdef int col
    cdef int rows = self.rows.size()
    cdef long no_collocations

    # Transform all row maps to row arrays.
    for i in range(rows):
        self.compactify_row(i)

    no_collocations = self.size()
    print sizeof(int)
    print sizeof(long)
    print no_collocations
maciejkula commented 9 years ago

That looks fine. Can you run this python -c 'import sys;print("%s" % sys.maxsize, sys.maxsize > 2**32)' and paste the answer?

dreamflasher commented 9 years ago
('9223372036854775807', True)

It's strange. I don't believe it, let me compile and run it again. There shouldn't be an overflow.

maciejkula commented 9 years ago

Ah. Can you change the function signature to cdef long size(self)?

dreamflasher commented 9 years ago

Ahh fail, thank you!

maciejkula commented 9 years ago

Let me know if it works!

dreamflasher commented 9 years ago

yes, it will take a while, but I will let you know.

dreamflasher commented 9 years ago

Okay this part works now (no_collocations=3067526914). but now I get zsh: segmentation fault Which is most probably due to OOM. Any ideas?

maciejkula commented 9 years ago

At least we are getting somewhere.

I have this PR which (1) should be more memory-efficient and (2) can use memory-mapped arrays: https://github.com/maciejkula/glove-python/pull/27

Do you want to have a look? You'd have to modify it again to use long (or long long) for the sizes.

dreamflasher commented 9 years ago

This is an awesome PR! I will check it out. I was also thinking about using a CSR matrix instead of COO as it should decrease memory usage (and it's similar to the data structure you already use anyways).

maciejkula commented 9 years ago

Actually, it still uses COO as that is much more convenient for model fitting. But it should be more memory efficient.