piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

BUG: word2vec skip-gram model won't work with NumPy arrays #3394

Open isimsizolan opened 2 years ago

isimsizolan commented 2 years ago

Problem description

I have a language with 240 distinct words. Because each word fits in 1 byte, I map each word to a byte and store sentences in NumPy uint8 arrays to minimize the memory footprint. Doing this significantly reduces memory consumption. However, due to "gensim\models\word2vec_inner.pyx", line 542, NumPy arrays can't be used and the call throws: "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()".

The related line checks whether the sentence is empty, but it does so as "if not sent:". The more generic check "if len(sent) == 0:" would fix the problem.
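The failure mode is easy to see in isolation (a minimal sketch, independent of gensim): calling `bool()` on a multi-element NumPy array raises, while `len()` behaves identically for lists and 1-D arrays:

```python
import numpy as np

sent = np.array([22, 33, 44], dtype=np.uint8)

# "if not sent:" implicitly calls bool() on the array; NumPy refuses
# to pick a truth value for an array with more than one element.
try:
    if not sent:
        pass
except ValueError as e:
    print(e)  # The truth value of an array with more than one element is ambiguous...

# len() is type-agnostic: same semantics for lists and 1-D arrays.
print(len(sent) == 0)                        # False
print(len(np.array([], dtype=np.uint8)) == 0)  # True
```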

The workaround is casting the NumPy array to a Python list. However, this significantly increases the memory footprint and is a time-consuming operation on big datasets.


Steps/code/corpus to reproduce

reproduce:

import numpy as np
import gensim

class SentenceIterator:
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        for sentence in self.dataset:
            yield sentence

data = []
data.append(np.array([22,33,44,55,1,2,3,5,4,100]))
data.append(np.array([100,100,100,100,11]))

sentences = SentenceIterator(data)
model = gensim.models.Word2Vec(sentences, vector_size=32, window=3, workers=4, sg=1, negative=10)

PS: casting the np.array to a Python list fixes the issue, but the cast is very slow on big datasets and significantly increases the memory footprint.

Workaround:
class SentenceIterator:
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        for sentence in self.dataset:
            yield sentence.tolist()
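A variant of this workaround (a sketch of mine, not from the original report) keeps the compact uint8 storage and converts ids to string tokens lazily, one sentence at a time, which satisfies Word2Vec's lists-of-strings interface without materialising the whole corpus up front. The `id2tok` lookup table is a hypothetical name; with at most 256 distinct words, `chr(i)` gives a unique one-character token per id:

```python
import numpy as np

# Hypothetical lookup table: each uint8 id maps to a one-character token.
id2tok = [chr(i) for i in range(256)]

class SentenceIterator:
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        for sentence in self.dataset:
            # Convert the compact uint8 array to string tokens on the fly,
            # so only one sentence at a time exists as a Python list.
            yield [id2tok[i] for i in sentence]
```

Peak memory stays close to the uint8 corpus size, at the cost of a per-epoch conversion overhead instead of a one-time cast.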

Possible fix

Changing the "if not sent:" checks to "if len(sent) == 0:".

Versions

Python 3.9.13 (main, Oct 13 2022, 21:23:06) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32

import platform; print(platform.platform())
Windows-10-10.0.19044-SP0
import sys; print("Python", sys.version)
Python 3.9.13 (main, Oct 13 2022, 21:23:06) [MSC v.1916 64 bit (AMD64)]
import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
import numpy; print("NumPy", numpy.__version__)
NumPy 1.23.4
import scipy; print("SciPy", scipy.__version__)
SciPy 1.9.3
import gensim; print("gensim", gensim.__version__)
gensim 4.2.0
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0

gojomo commented 2 years ago

My discussion of the specific user issue on the discussion group: https://groups.google.com/g/gensim/c/1N46yGjvu6w/m/6Cev5FW3CgAJ

This isn't really a 'bug', as the interface has never aspired to accepting texts as anything other than lists-of-strings.

As a feature-request, I'm unsure it'd be worth the extra complication to accept more kinds of text-representations, from a corpus-iterable. It'd need more discussion.

I've sometimes thought we could benefit from a refactoring with a clear internal boundary: an alternate entry point where training takes vocab indexes rather than lists-of-strings. That could be convenient for advanced users, or for some advanced modes, that want to provide their own (perhaps precalculated & memory-mapped once) token-to-index lookup. But that wouldn't necessarily take things like arbitrary arrays/ndarrays at the corpus-iterable level.