piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
15.68k stars 4.38k forks source link

Clean up keras code #2886

Open mpenkov opened 4 years ago

mpenkov commented 4 years ago

But looking more closely at both the tested method & test code:

This utility function doesn't really belong cluttering up the KeyedVectors class: it's a fairly trivial bit of data-translation, very specific to some narrow uses, with an extra package requirement. It could & should be a in a keras-specific submodule that's only imported by people using keras, and could take the KeyedVectors instance as an argument (rather than self).

Or, in the spirit of #2852 ("reduce gensim surface area"), it could be moved to a separate package (gensim-extras) - so it doesn't encumber testing/development of core gensim, or removed in favor of a tiny bit of example code, "here's how you convert gensim KeyedVectors into a keras Embedding", in docs, an example notebook, a 'recipes' webpage, or a blog post. It's 2-6 lines of trifling code, but its test failure due to a 'cosmic ray' from another project has soaked up a bunch of attention/time from 4 people who could be making other more important gensim improvements.

Looking at the test_keras_integration.py code, it also failed because it was insufficiently focused on the gensim functionality. It's not checking if the return value is just a translation of gensim values into a legal/sensible Embedding - it's doing some full-cycle Word2Vec training and some full-cycle keras training. It's one of the 20 slowest tests in our whole test suite – sure, only ~9 seconds, but of negligible value usually over the many hundreds/thousands of times it's triggered as a result of other changes, and of negative value in this case.

(Gensim has a bunch of tests that take a lot of time but don't test much real, just completion-without-error, and often which deeply test inner structural details rather than meaningful results, and thus serve as a drag on sensible refactorings when they trigger test failures that require maintenance without tangible benefits to developer or gensim end-users.)

To advance #2852, I'd copy the existing method code to a new wiki page on "using gensim with keras", leave a stub pointing users there, and remove the functionality/tests from gensim so there will be no further maintenance cost.

Originally posted by @gojomo in https://github.com/RaRe-Technologies/gensim/pull/2869#issuecomment-651354811

gojomo commented 4 years ago

Keras-related tests across multiple PRs have again started failing due to unrelated-to-the-PR upstream version changes. Here's an example of 2 tests currently failing on Python 3.7 & Python 3.8:

=================================== FAILURES ===================================
____________ TestKerasWord2VecWrapper.testEmbeddingLayer20NewsGroup ____________
self = <gensim.test.test_keras_integration.TestKerasWord2VecWrapper testMethod=testEmbeddingLayer20NewsGroup>
    def testEmbeddingLayer20NewsGroup(self):
        """
        Test Keras 'Embedding' layer returned by 'get_embedding_layer' function
        for a smaller version of the 20NewsGroup classification problem.
        """
        MAX_SEQUENCE_LENGTH = 1000

        # Prepare text samples and their labels

        # Processing text dataset
        texts = []  # list of text samples
        texts_w2v = []  # used to train the word embeddings
        labels = []  # list of label ids

        data = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'comp.graphics', 'sci.space'])
        for index in range(len(data)):
            label_id = data.target[index]
            file_data = data.data[index]
            i = file_data.find('\n\n')  # skip header
            if i > 0:
                file_data = file_data[i:]
            try:
                curr_str = str(file_data)
                sentence_list = curr_str.split('\n')
                for sentence in sentence_list:
                    sentence = (sentence.strip()).lower()
                    texts.append(sentence)
                    texts_w2v.append(sentence.split(' '))
                    labels.append(label_id)
            except Exception:
                pass

        # Vectorize the text samples into a 2D integer tensor
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(texts)
        sequences = tokenizer.texts_to_sequences(texts)

        # word_index = tokenizer.word_index
        data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
        labels = to_categorical(np.asarray(labels))

        x_train = data
        y_train = labels

        # prepare the embedding layer using the wrapper
        keras_w2v = self.model_twenty_ng
        keras_w2v.build_vocab(texts_w2v)
        keras_w2v.train(texts, total_examples=keras_w2v.corpus_count, epochs=keras_w2v.epochs)
        keras_w2v_wv = keras_w2v.wv
        embedding_layer = keras_w2v_wv.get_keras_embedding()

        # create a 1D convnet to solve our classification task
        sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
>       embedded_sequences = embedding_layer(sequence_input)
MAX_SEQUENCE_LENGTH = 1000
curr_str   = '\n\nIn article <bill.047m@xpresso.UUCP> bill@xpresso.UUCP (Bill Vance) writes:\n>It has been known for quite a while ... Henry Spencer @ U of Toronto Zoology\nbetween SVR3 and SunOS.    - Dick Dunn  |  henry@zoo.toronto.edu  utzoo!henry\n'
data       = array([[  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   ...,   5,  89, 360],
       [  0,   0,   0, ..., 366, 367,  42],
       [  0,   0,   0, ...,   0,   0,   0]], dtype=int32)
embedding_layer = <keras.layers.embeddings.Embedding object at 0x7fab732f1438>
file_data  = '\n\nIn article <bill.047m@xpresso.UUCP> bill@xpresso.UUCP (Bill Vance) writes:\n>It has been known for quite a while ... Henry Spencer @ U of Toronto Zoology\nbetween SVR3 and SunOS.    - Dick Dunn  |  henry@zoo.toronto.edu  utzoo!henry\n'
i          = 129
index      = 4
keras_w2v  = <gensim.models.word2vec.Word2Vec object at 0x7fab7329d978>
keras_w2v_wv = <gensim.models.keyedvectors.KeyedVectors object at 0x7fab7329d710>
label_id   = 2
labels     = array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0...      [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]], dtype=float32)
self       = <gensim.test.test_keras_integration.TestKerasWord2VecWrapper testMethod=testEmbeddingLayer20NewsGroup>
sentence   = ''
sentence_list = ['', '', 'In article <bill.047m@xpresso.UUCP> bill@xpresso.UUCP (Bill Vance) writes:', '>It has been known for quite a... Does anyone make a "globe" that is accurate', '>as to actual shape, landmass configuration/Long/Lat lines etc.?', ...]
sequence_input = <tf.Tensor 'input_1:0' shape=(None, 1000) dtype=int32>
sequences  = [[], [], [], [10, 90, 11, 12, 43, 91, ...], [13, 95, 96, 44, 97, 98, ...], [100, 5, 101, 102, 103, 4, ...], ...]
texts      = ['', '', '', 'i doubt there are good prospects for  a self armoring system', 'for venus surface conditions (several hundred degrees, very high', 'pressure of co2, possibly sulfuric and nitric acids or oxides', ...]
texts_w2v  = [[''], [''], [''], ['i', 'doubt', 'there', 'are', 'good', 'prospects', ...], ['for', 'venus', 'surface', 'conditions', '(several', 'hundred', ...], ['pressure', 'of', 'co2,', 'possibly', 'sulfuric', 'and', ...], ...]
tokenizer  = <keras_preprocessing.text.Tokenizer object at 0x7fabc4b36748>
x_train    = array([[  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   ...,   5,  89, 360],
       [  0,   0,   0, ..., 366, 367,  42],
       [  0,   0,   0, ...,   0,   0,   0]], dtype=int32)
y_train    = array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0...      [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]], dtype=float32)
gensim/test/test_keras_integration.py:128: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py37-linux/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:75: in symbolic_fn_wrapper
    return func(*args, **kwargs)
        args       = (<keras.layers.embeddings.Embedding object at 0x7fab732f1438>, <tf.Tensor 'input_1:0' shape=(None, 1000) dtype=int32>)
        func       = <function Layer.__call__ at 0x7fab7dbb5f28>
        kwargs     = {}
.tox/py37-linux/lib/python3.7/site-packages/keras/engine/base_layer.py:446: in __call__
    self.assert_input_compatibility(inputs)
        inputs     = <tf.Tensor 'input_1:0' shape=(None, 1000) dtype=int32>
        kwargs     = {}
        self       = <keras.layers.embeddings.Embedding object at 0x7fab732f1438>
.tox/py37-linux/lib/python3.7/site-packages/keras/engine/base_layer.py:310: in assert_input_compatibility
    K.is_keras_tensor(x)
        inputs     = [<tf.Tensor 'input_1:0' shape=(None, 1000) dtype=int32>]
        self       = <keras.layers.embeddings.Embedding object at 0x7fab732f1438>
        x          = <tf.Tensor 'input_1:0' shape=(None, 1000) dtype=int32>
.tox/py37-linux/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:695: in is_keras_tensor
    if not is_tensor(x):
        x          = <tf.Tensor 'input_1:0' shape=(None, 1000) dtype=int32>
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
x = <tf.Tensor 'input_1:0' shape=(None, 1000) dtype=int32>
    def is_tensor(x):
>       return isinstance(x, tf_ops._TensorLike) or tf_ops.is_dense_tensor_like(x)
E       AttributeError: module 'tensorflow.python.framework.ops' has no attribute '_TensorLike'
x          = <tf.Tensor 'input_1:0' shape=(None, 1000) dtype=int32>
.tox/py37-linux/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:703: AttributeError
_____________ TestKerasWord2VecWrapper.testEmbeddingLayerCosineSim _____________
self = <gensim.test.test_keras_integration.TestKerasWord2VecWrapper testMethod=testEmbeddingLayerCosineSim>
    def testEmbeddingLayerCosineSim(self):
        """
        Test Keras 'Embedding' layer returned by 'get_embedding_layer' function for a simple word similarity task.
        """
        keras_w2v_model = self.model_cos_sim
        keras_w2v_model_wv = keras_w2v_model.wv

        embedding_layer = keras_w2v_model_wv.get_keras_embedding()

        input_a = Input(shape=(1,), dtype='int32', name='input_a')
        input_b = Input(shape=(1,), dtype='int32', name='input_b')
>       embedding_a = embedding_layer(input_a)
embedding_layer = <keras.layers.embeddings.Embedding object at 0x7fab732c3710>
input_a    = <tf.Tensor 'input_a:0' shape=(None, 1) dtype=int32>
input_b    = <tf.Tensor 'input_b:0' shape=(None, 1) dtype=int32>
keras_w2v_model = <gensim.models.word2vec.Word2Vec object at 0x7fab732c3f98>
keras_w2v_model_wv = <gensim.models.keyedvectors.KeyedVectors object at 0x7fab732c3978>
self       = <gensim.test.test_keras_integration.TestKerasWord2VecWrapper testMethod=testEmbeddingLayerCosineSim>
gensim/test/test_keras_integration.py:58: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py37-linux/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:75: in symbolic_fn_wrapper
    return func(*args, **kwargs)
        args       = (<keras.layers.embeddings.Embedding object at 0x7fab732c3710>, <tf.Tensor 'input_a:0' shape=(None, 1) dtype=int32>)
        func       = <function Layer.__call__ at 0x7fab7dbb5f28>
        kwargs     = {}
.tox/py37-linux/lib/python3.7/site-packages/keras/engine/base_layer.py:446: in __call__
    self.assert_input_compatibility(inputs)
        inputs     = <tf.Tensor 'input_a:0' shape=(None, 1) dtype=int32>
        kwargs     = {}
        self       = <keras.layers.embeddings.Embedding object at 0x7fab732c3710>
.tox/py37-linux/lib/python3.7/site-packages/keras/engine/base_layer.py:310: in assert_input_compatibility
    K.is_keras_tensor(x)
        inputs     = [<tf.Tensor 'input_a:0' shape=(None, 1) dtype=int32>]
        self       = <keras.layers.embeddings.Embedding object at 0x7fab732c3710>
        x          = <tf.Tensor 'input_a:0' shape=(None, 1) dtype=int32>
.tox/py37-linux/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:695: in is_keras_tensor
    if not is_tensor(x):
        x          = <tf.Tensor 'input_a:0' shape=(None, 1) dtype=int32>
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
x = <tf.Tensor 'input_a:0' shape=(None, 1) dtype=int32>
    def is_tensor(x):
>       return isinstance(x, tf_ops._TensorLike) or tf_ops.is_dense_tensor_like(x)
E       AttributeError: module 'tensorflow.python.framework.ops' has no attribute '_TensorLike'
x          = <tf.Tensor 'input_a:0' shape=(None, 1) dtype=int32>
.tox/py37-linux/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:703: AttributeError

There's essentially just one line of substantive keras code in gensim:

https://github.com/RaRe-Technologies/gensim/blob/d5556ea2700333e07c8605385def94dd96fb2c94/gensim/models/keyedvectors.py#L1623-L1626

Moving that line & the code which tests/demos it to a Wiki page, or other demo/tutorial notebook outside of the main unit testing cycle, could save multiple other non-Keras-using gensim-contributors hours of frustration as they have to review/work-around errors unrelated to their PRs.

piskvorky commented 4 years ago

Keras tests should be already disabled, see #2908 .

gojomo commented 4 years ago

That's helpful! (My PRs based on code from 2 days ago don't yet add that disablement, but I'll bring it in.)

But, the case for removing this 1-line-of-library-code, & accompanying 100+ lines of overbroad & now repeatedly failure-prone testing code, remains strong. Moving it to a notebook demo would make #2908 ("re-enabling" the source of a recurring problem) moot, and end the low-value game of whack-a-mole against their API evolution.