tensorflow / transform


apply_vocabulary lookup table initialization needs to be wrapped inside `tf.init_scope` #249

Open EdwardCuiPeacock opened 3 years ago

EdwardCuiPeacock commented 3 years ago

We recently encountered scalability issues when trying to apply the vocabularies for multiple (5, to be exact) categorical features. We saw multiple lines of the following warning message:

WARNING:tensorflow:Tables initialized inside a tf.function will be re-initialized on every invocation of the function. This re-initialization can have significant impact on performance. Consider lifting them out of the graph context using `tf.init_scope`.

When using tft.apply_vocabulary, the job would get stuck on the transformation step for hours, consuming thousands of CPU hours if we did not kill it early.
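For reference, here is a minimal toy snippet (an assumed illustration, not our pipeline) that triggers the same warning by constructing a lookup table inside a tf.function:

import tensorflow as tf

@tf.function
def bad_lookup(tokens):
    # The table is constructed inside the tf.function's graph context, so TF
    # warns that it will be re-initialized on every invocation.
    initializer = tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(["a", "b"]),
        values=tf.constant([1, 2], dtype=tf.int64),
    )
    table = tf.lookup.StaticHashTable(initializer, default_value=0)
    return table.lookup(tokens)

bad_lookup(tf.constant(["a", "c"]))  # emits the warning on first trace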

Creating a custom lookup table initialization function like the following bypasses the problem; with it, 80M rows of data took only 35 min, consuming ~20 hours of CPU time.

import tensorflow as tf


def create_file_lookup(filename):
    """Builds the vocabulary lookup table, lifted out of the tf.function graph context."""
    with tf.init_scope():
        initializer = tf.lookup.TextFileInitializer(
            filename,
            key_dtype=tf.string,
            key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
            value_dtype=tf.int64,
            value_index=tf.lookup.TextFileIndex.LINE_NUMBER,
            value_index_offset=1,  # vocabulary ids start from 1; 0 is the default value
        )
        table = tf.lookup.StaticHashTable(initializer, default_value=0)
    return table
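For completeness, here is a minimal usage sketch (the path and encode function below are hypothetical and assume the vocabulary file already exists as a concrete path rather than a deferred tensor):

vocab_table = create_file_lookup("/path/to/vocab.txt")  # hypothetical path

@tf.function
def encode(tokens):
    # The table was created under tf.init_scope, so it is not re-initialized
    # on every invocation of this tf.function.
    return vocab_table.lookup(tokens)

ids = encode(tf.constant(["foo", "bar"]))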

Relevant code that needs to be addressed: https://github.com/tensorflow/transform/blob/520ebb492c2f687ff30cce22261938037384b26d/tensorflow_transform/mappers.py#L1114

This fix probably needs to be applied to TFT versions starting from 1.0.

cyc commented 3 years ago

@EdwardCuiPeacock I don't quite understand how you are getting that warning message; it appears as though tf.init_scope is already being used here: https://github.com/tensorflow/transform/blob/b06e87b74246d968bf63d41425a75936d55bb2d2/tensorflow_transform/tf_utils.py#L1578

varshaan commented 2 years ago

As Chris commented above, we do enter the tf.init_scope in tf_utils. Are you passing a lookup_fn to tft.apply_vocabulary? If so, you would need to lift the table creation yourself inside that lookup_fn, since TFT does not have access to the table-creation code to do this automatically; see the sketch below.
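For example, a minimal sketch (the lookup_fn signature and return value here are assumptions on my part, and the lift only helps when the vocab filename is a concrete string or eager tensor rather than a symbolic graph tensor):

def lifted_lookup_fn(x, deferred_vocab_filename_tensor):
    # Create the table inside tf.init_scope so it is lifted out of the
    # tf.function graph context and initialized only once.
    with tf.init_scope():
        initializer = tf.lookup.TextFileInitializer(
            deferred_vocab_filename_tensor,
            key_dtype=tf.string,
            key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
            value_dtype=tf.int64,
            value_index=tf.lookup.TextFileIndex.LINE_NUMBER,
        )
        table = tf.lookup.StaticHashTable(initializer, default_value=-1)
    # Assumed contract: return the looked-up ids along with the table size.
    return table.lookup(x), table.size()

ids = tft.apply_vocabulary(
    tokens, deferred_vocab_filename_tensor, lookup_fn=lifted_lookup_fn)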

Could you give me an example of what your calls to tft.vocabulary and tft.apply_vocabulary look like so I can take a look and see if we missed something?

gfkeith commented 2 years ago

I've encountered the same warning while using TFX. In my case the init_scope is not used because the check https://github.com/tensorflow/transform/blob/9b348c8862bfd8bfd462db596111103e58d15006/tensorflow_transform/tf_utils.py#L1654-L1655 fails, as asset_filepath is <class 'tensorflow.python.framework.ops.Tensor'>.

My usage is:

transformed = tft.compute_and_apply_vocabulary(
    input_tensor,
    frequency_threshold=100,
    num_oov_buckets=1,
    vocab_filename='tags'
)

It isn't a big issue for me as my vocabulary is small, so I haven't spent much time looking into it and don't have a proper MRE, but maybe that much is helpful.

gfkeith commented 2 years ago

I've just checked using this notebook: https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/components_keras.ipynb#scrollTo=jHfhth_GiZI9, and the warning appears there as well. It is possibly related to the preprocessing being wrapped with tf.function (which happens in TFX Transform), so vocab_filename is not an ops.EagerTensor. It gets wrapped into a tensor around here: https://github.com/tensorflow/transform/blob/520ebb492c2f687ff30cce22261938037384b26d/tensorflow_transform/analyzers.py#L1973, so it is no longer a string either.
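For illustration, a toy snippet (not TFT code) showing why the type differs: the same tf.constant is an EagerTensor when run eagerly but a symbolic graph Tensor while being traced inside a tf.function, which would make a check for an eager/string filename fail as described above.

import tensorflow as tf

path = tf.constant("vocab.txt")
print(type(path))  # EagerTensor when running eagerly

@tf.function
def traced():
    path_in_graph = tf.constant("vocab.txt")
    # During tracing this is a symbolic graph tensor (printed as
    # ops.Tensor in older TF releases), not an EagerTensor.
    print(type(path_in_graph))
    return path_in_graph

traced()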

IzakMaraisTAL commented 11 months ago

Similar to @gfkeith, I also get this warning when using tft.compute_and_apply_vocabulary() inside a TFX Transform component's preprocessing_fn.