tensorflow / transform

Input pipeline framework

Allow SparseTensors to be SparseFeatures #182

Closed · ghost closed this issue 3 years ago

ghost commented 4 years ago

I would like to have a preprocessing function that maps a list of indices in a large feature space (input as a VarLenFeature) to SparseTensors in that space. For example:

def preprocessing_fn(inputs):
    outputs = {}

    full_dim = 70_000
    inp = inputs['col_ind_list']  # left-aligned SparseTensor from a VarLenFeature

    # Row index of each value within the batch.
    row_inds = inp.indices[:, 0]
    # The values themselves are the column indices in the full space.
    col_inds = inp.values
    inds = tf.stack([row_inds, col_inds], axis=1)

    values = tf.ones_like(inp.values, dtype=tf.float32)

    batch_size = inp.dense_shape[0]  # dense_shape is already int64
    dense_shape = [batch_size, full_dim]
    outputs['sparse_input'] = tf.SparseTensor(
        indices=inds, values=values, dense_shape=dense_shape)

    return outputs

[In practice, full_dim would be defined as tft.max(inputs['col_ind_list']).] Given a sparse input whose values represent indices in the [0, 70_000) space:

indices = [[0,0],[1,0],[1,1],[1,2],[1,3],[1,4],[1,5],[1,6],[1,7],[1,8],[1,9],[1,10],[2,0]]
values = tf.constant([56787,1773,3257,15147,18653,19355,19395,25733,33313,41131,51146,56224,10938],
                     dtype=tf.int64)
dense_shape = [3,11]
x = {'col_ind_list': tf.SparseTensor(indices=indices, values=values, dense_shape=dense_shape)}
tf.print(x['col_ind_list'])
'SparseTensor(indices=[[0 0]
 [1 0]
 [1 1]
 ...
 [1 9]
 [1 10]
 [2 0]], values=[56787 1773 3257 ... 51146 56224 10938], shape=[3 11])'

The preprocessing function maps this into the 70_000-dimensional space:

tf.print(preprocessing_fn(x)['sparse_input'])
'SparseTensor(indices=[[0 56787]
 [1 1773]
 [1 3257]
 ...
 [1 51146]
 [1 56224]
 [2 10938]], values=[1 1 1 ... 1 1 1], shape=[3 70000])'

However, using this preprocessing_fn in Transform fails with the following error:

ValueError: Encountered a SparseTensorValue that cannot be decoded by ListColumnRepresentation.

The source of the error makes clear that this is because the values are being mapped to a VarLenFeature rather than a SparseFeature: https://github.com/tensorflow/transform/blob/92bf190485aa8efbaba418b944da2f9399107c90/tensorflow_transform/impl_helper.py#L309-L330

Looking at the output schema inference, all SparseTensors are indeed mapped to VarLenFeatures: https://github.com/tensorflow/transform/blob/e3aecb4eaca1e0848dd7f3e2c9bbaae3ba161f2f/tensorflow_transform/schema_inference.py#L62-L68

So it looks like SparseTensors are only really expected to be used to allow for different instance lengths within a batch (essentially ragged tensors), not for instances that are sparse themselves.
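For reference, the distinction at the tf.Example parsing level (the feature-spec key names below are illustrative, not taken from any codebase):

import tensorflow as tf

# A VarLenFeature parses into a left-aligned SparseTensor: the column index
# of each value is just its position within the instance.
varlen_spec = tf.io.VarLenFeature(dtype=tf.int64)

# A SparseFeature parses into a SparseTensor in a fixed-size space: the
# indices and values come from two separate keys in the tf.Example.
sparse_spec = tf.io.SparseFeature(
    index_key='sparse_idx',
    value_key='sparse_val',
    dtype=tf.float32,
    size=70_000)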

I'm seeing a TODO on line 326 of impl_helper that seems to suggest there is a plan to open up support for multi-dimensional SparseTensors (i.e. the first case of impl_helper line 315). Would you consider implementing that change in the output schema inference, and also yielding SparseFeatures in the second case of impl_helper 315 where instance indices are not [0, len(indices)]?

Or if not, what about an option to specify an output Schema rather than having Transform infer it?

Thanks!

zoyahav commented 4 years ago

Your conclusion is correct: SparseTensors in TFT are used to allow features with different numbers of values.

Would multi-dimensional SparseTensor outputs solve your issue, or do you need full SparseTensor support (not only left-aligned) as well?

ghost commented 4 years ago

Thanks @zoyahav. I would need full SparseTensor support, not only left-aligned.

zoyahav commented 4 years ago

I see. Would you consider outputting the (batched) values, indices, and dense_shape from the preprocessing_fn as left-aligned SparseTensors, and then composing the full SparseTensor outside of TFT?
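Something like this, perhaps (a rough, untested sketch; the output key and helper name are illustrative):

import tensorflow as tf

def preprocessing_fn(inputs):
    # Pass the column indices through unchanged; as a left-aligned
    # SparseTensor they are a valid varlen output for TFT.
    return {'col_inds': inputs['col_ind_list']}

def compose_sparse(col_inds, full_dim=70_000):
    # Downstream of TFT: reassemble the true SparseTensor from the
    # left-aligned one, mirroring the original preprocessing_fn.
    row_inds = col_inds.indices[:, 0]
    inds = tf.stack([row_inds, col_inds.values], axis=1)
    values = tf.ones_like(col_inds.values, dtype=tf.float32)
    dense_shape = [col_inds.dense_shape[0], full_dim]
    return tf.sparse.reorder(
        tf.SparseTensor(indices=inds, values=values, dense_shape=dense_shape))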

ghost commented 4 years ago

My initial impulse was to put all of the preprocessing in Transform, but you are absolutely right that SparseTensor composition could be kicked down the pipeline. One added advantage of that approach is that a TF Serving model deployment's input signature would be more similar to the Transform output signature.

misael-manjarres commented 3 years ago

Though that is a good option, it would still be good to implement the feature suggested above. I have the same problem, but in my case I would like to output the label of my data as the SparseFeature, which makes composing the sparse vector downstream more difficult.

An example of why this would be needed is high-dimensional multi-label classification, where the labels are stored as [non-zero indices, values], which is the natural sparse tensor format.
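Concretely (illustrative numbers, reusing the example above), the labels for a batch of three examples in a 70_000-way label space would be:

import tensorflow as tf

# Each row is one example; each non-zero column is one of its labels.
labels = tf.SparseTensor(
    indices=[[0, 56787], [1, 1773], [1, 3257], [2, 10938]],
    values=tf.ones(4),
    dense_shape=[3, 70_000])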

zoyahav commented 3 years ago

Following up here: as of the tf.transform 0.28 release, it has full SparseTensor support. Feel free to open a new issue if you run into problems with it.

EdwardCuiPeacock commented 2 years ago

if len(indices.shape) > 1 or np.any(indices != np.arange(len(indices))):
    raise ValueError('Encountered a SparseTensorValue that cannot be '
                     'decoded by ListColumnRepresentation.\n'
                     '"{}" : {}'.format(name, value))

As of version 1.7.0, I still see that this check exists, and it enforces that the sparse tensor has a ragged (left-aligned) shape. We encountered the same error as above in 1.0.0, so it appears this issue is not fixed yet.

zoyahav commented 2 years ago

This snippet is for the varlen case (tf.io.VarLenFeature), which TFT encodes as a left-aligned sparse tensor for backwards-compatibility reasons. For now, in order to force TFT to interpret it as truly sparse, please try adding an empty dimension to it (tf.expand_dims(x, -1)) and let us know if this works. We are working on a backwards-compatible way to allow users to annotate single-dimension sparse tensors as non-varlen, but that's not ready yet.
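A sketch of that workaround (untested; note that since x here is a SparseTensor, the sparse variant tf.sparse.expand_dims is presumably the applicable call):

import tensorflow as tf

def preprocessing_fn(inputs):
    x = inputs['col_ind_list']  # left-aligned SparseTensor from a VarLenFeature
    # A trailing size-1 dimension makes the tensor rank 3, so TFT should no
    # longer fall into the backwards-compatible varlen interpretation.
    return {'col_ind_list_sparse': tf.sparse.expand_dims(x, axis=-1)}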