Closed — ghost closed this issue 3 years ago
Your conclusion is correct, SparseTensors in TFT are used to allow features with different numbers of values.
Would multi-dimensional SparseTensor outputs solve your issue, or do you need to have full SparseTensor support as well? (not only left-aligned)
Thanks @zoyahav. I would need full SparseTensor support, not only left-aligned.
I see, would you consider outputting (batched) values, indices, dense_shape from the preprocessing_fn as left aligned SparseTensors and then composing the full SparseTensor outside of TFT?
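A minimal sketch of that approach, assuming hypothetical output feature names `label_indices`, `label_values`, and `label_dense_shape`: the `preprocessing_fn` emits the three components as plain outputs, and the full `SparseTensor` is rebuilt downstream, outside of TFT.

```python
import tensorflow as tf

def compose_sparse(batch):
    """Rebuild a full SparseTensor from three plain component outputs.

    The feature names here are assumptions for illustration, not TFT
    conventions.
    """
    return tf.sparse.reorder(tf.SparseTensor(
        indices=batch['label_indices'],
        values=batch['label_values'],
        dense_shape=batch['label_dense_shape']))

# Example: a batch of 2 instances in a 70_000-dim space, with one
# non-zero entry each (column 3 for instance 0, column 17 for instance 1).
batch = {
    'label_indices': tf.constant([[0, 3], [1, 17]], dtype=tf.int64),
    'label_values': tf.constant([1.0, 1.0]),
    'label_dense_shape': tf.constant([2, 70000], dtype=tf.int64),
}
sp = compose_sparse(batch)
```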
My initial impulse was to put all of the preprocessing in Transform but you are absolutely right that SparseTensor composition could be kicked down the pipeline. One added advantage of that approach would be that a TF serving model deployment input signature would be more similar to the Transform output signature.
Though that is a good option, it would still be good to implement the feature suggested above. I have the same problem, but in my case I would like to output the LABEL of my data as the SparseFeature, so making it a sparse vector downstream is more difficult.
An example of why this would be needed is a high dimensional multi-label classification, where the labels are stored as [non-zero indices, values] which is the natural sparse tensor format.
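For concreteness, a small sketch of that label format (the 70_000-class space is a hypothetical size): each instance stores only the indices of its active labels, which maps directly onto a `SparseTensor`.

```python
import tensorflow as tf

# Hypothetical multi-label instance: classes 3 and 17 are active out of
# a 70_000-class space, stored as [non-zero indices, values].
label = tf.SparseTensor(indices=[[3], [17]],
                        values=tf.ones(2),
                        dense_shape=[70000])

# Densifying recovers the full one-hot-style vector: mostly zeros,
# with 1.0 at positions 3 and 17.
dense = tf.sparse.to_dense(label)
```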
Following up here: as of the tf.transform 0.28 release, it has full SparseTensor support. Feel free to open a new issue if you run into problems with it.
```python
if len(indices.shape) > 1 or np.any(indices != np.arange(len(indices))):
  raise ValueError('Encountered a SparseTensorValue that cannot be '
                   'decoded by ListColumnRepresentation.\n'
                   '"{}" : {}'.format(name, value))
```
As of version 1.7.0, I still see that this check exists, and it enforces that the sparse tensor have a ragged (left-aligned) shape. We encountered the same error as above in 1.0.0, so it appears this issue is not fixed yet.
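For illustration, a standalone NumPy sketch (an assumption that it mirrors the check quoted above) of which per-instance index arrays that check accepts:

```python
import numpy as np

def is_left_aligned(indices):
    """True when indices are 1-D and exactly [0, 1, ..., n-1], i.e. the
    left-aligned (varlen) layout that ListColumnRepresentation can decode."""
    indices = np.asarray(indices)
    return len(indices.shape) == 1 and not np.any(
        indices != np.arange(len(indices)))

is_left_aligned([0, 1, 2])    # True: left-aligned varlen, accepted
is_left_aligned([3, 17, 41])  # False: true sparse indices, rejected
```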
This snippet is for the varlen case (`tf.io.VarLenFeature`), which TFT encodes as a left-aligned sparse tensor for backwards-compatibility reasons.
For now, in order to force TFT to interpret it as a true SparseTensor, please try adding an empty trailing dimension to it (`tf.expand_dims(x, -1)`) and let us know if that works.
We are working on a backwards-compatible way to allow users to annotate single-dim sparse tensors as non-varlen, but that's not ready yet.
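A minimal sketch of that workaround, using `tf.sparse.expand_dims` (the sparse variant of `tf.expand_dims`) to give a 1-D-per-instance sparse tensor an extra trailing dimension, so its rank no longer matches the varlen case:

```python
import tensorflow as tf

# A batch of 2 instances with true sparse indices in a 70_000-dim space
# (dimensions are hypothetical, matching the example in this thread).
x = tf.SparseTensor(indices=[[0, 3], [0, 41], [1, 17]],
                    values=tf.ones(3),
                    dense_shape=[2, 70000])

# Add an empty trailing dimension: shape becomes [2, 70000, 1], which
# TFT should no longer interpret as a left-aligned varlen feature.
x2 = tf.sparse.expand_dims(x, axis=-1)
```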
I would like to have a preprocessing function that maps a list of indices in a large feature space (input as a `VarLenFeature`) to `SparseTensor`s in that space. For example, given a sparse input whose values represent indices in the `[0, 70_000]` space, the preprocessing function will map it into the 70_000-dim space. (In practice `full_dim` would be defined as `tft.max(inputs['col_ind_list'])`.)
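A minimal sketch of such a `preprocessing_fn`, assuming the input feature is named `col_ind_list` and hard-coding `full_dim` for brevity: each value of the input becomes a column index of the output `SparseTensor`.

```python
import tensorflow as tf

def preprocessing_fn(inputs):
    """Map a varlen list of column indices into a SparseTensor in the
    full feature space (hypothetical sketch)."""
    full_dim = 70_000  # in practice derived via tft.max(inputs['col_ind_list'])
    sp = inputs['col_ind_list']  # left-aligned SparseTensor of indices
    rows = sp.indices[:, 0]                 # batch row of each value
    cols = tf.cast(sp.values, tf.int64)     # each value is a column index
    new_indices = tf.stack([rows, cols], axis=1)
    dense_shape = tf.stack(
        [sp.dense_shape[0], tf.constant(full_dim, dtype=tf.int64)])
    out = tf.SparseTensor(indices=new_indices,
                          values=tf.ones_like(cols, dtype=tf.float32),
                          dense_shape=dense_shape)
    return {'col_sparse': tf.sparse.reorder(out)}
```

Note that the output's second index dimension is no longer `[0, len(indices))` per instance, which is exactly what trips the `ListColumnRepresentation` check described below.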
However, using this `preprocessing_fn` in Transform fails with the following error:

```
ValueError: Encountered a SparseTensorValue that cannot be decoded by ListColumnRepresentation.
```

The source of the error makes clear that this is because the values are being mapped to a `VarLenFeature` rather than a `SparseFeature`: https://github.com/tensorflow/transform/blob/92bf190485aa8efbaba418b944da2f9399107c90/tensorflow_transform/impl_helper.py#L309-L330

Looking at the output schema inference, all `SparseTensor`s are indeed mapped to `VarLenFeature`s: https://github.com/tensorflow/transform/blob/e3aecb4eaca1e0848dd7f3e2c9bbaae3ba161f2f/tensorflow_transform/schema_inference.py#L62-L68

So it looks like `SparseTensor`s are only really expected to be used to allow for different instance lengths within a batch (essentially ragged tensors), not for instances that are sparse themselves.

I'm seeing a TODO on line 326 of `impl_helper` that seems to suggest there is a plan to open up support for multi-dimensional `SparseTensor`s (i.e. the first case of `impl_helper` line 315). Would you consider implementing that change in the output schema inference, and also yielding `SparseFeature`s in the second case of `impl_helper` line 315, where instance indices are not `[0, len(indices)]`? Or if not, what about an option to specify an output Schema rather than having Transform infer it?

Thanks!