tensorflow / transform

Input pipeline framework

Error when dataset_schema.FixedColumnRepresentation has default_value == 0 #97

Closed AdrianLsk closed 5 years ago

AdrianLsk commented 5 years ago

It's caused by this piece of code:

if spec.default_value is not None:
      raise ValueError(
          'feature "{}" had default_value {}, but FixedLenFeature must have '
          'default_value=None'.format(name, spec.default_value))

Link to source

What's the rationale behind enforcing default_value=None? Usually a default value matching the feature's data type should be fine. It also seems like there could be a bug in the condition and that it should read without the not: if spec.default_value is None:
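
For context, a minimal sketch of what triggers it (the feature name is made up, and I'm assuming the conversion entry point is dataset_schema.from_feature_spec; it may live elsewhere in other releases):

import tensorflow as tf
from tensorflow_transform.tf_metadata import dataset_schema

# Hypothetical minimal repro: any FixedLenFeature whose default_value is not
# None trips the check quoted above and raises the ValueError.
feature_spec = {
    'age': tf.io.FixedLenFeature([], tf.int64, default_value=0),
}
schema = dataset_schema.from_feature_spec(feature_spec)  # raises ValueError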

KesterTong commented 5 years ago

The code is correct: we don't support FixedLenFeature with a default value because such a schema is transforming the data (by filling in the default value). Having transformations occur during parsing will prevent us from sharing code with other TFX components. If you would like to do something similar, please see the treatment of the features in OPTIONAL_NUMERIC_FEATURE_KEYS in https://github.com/tensorflow/transform/blob/master/examples/census_example.py
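
Roughly, the pattern there looks like this (condensed and lightly paraphrased from the example rather than a verbatim excerpt): declare the optional feature as a VarLenFeature, then densify it with a default inside preprocessing_fn:

import tensorflow as tf
import tensorflow_transform as tft

# Assumed keys; see OPTIONAL_NUMERIC_FEATURE_KEYS in census_example.py.
OPTIONAL_NUMERIC_FEATURE_KEYS = ['capital-gain', 'capital-loss']

def preprocessing_fn(inputs):
  outputs = {}
  for key in OPTIONAL_NUMERIC_FEATURE_KEYS:
    # The feature is declared as a VarLenFeature, so inputs[key] arrives as
    # a SparseTensor. Densify it here, filling missing instances with 0,
    # then transform it like any other dense feature.
    sparse = tf.SparseTensor(inputs[key].indices, inputs[key].values,
                             [inputs[key].dense_shape[0], 1])
    dense = tf.squeeze(tf.sparse.to_dense(sparse, default_value=0.), axis=1)
    outputs[key] = tft.scale_to_0_1(dense)
  return outputs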

wulikai1993 commented 4 years ago

@KesterTong The features in OPTIONAL_NUMERIC_FEATURE_KEYS are declared as VarLenFeature, and their values are SparseTensors; I don't understand this. Does it mean that if a feature can have missing values, it should be declared as a VarLenFeature, and its value handled as a SparseTensor? If not, how should one cope with missing values in a FixedLenFeature? Thank you!

KesterTong commented 4 years ago

@wulikai1993 I no longer work on Transform; @zoyahav may be able to help here.

zoyahav commented 4 years ago

tf.Transform doesn't support missing values. It supports features with a varying number of values by representing them as tf.SparseTensors (shaped like ragged tensors). Missing values can be handled before Transform, after reading the data.
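
To make that concrete, a batch of three instances where the second one has no value would look roughly like this:

import tensorflow as tf

# A batch of 3 instances for a VarLenFeature; instance 1 has no value.
# Values are left-aligned within each row, which is what "shaped like a
# ragged tensor" means here.
sp = tf.SparseTensor(indices=[[0, 0], [2, 0]],
                     values=[3.5, 7.0],
                     dense_shape=[3, 1])
print(tf.RaggedTensor.from_sparse(sp))  # [[3.5], [], [7.0]]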

wulikai1993 commented 4 years ago

@zoyahav The official tutorial handles missing values in the preprocessing_fn; is that the recommended way?

Missing values can be handled before Transform

But the preprocessing_fn runs inside Transform:

%%skip_for_export
%%writefile {_taxi_transform_module_file}

import tensorflow as tf
import tensorflow_transform as tft

import taxi_constants

_DENSE_FLOAT_FEATURE_KEYS = taxi_constants.DENSE_FLOAT_FEATURE_KEYS
_VOCAB_FEATURE_KEYS = taxi_constants.VOCAB_FEATURE_KEYS
_VOCAB_SIZE = taxi_constants.VOCAB_SIZE
_OOV_SIZE = taxi_constants.OOV_SIZE
_FEATURE_BUCKET_COUNT = taxi_constants.FEATURE_BUCKET_COUNT
_BUCKET_FEATURE_KEYS = taxi_constants.BUCKET_FEATURE_KEYS
_CATEGORICAL_FEATURE_KEYS = taxi_constants.CATEGORICAL_FEATURE_KEYS
_FARE_KEY = taxi_constants.FARE_KEY
_LABEL_KEY = taxi_constants.LABEL_KEY
_transformed_name = taxi_constants.transformed_name

def preprocessing_fn(inputs):
  """tf.transform's callback function for preprocessing inputs.
  Args:
    inputs: map from feature keys to raw not-yet-transformed features.
  Returns:
    Map from string feature key to transformed feature operations.
  """
  outputs = {}
  for key in _DENSE_FLOAT_FEATURE_KEYS:
    # Preserve this feature as a dense float, setting nan's to the mean.
    outputs[_transformed_name(key)] = tft.scale_to_z_score(
        _fill_in_missing(inputs[key]))

  for key in _VOCAB_FEATURE_KEYS:
    # Build a vocabulary for this feature.
    outputs[_transformed_name(key)] = tft.compute_and_apply_vocabulary(
        _fill_in_missing(inputs[key]),
        top_k=_VOCAB_SIZE,
        num_oov_buckets=_OOV_SIZE)

  for key in _BUCKET_FEATURE_KEYS:
    outputs[_transformed_name(key)] = tft.bucketize(
        _fill_in_missing(inputs[key]), _FEATURE_BUCKET_COUNT,
        always_return_num_quantiles=False)

  for key in _CATEGORICAL_FEATURE_KEYS:
    outputs[_transformed_name(key)] = _fill_in_missing(inputs[key])

  # Was this passenger a big tipper?
  taxi_fare = _fill_in_missing(inputs[_FARE_KEY])
  tips = _fill_in_missing(inputs[_LABEL_KEY])
  outputs[_transformed_name(_LABEL_KEY)] = tf.where(
      tf.math.is_nan(taxi_fare),
      tf.cast(tf.zeros_like(taxi_fare), tf.int64),
      # Test if the tip was > 20% of the fare.
      tf.cast(
          tf.greater(tips, tf.multiply(taxi_fare, tf.constant(0.2))), tf.int64))

  return outputs

def _fill_in_missing(x):
  """Replace missing values in a SparseTensor.
  Fills in missing values of `x` with '' or 0, and converts to a dense tensor.
  Args:
    x: A `SparseTensor` of rank 2.  Its dense shape should have size at most 1
      in the second dimension.
  Returns:
    A rank 1 tensor where missing values of `x` have been filled in.
  """
  default_value = '' if x.dtype == tf.string else 0
  return tf.squeeze(
      tf.sparse.to_dense(
          tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),
          default_value),
      axis=1)

zoyahav commented 4 years ago

No, this snippet assumes that your inputs are tf.SparseTensors that are left-aligned as I mentioned (shaped like ragged tensors). If you'd like to allow for missing values at any index, you'll have to handle that in a beam PTransform outside of the preprocessing_fn.

Since you linked to the TFX tutorial, I'll assume you're actually using tf.Transform within TFX, which doesn't let you modify the beam graph that easily. I'd suggest asking this on the TFX repo issues instead as that would allow us to give you the most relevant answers. Generally speaking, if you're using TFX the easiest way may be using a custom ExampleGen, but that wouldn't solve the issue at serving time.
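
Outside of TFX, the pre-Transform fill mentioned above could be a simple beam.Map over the decoded instances; a rough sketch with made-up feature names and defaults:

import apache_beam as beam

# Hypothetical per-feature defaults; adjust to your schema.
_DEFAULTS = {'trip_miles': 0.0, 'company': ''}

def _fill_defaults(instance):
  # Replace missing/None features with a default before the data reaches
  # tft_beam.AnalyzeAndTransformDataset.
  out = dict(instance)
  for key, default in _DEFAULTS.items():
    if out.get(key) is None:
      out[key] = default
  return out

# Inside the pipeline, after decoding and before Transform:
# raw_data = raw_data | 'FillMissing' >> beam.Map(_fill_defaults)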

wulikai1993 commented 4 years ago

Thanks for your reply! I need to learn more about TFX and its related components.

agonojo commented 4 years ago

Thanks for your reply! I need to learn more about TFX and its related components.

@wulikai1993 how did you personally end up handling this? Did you proceed as zoyahav recommended and deal with missing values in Beam? If so, could you link the question here if you also asked it on the TFX repo issues board? I will check.

wulikai1993 commented 4 years ago

@agonojo I didn't ask on the TFX repo. I just changed FixedLenFeature to VarLenFeature, and it worked.

RuhuaJiang commented 4 years ago

"Having transformations occur during parsing will prevent us from sharing code with other TFX components. "

is there some fundamental issue with that approach or that could be solved with some effort?

it is very unintuitive and feels a hack to be forced to use VarLenFeature (which gives us a SparseTensor but we need a Dense Tensor) for dealing with missing value

zoyahav commented 4 years ago

Since your data is batched, it may be that some instances in the batch are present while others are missing. There's no way to generically represent such data with a dense Tensor. Ideally TFT would provide RaggedTensors for VarLenFeatures, since that would likely be more intuitive, but that's not currently the case. If it would be easier for you to handle RaggedTensors instead of SparseTensors in your preprocessing_fn, you can definitely convert them using tf.RaggedTensor.from_sparse().
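
For example, a minimal sketch (the feature name is made up, and the shape= argument to to_tensor assumes TF >= 2.2):

import tensorflow as tf

def preprocessing_fn(inputs):
  # 'fare' is a hypothetical VarLenFeature, so inputs['fare'] is a 2-D
  # SparseTensor with at most one value per instance.
  ragged = tf.RaggedTensor.from_sparse(inputs['fare'])
  # Pad every row to width 1 and squeeze, yielding a dense rank-1 tensor
  # with 0.0 wherever the value was missing.
  dense = tf.squeeze(ragged.to_tensor(default_value=0.0, shape=[None, 1]),
                     axis=1)
  return {'fare_filled': dense}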

RuhuaJiang commented 4 years ago

Thanks @zoyahav, I understand that part.

I wasn't clear enough. I was basically questioning why FixedLenFeature with a default value isn't supported, especially given this argument:

Having transformations occur during parsing will prevent us from sharing code with other TFX components.

cc @pavanky