tensorflow / transform

BoolDomain not supported in tf_metadata.schema_utils._set_domain #111

Closed TimSmole closed 5 years ago

TimSmole commented 5 years ago

I am having trouble reading a simple schema that contains only one boolean feature. The schema was generated with the tfdv.infer_schema function and saved to a file using the tfdv.write_schema_text function. The file looks like this:

feature {
  name: "test"
  value_count {
    min: 1
    max: 1
  }
  type: INT
  bool_domain {
  }
  presence {
    min_fraction: 1.0
    min_count: 1
  }
}

When I try to read it in the following way:

from tensorflow_transform.tf_metadata import metadata_io

metadata_io.read_metadata(".") # "." is path to directory where schema.pbtxt is stored

I get the following ValueError:

ValueError                                Traceback (most recent call last)
<ipython-input-39-68264a3c7170> in <module>
      9 
     10 from tensorflow_transform.tf_metadata import metadata_io
---> 11 metadata_io.read_metadata(".")

/usr/local/lib/python3.5/dist-packages/tensorflow_transform/tf_metadata/metadata_io.py in read_metadata(path)
     49   features_spec, domains = schema_utils.schema_as_feature_spec(schema_proto)
     50   return dataset_metadata.DatasetMetadata(
---> 51       dataset_schema.from_feature_spec(features_spec, domains))
     52 
     53 

/usr/local/lib/python3.5/dist-packages/tensorflow_transform/tf_metadata/dataset_schema.py in from_feature_spec(feature_spec, domains)
    151   column_schemas = {name: (domains.get(name), spec)
    152                     for name, spec in feature_spec.items()}
--> 153   return Schema(column_schemas)

/usr/local/lib/python3.5/dist-packages/tensorflow_transform/tf_metadata/dataset_schema.py in __init__(self, column_schemas)
     45                if domain is not None}
     46     self._schema_proto = schema_utils.schema_from_feature_spec(
---> 47         feature_spec, domains)
     48 
     49   def __eq__(self, other):

/usr/local/lib/python3.5/dist-packages/tensorflow_transform/tf_metadata/schema_utils.py in schema_from_feature_spec(feature_spec, domains)
     69     else:
     70       result.feature.add().CopyFrom(
---> 71           _feature_from_feature_spec(spec, name, domains))
     72   return result
     73 

/usr/local/lib/python3.5/dist-packages/tensorflow_transform/tf_metadata/schema_utils.py in _feature_from_feature_spec(spec, name, domains)
    127 
    128   _set_type(name, feature, spec.dtype)
--> 129   _set_domain(name, feature, domains.get(name))
    130   return feature
    131 

/usr/local/lib/python3.5/dist-packages/tensorflow_transform/tf_metadata/schema_utils.py in _set_domain(name, feature, domain)
    157   else:
    158     raise ValueError(
--> 159         'Feature "{}" has invalid domain {}'.format(name, domain))
    160 
    161 

ValueError: Feature "test" has invalid domain 

Debugging shows that the _set_domain function supports only IntDomain, StringDomain and FloatDomain, but not BoolDomain or the other domains (StructDomain, NaturalLanguageDomain, ImageDomain, MIDDomain, URLDomain, TimeDomain).

Is there a reason for this? Is this still a work in progress, should it be considered a bug, or am I missing something?

I am using:

tensorflow-data-validation==0.13.1
tensorflow-transform==0.13.0
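Based on the debugging result above, a schema file can be checked up front for domain fields that _set_domain would reject. The following is only a minimal sketch that works on the text form of schema.pbtxt using the standard library. The helper name find_unsupported_domains and the constant SUPPORTED_DOMAIN_FIELDS are hypothetical, not part of tensorflow-transform, and the supported set is taken from the report above, not from the library's documentation.

```python
import re

# Domain fields that, per the debugging described in this issue, _set_domain
# accepts. This set is an assumption drawn from the report, not an official list.
SUPPORTED_DOMAIN_FIELDS = {"int_domain", "float_domain", "string_domain"}


def find_unsupported_domains(schema_text):
    """Return the sorted names of *_domain message fields that are not supported.

    This is a text-level scan of a schema.pbtxt file, not a protobuf parse.
    """
    # Match a field name ending in "_domain" that opens a message block,
    # e.g. "bool_domain {" or "bool_domain: {".
    found = set(re.findall(r"\b(\w+_domain)\s*:?\s*\{", schema_text))
    return sorted(found - SUPPORTED_DOMAIN_FIELDS)


if __name__ == "__main__":
    example = 'feature {\n  name: "test"\n  type: INT\n  bool_domain {\n  }\n}\n'
    print(find_unsupported_domains(example))
```

Running the scan before metadata_io.read_metadata makes it possible to report every offending field at once, since the ValueError stops at the first one.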

cyc commented 5 years ago

@TimSmole I have encountered this as well, and for now the workaround is just to remove these fields from your schema before saving it.

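# `schema` is the Schema proto returned by tfdv.infer_schema.
for feature in schema.feature:
    # Drop the bool_domain, which tf.Transform cannot convert.
    if feature.HasField('bool_domain'):
        feature.ClearField('bool_domain')

zoyahav commented 5 years ago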

Thanks for reporting this. We're aware of this issue and will be working with the tensorflow-data-validation team to resolve it.

This was a breaking change in tensorflow-transform 0.9. From the release notes: https://github.com/tensorflow/transform/releases/tag/v0.9.0

We now validate a Schema in its constructor to make sure that it can be converted to a feature spec. In particular only tf.int64, tf.string and tf.float32 types are allowed.

For now, this limitation can be worked around by, for example, using an integer type (tf.int64) and validating that the values are either 0 or 1: int_domain { min: 0 max: 1 }
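As a rough illustration of this workaround, the sketch below rewrites an empty bool_domain block in the text form of a schema into int_domain { min: 0 max: 1 } and adds a small check that values are 0 or 1. It uses only the standard library and operates on the file text, not on the Schema proto. The names bool_domain_to_int_domain and is_boolean_encoded are hypothetical helpers. The sketch assumes the feature already has type INT and that the bool_domain block is empty, as in the file from the original report; a non-empty bool_domain is left untouched.

```python
import re

# Matches an empty bool_domain block such as "  bool_domain {\n  }" and
# captures its indentation. Non-empty blocks are not matched (assumption:
# the schema only contains empty ones, as in the file shown in this issue).
_EMPTY_BOOL_DOMAIN = re.compile(r"^([ \t]*)bool_domain\s*\{\s*\}", re.MULTILINE)


def bool_domain_to_int_domain(schema_text):
    """Replace each empty bool_domain block with int_domain { min: 0 max: 1 }."""

    def _replace(match):
        indent = match.group(1)
        return (
            f"{indent}int_domain {{\n"
            f"{indent}  min: 0\n"
            f"{indent}  max: 1\n"
            f"{indent}}}"
        )

    return _EMPTY_BOOL_DOMAIN.sub(_replace, schema_text)


def is_boolean_encoded(values):
    """Return True if every value is 0 or 1."""
    return all(v in (0, 1) for v in values)


if __name__ == "__main__":
    original = 'feature {\n  name: "test"\n  type: INT\n  bool_domain {\n  }\n}\n'
    print(bool_domain_to_int_domain(original))
```

The rewritten text can be saved back to schema.pbtxt before calling metadata_io.read_metadata. Editing the parsed Schema proto directly would be more robust than a regular expression, so this text-level version is best treated as a stopgap.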

gowthamkpr commented 5 years ago

Closing this issue as it has been answered. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!