tensorflow / data-validation

Library for exploring and validating machine learning data
Apache License 2.0

Issue using `allowlist_features` and `denylist_features` in `visualize_statistics` #212

Open wronk opened 2 years ago

wronk commented 2 years ago

Overview

I'm having trouble specifying which features to include or exclude when visualizing statistics in TFDV. The `allowlist_features` and `denylist_features` arguments seem to require `tensorflow_data_validation.types.FeaturePath` objects, and it took me a while to figure out how to construct one. This doesn't seem very user-friendly. Was the intent to allow a list of strings to be passed?

Code to reproduce

I can reproduce the problem in the public colab example. In the "Compute and Visualize Statistics" section of that notebook, change the `visualize_statistics` call to `tfdv.visualize_statistics(train_stats, denylist_features=['pickup_community_area'])`. The first feature should then be absent from the visualization (if I'm calling this correctly), but it is still shown.


Workaround code

To make this work, I have to construct a `tensorflow_data_validation.types.FeaturePath` object manually. Perhaps it would be better to do the filter comparison on each feature's path string?

# Show string name of feature
first_feat = train_stats.datasets[0].features[0]
print(first_feat.path)

# Construct the FeaturePath object needed to make the `allowlist_features` filter work
from tensorflow_data_validation import types
print(types.FeaturePath.from_proto(first_feat.path))

# docs-infra: no-execute
tfdv.visualize_statistics(train_stats, allowlist_features=[types.FeaturePath.from_proto(first_feat.path)])
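To illustrate the suggestion above (accepting plain strings in the filter), here is a minimal sketch that does not use TFDV at all. The `FeaturePath` class below is a simplified, hypothetical stand-in for `tensorflow_data_validation.types.FeaturePath`, and `to_feature_path` and `filter_features` are made-up helper names, not part of the library. The feature name `fare` is only illustrative. The sketch shows why a bare string never matches a path object in a membership test, and how normalizing strings to paths before comparing would let both forms work.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple, Union


@dataclass(frozen=True)
class FeaturePath:
    """Hypothetical stand-in for TFDV's FeaturePath: a tuple of path steps."""
    steps: Tuple[str, ...]


FeatureSpec = Union[str, FeaturePath]


def to_feature_path(feature: FeatureSpec) -> FeaturePath:
    """Wrap a bare feature name in a single-step path; pass paths through."""
    if isinstance(feature, FeaturePath):
        return feature
    return FeaturePath((feature,))


def filter_features(
    all_paths: Iterable[FeaturePath],
    allowlist: Optional[Iterable[FeatureSpec]] = None,
    denylist: Optional[Iterable[FeatureSpec]] = None,
) -> List[FeaturePath]:
    """Keep allowlisted paths, or drop denylisted ones, accepting str or FeaturePath."""
    # Design choice of this sketch: the two filters are mutually exclusive.
    if allowlist is not None and denylist is not None:
        raise ValueError("Pass at most one of allowlist or denylist.")
    if allowlist is not None:
        allowed = {to_feature_path(f) for f in allowlist}
        return [p for p in all_paths if p in allowed]
    if denylist is not None:
        denied = {to_feature_path(f) for f in denylist}
        return [p for p in all_paths if p not in denied]
    return list(all_paths)


paths = [FeaturePath(("pickup_community_area",)), FeaturePath(("fare",))]

# A bare string never compares equal to a path object, so a naive
# membership test silently matches nothing.
print("pickup_community_area" in paths)

# After normalization, a list of strings filters as expected.
print(filter_features(paths, denylist=["pickup_community_area"]))
```

With this kind of normalization, callers could pass either strings or path objects, and existing code that already passes path objects would keep working.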