StatisticsGen treats zeros as missing data after FileBasedExampleGen with parquet_executor

If the bug is related to a specific library below, please raise an issue in the respective repo directly:

System information

Have I specified the code to reproduce the issue (Yes, No): Yes
Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Linux, AWS EC2 instance, Jupyter notebook
TensorFlow version: 2.13.1
TFX Version: 1.14.0
Python version: 3.9.18
Python dependencies (from pip freeze output):

Describe the current behavior

Describe the expected behavior

Standalone code to reproduce the issue

```python
import os
import pandas as pd
import numpy as np
import string
import sys
import tensorflow as tf
from tfx import v1 as tfx
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
from google.protobuf.json_format import MessageToDict
from tfx.components import FileBasedExampleGen, CsvExampleGen
from tfx.components.example_gen.custom_executors import parquet_executor
from tfx.dsl.components.base import executor_spec

_pipeline_root = './pipeline/'
_data_root = './gen_data/'
os.makedirs(_data_root, exist_ok=True)  # to_parquet fails if the directory is missing

# 100 rows x 5 integer columns (A..E) with values in {0, 1, 2}: many zeros, no missing values.
arr_random = np.random.randint(low=0, high=3, size=(100, 5))
columns = list(string.ascii_uppercase[0:5])
df = pd.DataFrame(arr_random, columns=columns)
df.to_parquet('./gen_data/lots_of_zeros.parquet', index=False)

# Ingest the parquet file with FileBasedExampleGen and the parquet executor.
context = InteractiveContext(pipeline_root=_pipeline_root)
custom_executor_spec = executor_spec.BeamExecutorSpec(parquet_executor.Executor)
example_gen = FileBasedExampleGen(input_base=_data_root,
                                  custom_executor_spec=custom_executor_spec)
context.run(example_gen)

# Compute and display statistics over the ingested examples.
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])
context.run(statistics_gen)
context.show(statistics_gen.outputs['statistics'])
```

Visually inspecting this result, I find the following errors for the numeric features:

The fraction of missing values is greater than zero (which is wrong, there are no missing values).
The fraction of zeros is 0% (which is wrong, there are several zeros).
The mean value is incorrect.
The standard deviation is incorrect.
The min value is not 0, as it should be, but rather 1.
The median value is wrong (as it doesn't count how many zeros are in the data).
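All of these symptoms are what one would see if every zero were dropped before the statistics were computed. The sketch below illustrates that in plain Python on one synthetic column shaped like the repro data. It is not TFDV's implementation: `summarize` and `zeros_are_missing` are names made up for this illustration, and the population standard deviation is just a choice made here.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
# One column like those in the repro: 100 integers drawn from {0, 1, 2}.
values = [random.randrange(0, 3) for _ in range(100)]


def summarize(column, zeros_are_missing=False):
    """Summary statistics; optionally drop zeros first to mimic the suspected bug."""
    present = [v for v in column if v != 0] if zeros_are_missing else list(column)
    n = len(column)
    return {
        "missing_fraction": (n - len(present)) / n,
        "zeros_fraction": sum(1 for v in present if v == 0) / n,
        "mean": statistics.mean(present),
        "std_dev": statistics.pstdev(present),
        "min": min(present),
        "median": statistics.median(present),
    }


expected = summarize(values)
zeros_dropped = summarize(values, zeros_are_missing=True)
for name in expected:
    print(f"{name:17s} expected={expected[name]:.3f}  zeros-dropped={zeros_dropped[name]:.3f}")
```

In the zeros-dropped column, the missing fraction equals the true fraction of zeros, the zeros fraction is 0, the min is 1, and the mean is shifted upward, which matches the pattern reported above.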

What I think is happening is that FileBasedExampleGen creates a sparse representation of the parquet input file, and StatisticsGen interprets it as if there were no zeros in the input file.
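One way to check this hypothesis is to read back the `tf.train.Example` records that ExampleGen wrote and count, per feature, how many examples carry no value at all versus how many carry an explicit 0. The following is a rough sketch, not a confirmed diagnosis. The helper names `load_feature_rows` and `count_empty_and_zero` are invented here, and it assumes the output files are gzip-compressed TFRecords under a `Split-train` directory of the examples artifact.

```python
import os
from collections import Counter


def load_feature_rows(split_dir, limit=100):
    """Read gzip-compressed TFRecords of tf.train.Example into plain dicts."""
    import tensorflow as tf  # imported lazily so the counting helper works without TF

    files = [os.path.join(split_dir, name) for name in os.listdir(split_dir)]
    dataset = tf.data.TFRecordDataset(files, compression_type="GZIP")
    rows = []
    for record in dataset.take(limit):
        example = tf.train.Example()
        example.ParseFromString(record.numpy())
        row = {}
        for name, feature in example.features.feature.items():
            kind = feature.WhichOneof("kind")  # None when the feature holds no list
            row[name] = list(getattr(feature, kind).value) if kind else []
        rows.append(row)
    return rows


def count_empty_and_zero(rows):
    """Per feature: number of examples with no value, and number containing a 0."""
    empty, zero = Counter(), Counter()
    for row in rows:
        for name, vals in row.items():
            if not vals:
                empty[name] += 1
            elif 0 in vals:
                zero[name] += 1
    return dict(empty), dict(zero)


# Usage in the notebook, after context.run(example_gen) (path layout is an assumption):
# split_dir = os.path.join(example_gen.outputs['examples'].get()[0].uri, 'Split-train')
# print(count_empty_and_zero(load_feature_rows(split_dir)))
```

If the hypothesis holds, the parquet path should show non-empty `empty` counts and no `zero` counts for columns A to E, while the same check on CsvExampleGen output should show the opposite.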

This is in contrast to CsvExampleGen, which, for the same input (but saved as CSV), reports no missing values, shows the correct number of zeros, and shows the correct statistics.

Providing a bare minimum test case or step(s) to reproduce the problem will greatly help us to debug the issue. If possible, please share a link to Colab/Jupyter/any notebook.

Name of your Organization (Optional)

cpacket

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

[Image: Screenshot 2023-10-30 at 1 51 03 PM]

tensorflow / tfx

StatisticsGen treats zeros as missing data after FileBasedExampleGen with parquet_executor #6407