tensorflow / data-validation

Library for exploring and validating machine learning data
Apache License 2.0

hot key issue #230

Closed zexuan-zhou closed 1 year ago

zexuan-zhou commented 1 year ago

Hi,

Cool, thanks. I’m using the GenerateStatistics API and running into hot key issues in this step:

GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PostCombineFn)/GroupByKey/Read
+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PostCombineFn)/Combine
+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PostCombineFn)/Combine/Extract
+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/FlattenFeatureStatistics/OutputIdentity
+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/AddSliceKeyToStatsProto
+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/ToList/ToList/KeyWithVoid
+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/ToList/ToList/CombinePerKey/GroupByKey
+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/ToList/ToList/CombinePerKey/Combine/Partial
+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/ToList/ToList/CombinePerKey/GroupByKey/Write

I wonder whether TensorFlow Data Validation has any functionality to mitigate this. I looked at StatsOptions but did not see any hot-key-related parameter.

Unfortunately, I do not have a good code snippet that reproduces the problem, so this is a general question.

Thanks

zwestrick commented 1 year ago

A hot key is to some extent expected here, since we need to do a global combine at the end of the pipeline.

We do use hot_key_fanout to help mitigate this [1], but the fanout factor isn't wired to any option. A larger value might help, but I suspect not dramatically.
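For readers unfamiliar with the idea, here is a minimal pure-Python sketch of what a hot key fanout does conceptually: values for one hot key are first spread over several sub-keys and combined separately, then the partial results are combined again. This is only an illustration, not Beam's or TFDV's implementation; the function name `combine_with_fanout` and its parameters are hypothetical. In Beam itself the equivalent knob is `CombinePerKey(...).with_hot_key_fanout(n)`.

```python
import random
from collections import defaultdict


def combine_with_fanout(values, fanout, combine_fn=sum, seed=0):
    """Two-stage combine for the values of a single hot key.

    Hypothetical illustration only. `combine_fn` must be associative and
    commutative so that combining partial results gives the same answer
    as combining everything at once.
    """
    rng = random.Random(seed)
    shards = defaultdict(list)
    # Stage 1: scatter the values over up to `fanout` sub-keys. In a real
    # pipeline each sub-key could be combined on a different worker.
    for value in values:
        shards[rng.randrange(fanout)].append(value)
    partials = [combine_fn(shard) for shard in shards.values()]
    # Stage 2: only the (at most `fanout`) partial results reach the
    # single worker that owns the hot key.
    return combine_fn(partials), len(partials)


total, num_partials = combine_with_fanout(range(1000), fanout=8)
```

The final combine still runs on one worker, but it now sees at most `fanout` pre-combined inputs instead of every element, which is why a larger fanout can help but does not remove the global combine.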

There's also an option, experimental_num_feature_partitions, that was intended to help speed up the final combine by partitioning the feature space and then only globally combining the (small) result protos. It worked well on synthetic datasets with many features, but hasn't really helped real-world performance. You might try setting it to a nonzero value on the order of the number of workers you have, but I'm not sure it will help here.

I'm assuming this is a problem because it's slowing down the pipeline walltime?

Are you processing many (~1000s of) features, and are you using slicing here?

[1] https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/stats_impl.py#L278

zexuan-zhou commented 1 year ago

Thank you.

Yes, I have ~1000 features, and the code to generate statistics is just a one-liner:

import tensorflow_data_validation as tfdv
...

... tfdv.GenerateStatistics(...) ...

I think GenerateStatistics calls GenerateStatisticsImpl, if that's what you mean by slicing.

As for experimental_num_feature_partitions, thanks, I'll give it a try.

gaikwadrahul8 commented 1 year ago

Hi, @zexuan-zhou

Apologies for the delay. As @zwestrick mentioned above, there is the option experimental_num_feature_partitions: if > 1, it partitions computations by supported generators to act on this many bundles of features. For best results, it should be set to at least several times less than the number of features in the dataset, and never more than the available Beam parallelism.

To control how the results are combined, there is also the option experimental_result_partitions: the number of feature partitions to combine the output DatasetFeatureStatisticsLists into. If set to 1 (the default), the output is globally combined. If set to a value greater than one, up to that many shards are returned, each containing a subset of the features.
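As a rough sketch of how the guidance above could be turned into a setting: the helper below picks a partition count that is no larger than the number of workers and several times smaller than the number of features. The helper names and the factor of 4 used for "several times less" are assumptions for illustration, not part of TFDV; only `tfdv.StatsOptions` and its `experimental_num_feature_partitions` argument come from the library.

```python
def suggest_feature_partitions(num_features, num_workers, features_per_partition=4):
    """Hypothetical helper: choose a value for experimental_num_feature_partitions.

    Assumption: "several times less than the number of features" is read
    as at least `features_per_partition` features per partition, and
    "available Beam parallelism" is approximated by `num_workers`.
    """
    if num_features < 1 or num_workers < 1:
        raise ValueError("num_features and num_workers must be positive")
    upper_bound = max(1, num_features // features_per_partition)
    return max(1, min(num_workers, upper_bound))


def build_stats_options(num_features, num_workers):
    """Builds StatsOptions with the suggested partition count.

    TFDV is imported lazily so the helper above can be used without it.
    """
    import tensorflow_data_validation as tfdv

    partitions = suggest_feature_partitions(num_features, num_workers)
    # experimental_result_partitions is left at its default of 1, so the
    # output is still globally combined into a single result.
    return tfdv.StatsOptions(experimental_num_feature_partitions=partitions)
```

For example, with ~1000 features and 50 workers, `suggest_feature_partitions(1000, 50)` returns 50. Whether this actually reduces the hot key pressure depends on the dataset, as noted earlier in the thread.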

You can refer to our complete GenerateStatistics code, where we use hot_key_fanout, here, and to the official "Options for generating statistics" documentation here. I hope this helps you resolve your issue.

Could you please confirm whether this issue is resolved for you? If it is, please feel free to close it.

If the issue still persists, please let us know.

Thank you!

gaikwadrahul8 commented 1 year ago

Hi, @zexuan-zhou

Closing this issue due to lack of recent activity for a couple of weeks. Please feel free to reopen the issue or post a comment if you need any further assistance or updates.

Thank you!