A hot key is to some extent expected here, since we need to do a global combine at the end of the pipeline.
We do use hot_key_fanout to help mitigate this [1], but the fanout factor isn't wired to any option. A larger value might help, but I suspect not dramatically.
There's also an option, experimental_num_feature_partitions, that was intended to help speed up the final combine by partitioning the feature space and then only globally combining the (small) result protos. It worked well on a synthetic dataset with many features, but hasn't really helped real-world performance. You might try setting it to a nonzero value on the order of the number of workers you have, but I'm not sure it will help here.
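For illustration, assuming a TFDV version where that experimental knob is exposed on StatsOptions, wiring it up would look roughly like this (the value 32 is just a placeholder on the order of the number of workers):

```python
import tensorflow_data_validation as tfdv

# Sketch only: partition per-feature computations so the final global
# combine only has to merge small per-partition result protos.
# 32 is a placeholder roughly equal to the number of workers.
stats_options = tfdv.StatsOptions(
    experimental_num_feature_partitions=32,
)
```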
I'm assuming this is a problem because it's slowing down the pipeline walltime?
Are you processing many (~1000s) of features, and are you using slicing here?
Thank you.
Yes, I have ~1000 features, and the code to generate statistics is just a one-liner:
import tensorflow_data_validation as tfdv
...
... tfdv.GenerateStatistics(...) ...
and I think GenerateStatistics calls GenerateStatisticsImpl, if that's what you mean by slicing.
As for experimental_num_feature_partitions, thanks, I'll give it a try.
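For concreteness, a one-liner like that usually sits inside a Beam pipeline roughly as sketched below; the TFXIO source, file patterns, batch size, and options are illustrative placeholders, not the actual pipeline from this issue.

```python
# Illustrative sketch only -- paths and parameters are placeholders.
import apache_beam as beam
import tensorflow_data_validation as tfdv
from tfx_bsl.public import tfxio

stats_options = tfdv.StatsOptions()

with beam.Pipeline() as p:
    _ = (
        p
        # Read tf.Example TFRecords into Arrow RecordBatches.
        | 'ReadExamples' >> tfxio.TFExampleRecord(
            file_pattern='gs://my-bucket/examples*').BeamSource(batch_size=1000)
        # The "one-liner": compute statistics over the ~1000 features.
        | 'GenerateStatistics' >> tfdv.GenerateStatistics(stats_options)
        | 'WriteStats' >> tfdv.WriteStatisticsToTFRecord(
            output_path='gs://my-bucket/stats.tfrecord')
    )
```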
Hi, @zexuan-zhou
Apologies for the delay. As @zwestrick mentioned above, the option experimental_num_feature_partitions, if > 1, partitions computations by supported generators to act on that many bundles of features. For best results it should be set to at least several times less than the number of features in a dataset, and never more than the available Beam parallelism.
To combine the result, there is also the option experimental_result_partitions: the number of feature partitions to combine output DatasetFeatureStatisticsLists into. If set to 1 (the default), the output is globally combined. If set to a value greater than one, up to that many shards are returned, each containing a subset of features.
You can refer to our complete GenerateStatistics code, where we use hot_key_fanout, and you can also refer to the official documentation on options for generating statistics. I hope this will help you resolve your issue.
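For example, assuming a TFDV version whose StatsOptions exposes both experimental fields, setting them together might look like this (the values are placeholders):

```python
import tensorflow_data_validation as tfdv

# Placeholder values for illustration only.
stats_options = tfdv.StatsOptions(
    # Partition per-feature computations into this many bundles; keep it
    # several times smaller than the number of features and no larger
    # than the available Beam parallelism.
    experimental_num_feature_partitions=16,
    # Return up to this many DatasetFeatureStatisticsList shards instead
    # of globally combining everything into a single proto.
    experimental_result_partitions=16,
)
```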
Could you please confirm whether this issue is resolved for you? Please feel free to close the issue if it is resolved.
If the issue still persists, please let us know.
Thank you!
Hi, @zexuan-zhou
Closing this issue due to a lack of recent activity over the last couple of weeks. Please feel free to reopen the issue or post a comment if you need any further assistance or an update.
Thank you!
Hi,
Cool, thanks. I'm using the GenerateStatistics API and I'm running into hot key issues in this fused stage:
GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PostCombineFn)/GroupByKey/Read+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PostCombineFn)/Combine+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators[0]/CombinePerKey(PostCombineFn)/Combine/Extract+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/FlattenFeatureStatistics/OutputIdentity+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/AddSliceKeyToStatsProto+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/ToList/ToList/KeyWithVoid+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/ToList/ToList/CombinePerKey/GroupByKey+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/ToList/ToList/CombinePerKey/Combine/Partial+GenerateStatistics(EVALUATION)/RunStatsGenerators/GenerateSlicedStatisticsImpl/ToList/ToList/CombinePerKey/GroupByKey/Write
I wonder if TensorFlow Data Validation has any functionality to mitigate this. I looked at StatsOptions but I do not see any hot-key-related parameter.
Unfortunately, I do not have a good code snippet for a reproducible example, so this is a general question.
Thanks