paypal / NNAnalytics

NameNodeAnalytics is a self-help utility for scouting and maintaining the namespace of an HDFS instance.
Apache License 2.0

Investigate ContentSummary vs current implementation for Directories analysis #312

Closed pjeli closed 2 years ago

pjeli commented 2 years ago

Once #311 is implemented we can investigate whether it is better to leave directory analysis (our largest computation in terms of time) as it is, or to instead use ContentSummary for performing the computation.

The question ultimately comes down to this: is it better to start at the file level and work our way up to depth 3, or is it better to run ContentSummary on every directory at depth 3 and deeper? It is unclear, though I am tempted to say the current implementation is likely better, since it does not deal with locks or (too much) Object instantiation.
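To make the two strategies concrete, here is a minimal sketch on a toy in-memory tree. It is not the NNAnalytics code and does not call HDFS: `Node`, `byParent`, `bySubtreeSummary`, and `subtreeBytes` are hypothetical names, and the recursive subtree sum only stands in for what a ContentSummary call would compute per directory. It assumes the goal is the total bytes under every directory at depth 3 or deeper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DirectoryUsageSketch {

  /** Hypothetical in-memory node; not an HDFS or NNAnalytics type. */
  static final class Node {
    final String name;
    final Node parent;
    final boolean isDir;
    final long size; // bytes for files, 0 for directories
    final List<Node> children = new ArrayList<>();

    Node(String name, Node parent, boolean isDir, long size) {
      this.name = name;
      this.parent = parent;
      this.isDir = isDir;
      this.size = size;
      if (parent != null) parent.children.add(this);
    }

    /** Root is depth 0. */
    int depth() { return parent == null ? 0 : parent.depth() + 1; }

    String path() {
      if (parent == null) return "/";
      String p = parent.path();
      return p.equals("/") ? "/" + name : p + "/" + name;
    }
  }

  /** Bottom-up: start at each file and credit its size to every ancestor at depth >= minDepth. */
  static Map<String, Long> byParent(List<Node> files, int minDepth) {
    Map<String, Long> usage = new TreeMap<>();
    for (Node file : files) {
      for (Node dir = file.parent; dir != null && dir.depth() >= minDepth; dir = dir.parent) {
        usage.merge(dir.path(), file.size, Long::sum);
      }
    }
    return usage;
  }

  /** Stand-in for a ContentSummary-style call: recursively sum the bytes under one node. */
  static long subtreeBytes(Node node) {
    if (!node.isDir) return node.size;
    long total = 0;
    for (Node child : node.children) total += subtreeBytes(child);
    return total;
  }

  /** Top-down: summarize the whole subtree of every directory at depth >= minDepth. */
  static Map<String, Long> bySubtreeSummary(List<Node> dirs, int minDepth) {
    Map<String, Long> usage = new TreeMap<>();
    for (Node dir : dirs) {
      if (dir.depth() >= minDepth) usage.put(dir.path(), subtreeBytes(dir));
    }
    return usage;
  }

  public static void main(String[] args) {
    Node root = new Node("", null, true, 0);
    Node a = new Node("a", root, true, 0);
    Node b = new Node("b", a, true, 0);
    Node c = new Node("c", b, true, 0);
    Node d = new Node("d", c, true, 0);
    Node e = new Node("e", b, true, 0);
    Node f1 = new Node("f1", c, false, 10);
    Node f2 = new Node("f2", d, false, 5);
    Node f3 = new Node("f3", e, false, 7);

    // Both strategies should print the same per-directory totals for this tree.
    System.out.println(byParent(List.of(f1, f2, f3), 3));
    System.out.println(bySubtreeSummary(List.of(root, a, b, c, d, e), 3));
  }
}
```

In this toy model the bottom-up pass touches each file once and walks a short parent chain, while the top-down pass revisits nested subtrees once per enclosing directory. Note that empty directories only show up in the top-down result. A real ContentSummary call would add locking and object allocation on top of that, which this sketch does not model.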

pjeli commented 2 years ago

I ran a version of this myself. Unfortunately, there was no speed-up to be gained: the ContentSummary approach was more than 3x slower (about 3.4x in the benchmark results below), so I will be sticking with the current implementation for now. I am attaching a patch file for anyone who wants to play around with this.

Patch: BenchmarkSuggestionsEngineDirectories.patch.txt

BenchmarkSuggestionsEngine.benchmarkDirectoriesByContentSummary  avgt  5  8671.651 ± 1276.491  ms/op
BenchmarkSuggestionsEngine.benchmarkDirectoriesByParent          avgt  5  2568.796 ±  393.260  ms/op