paypal / NNAnalytics

NameNodeAnalytics is a self-help utility for scouting and maintaining the namespace of an HDFS instance.
Apache License 2.0

Investigate multi-level grouping in Stream API #267

Closed pjeli closed 5 years ago

pjeli commented 5 years ago

Today, other than through the experimental /histogram2 API, it is not possible to group by more than one index.

NNA today assumes that all histograms have the form <String, Long>. This is the first challenge. /histogram2 works around it by presenting <String, List<Long>> style histograms, where all the "values" are still just Longs.

The second challenge is to investigate whether it is possible to group by other types and still present a valid histogram to the end user.

pjeli commented 5 years ago

I was thinking about this further. It should be possible to extend grouping to at least another 2 or 3 dimensions, if the groupings are by Strings.

It would look like this: Map<String, Map<String, Long>> for the 2-dimensional grouping case. One example would be grouping by parentDir at directory depth 3 and then by the owner of the file.

Then you would see results like:

/a/b/c,user1,1000
/a/b/c,user2,2000
/a/b/d,user1,100
/a/b/d,user2,200
etc...

I think this level of information would still be valuable to people, as it would take 1 query instead of running a separate breakdown query per user with user:eq:<username>.
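To make the Map<String, Map<String, Long>> idea concrete, here is a minimal sketch using plain java.util.stream collectors. It is not NNA's implementation: FileEntry, parentDirAtDepth, groupByDirThenOwner, toCsvLines and the sample data are all hypothetical names made up for illustration, and the sketch assumes absolute paths and a sum over file size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MultiLevelGrouping {

  /** Hypothetical stand-in for a file entry; NNA's real inode type is not used here. */
  static final class FileEntry {
    final String path;
    final String owner;
    final long size;

    FileEntry(String path, String owner, long size) {
      this.path = path;
      this.owner = owner;
      this.size = size;
    }
  }

  /** Ancestor directory at the given depth, e.g. depth 3 of /a/b/c/x/f is /a/b/c. */
  static String parentDirAtDepth(String path, int depth) {
    String[] parts = path.split("/"); // parts[0] is "" for an absolute path
    int dirCount = parts.length - 2;  // drop the leading "" and the file name
    int n = Math.min(depth, dirCount);
    StringBuilder sb = new StringBuilder();
    for (int i = 1; i <= n; i++) {
      sb.append('/').append(parts[i]);
    }
    return sb.length() == 0 ? "/" : sb.toString();
  }

  /** First level: parent directory at the given depth. Second level: owner. Value: summed size. */
  static Map<String, Map<String, Long>> groupByDirThenOwner(List<FileEntry> files, int depth) {
    return files.stream()
        .collect(
            Collectors.groupingBy(
                f -> parentDirAtDepth(f.path, depth),
                Collectors.groupingBy(f -> f.owner, Collectors.summingLong(f -> f.size))));
  }

  /** Flattens the nested histogram into sorted "dir,owner,sum" lines. */
  static List<String> toCsvLines(Map<String, Map<String, Long>> histogram) {
    List<String> lines = new ArrayList<>();
    new TreeMap<>(histogram)
        .forEach(
            (dir, byOwner) ->
                new TreeMap<>(byOwner)
                    .forEach((owner, sum) -> lines.add(dir + "," + owner + "," + sum)));
    return lines;
  }

  /** Made-up sample data chosen to reproduce the sums quoted in this thread. */
  static List<FileEntry> sampleFiles() {
    return Arrays.asList(
        new FileEntry("/a/b/c/f1", "user1", 600),
        new FileEntry("/a/b/c/x/f2", "user1", 400),
        new FileEntry("/a/b/c/f3", "user2", 2000),
        new FileEntry("/a/b/d/f4", "user1", 100),
        new FileEntry("/a/b/d/f5", "user2", 200));
  }

  public static void main(String[] args) {
    toCsvLines(groupByDirThenOwner(sampleFiles(), 3)).forEach(System.out::println);
  }
}
```

With the sample data, main prints the same four dir,owner,sum lines given as the example result earlier in this comment. The nesting only works this simply because both grouping keys are Strings, which is the restriction mentioned at the start of the comment.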

pjeli commented 5 years ago

I've got a test patch that seems to work with most filters. It's opened my eyes to how much further code reduction can be done within NNA. It seems I'll need to merge the histogram functions and filter functions, but I have it working!

pjeli commented 5 years ago

Attaching a screenshot of what multi-level grouping output would look like. This screenshot was taken with the following query using the MiniClusters test: http://localhost:4567/histogram2.html?set=files&type=user,fileType&sum=diskspaceConsumed

Screen Shot 2019-09-03 at 9 56 55 AM
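For readers without the screenshot, here is a rough sketch of how a comma-separated type parameter such as type=user,fileType could drive grouping over any number of String dimensions. This is an assumption about one possible design, not the patch itself: FileRecord, CLASSIFIERS, histogram and the sample values are hypothetical, and the sketch uses a composite List<String> key instead of nested maps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TypeParamGrouping {

  /** Hypothetical file record; the field names only mirror the query parameters. */
  static final class FileRecord {
    final String user;
    final String fileType;
    final long diskspaceConsumed;

    FileRecord(String user, String fileType, long diskspaceConsumed) {
      this.user = user;
      this.fileType = fileType;
      this.diskspaceConsumed = diskspaceConsumed;
    }
  }

  /** Hypothetical registry from "type" names to String-valued classifiers. */
  static final Map<String, Function<FileRecord, String>> CLASSIFIERS = new LinkedHashMap<>();

  static {
    CLASSIFIERS.put("user", r -> r.user);
    CLASSIFIERS.put("fileType", r -> r.fileType);
  }

  /** Groups by every name in typeParam (comma-separated) and sums diskspaceConsumed per key. */
  static Map<List<String>, Long> histogram(List<FileRecord> files, String typeParam) {
    List<Function<FileRecord, String>> fns = new ArrayList<>();
    for (String name : typeParam.split(",")) {
      Function<FileRecord, String> fn = CLASSIFIERS.get(name.trim());
      if (fn == null) {
        throw new IllegalArgumentException("Unknown type: " + name);
      }
      fns.add(fn);
    }
    // Insertion-ordered so rows appear in first-seen order.
    Map<List<String>, Long> result = new LinkedHashMap<>();
    for (FileRecord f : files) {
      // The composite key holds one String per requested grouping dimension.
      List<String> key = fns.stream().map(fn -> fn.apply(f)).collect(Collectors.toList());
      result.merge(key, f.diskspaceConsumed, Long::sum);
    }
    return result;
  }

  /** Made-up sample data. */
  static List<FileRecord> sample() {
    return Arrays.asList(
        new FileRecord("user1", "log", 100),
        new FileRecord("user1", "log", 200),
        new FileRecord("user1", "csv", 50),
        new FileRecord("user2", "log", 10));
  }

  public static void main(String[] args) {
    histogram(sample(), "user,fileType")
        .forEach((key, sum) -> System.out.println(String.join(",", key) + "," + sum));
  }
}
```

A composite key keeps the value type a plain Long regardless of how many dimensions are requested, so one code path can serve 1, 2 or 3 groupings. The trade-off is that the result is flat rather than nested, so any per-dimension presentation would have to be rebuilt on the client side.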