paypal / NNAnalytics

NameNodeAnalytics is a self-help utility for scouting and maintaining the namespace of an HDFS instance.
Apache License 2.0

Provide metrics for Hive tables and partitions #242

Open pjeli opened 5 years ago

pjeli commented 5 years ago

Assuming a valid hive-site.xml, it will be possible to determine the active Hive warehouse HDFS directory and the HiveServer2 and Metastore URIs.

From there we should be able to perform a directory analysis on the hive warehouse parent directory and then all HDFS locations that represent tables / partitions.
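As a rough sketch of that first step (the property names hive.metastore.warehouse.dir and hive.metastore.uris are the standard Hive config keys; the hive-site.xml path and default value below are just placeholders), something like this should be enough:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class HiveSiteProbe {
  public static void main(String[] args) {
    // Load hive-site.xml as a plain Hadoop Configuration resource.
    // The path is a placeholder; in practice it would come from HIVE_CONF_DIR or NNA config.
    Configuration hiveConf = new Configuration(false);
    hiveConf.addResource(new Path("/etc/hive/conf/hive-site.xml"));

    // Standard Hive properties for the warehouse directory and Metastore URIs.
    String warehouseDir = hiveConf.get("hive.metastore.warehouse.dir", "/user/hive/warehouse");
    String metastoreUris = hiveConf.get("hive.metastore.uris", "");

    System.out.println("Warehouse dir:  " + warehouseDir);
    System.out.println("Metastore URIs: " + metastoreUris);
  }
}
```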

kunalmulwani commented 5 years ago

What are the different metrics?

pjeli commented 5 years ago

Hey @kunalmulwani !

Apologies in advance, as Hive is one of the applications I still struggle with.

My current thinking is that it should be pretty easy to count the number of internal tables and do some sorting based on simple total file counts and diskspace consumed for each table (as each table will be represented as just a directory). From there we can then do age analysis and sort internal tables by the last access / mod time.

That way folks can get an idea for which internal tables should be considered for possible HAR archival or deletion.
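To make the ranking I have in mind concrete, here is a rough sketch (the stats class, field names, and age threshold are hypothetical, not existing NNA types) of sorting stale internal-table directories by diskspace consumed:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class ArchivalCandidates {

  // Hypothetical per-table stats, e.g. aggregated from a directory scan of the warehouse.
  static final class TableDirStats {
    final String path;
    final long fileCount;
    final long diskspaceConsumed;
    final long lastModTimeMs;

    TableDirStats(String path, long fileCount, long diskspaceConsumed, long lastModTimeMs) {
      this.path = path;
      this.fileCount = fileCount;
      this.diskspaceConsumed = diskspaceConsumed;
      this.lastModTimeMs = lastModTimeMs;
    }
  }

  // Tables not modified within maxAgeMs, largest space consumers first,
  // as candidates for HAR archival or deletion.
  static List<TableDirStats> rank(List<TableDirStats> tables, long nowMs, long maxAgeMs) {
    return tables.stream()
        .filter(t -> nowMs - t.lastModTimeMs > maxAgeMs)
        .sorted(Comparator.comparingLong((TableDirStats t) -> t.diskspaceConsumed).reversed())
        .collect(Collectors.toList());
  }
}
```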

The above wasn't really possible before because of how long directory histograms used to take on large NNA instances. But ever since #224 was done I've felt comfortable with extending the analysis that NNA can perform into Hive, HBase, etc.

Now we can extend the same idea to external tables but that will require some communication with the Metastore and the SQL database backing it -- so I'd like to focus on that part later as it will be harder.

Make sense?

kunalmulwani commented 5 years ago

I understand what you're saying. I would like to work on this, but I might need help to achieve it.

pjeli commented 5 years ago

Sure @kunalmulwani !

I think all that is required is:

1. Parse the hive-site.xml and determine where the warehouse directory is.
2. Within the SuggestionsEngine, use the QueryEngine to get all directories directly underneath the warehouse directory, something like: http://SERVER:PORT/filter?set=dirs&filters=path:startsWith:/warehouse/dir/
3. Each of these directories should be an internal Hive database directory. Underneath that is a directory for each table in the database, so we should be able to get file count and diskspace consumed per DB and per table (again, for internal ones only). See the sketch below.
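Here is a rough sketch of steps (2) and (3) against the /filter endpoint. The host/port, warehouse path, and table path are placeholders, and the sum=diskspaceConsumed roll-up in step (3) is my assumption based on NNA's other /filter query examples (worth double-checking against the REST docs):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WarehouseQuerySketch {

  public static void main(String[] args) throws Exception {
    // Placeholders; substitute the real NNA endpoint and warehouse directory.
    String nnaBase = "http://SERVER:PORT";
    String warehouseDir = "/warehouse/dir/";

    // Step (2): every directory directly under the warehouse root should be a database dir.
    String dbDirs = get(nnaBase + "/filter?set=dirs&filters=path:startsWith:" + warehouseDir);
    System.out.println(dbDirs);

    // Step (3), per table directory: total diskspace consumed by the files underneath it.
    // The sum parameter is an assumption; the table path is hypothetical.
    String tableDir = warehouseDir + "somedb.db/sometable/";
    String tableBytes = get(nnaBase + "/filter?set=files&filters=path:startsWith:" + tableDir
        + "&sum=diskspaceConsumed");
    System.out.println(tableBytes);
  }

  private static String get(String url) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod("GET");
    StringBuilder body = new StringBuilder();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        body.append(line).append('\n');
      }
    }
    return body.toString();
  }
}
```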

PoojaShekhar commented 4 years ago

We can also parse the Hive Metastore logs to get the stats numFiles, uncompressed data size (rawDataSize), compressed data size (totalSize), numPartitions, and the table's HDFS location for each table, given that stats are collected every time an existing table is updated or a new table is created. The auto-gather setting needs to be turned on. Alternatively, we can script stats collection later using an ANALYZE query.
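If we go the scripted route, the relevant Hive setting is hive.stats.autogather, and a one-off collection can be done with an ANALYZE statement over JDBC. A minimal sketch (the HiveServer2 URL, credentials, and table name are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AnalyzeStatsSketch {
  public static void main(String[] args) throws Exception {
    // Register the Hive JDBC driver (newer JDBC versions auto-load drivers, so this may be optional).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Placeholder HiveServer2 connection details.
    String jdbcUrl = "jdbc:hive2://HIVESERVER2_HOST:10000/default";

    try (Connection conn = DriverManager.getConnection(jdbcUrl, "nna", "");
         Statement stmt = conn.createStatement()) {
      // Recompute basic table stats (numFiles, totalSize, rawDataSize, numRows) for one table.
      // The table name is hypothetical.
      stmt.execute("ANALYZE TABLE somedb.sometable COMPUTE STATISTICS");
    }
  }
}
```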

pjeli commented 2 years ago

I think it's finally time to revisit this. Sorry for getting back to you so late @PoojaShekhar. I think if you want to parse Hive MetaStore logs you can do so - but we should avoid doing it as part of NNA. NNA is, ideally, an isolated system, and we should keep the number of other systems it has to talk to as small as possible. In this case, we have all the metadata necessary to figure everything out within the NameNode's memory, so I would rather exploit that here. By all means though, if you wish to parse MetaStore logs go ahead - but I would not like to have NNA be the driver for that. Kind of the same thing for the HBase side too.

Bit of a neat thing here - once the Hive stats are obtained we can compare against the rest of the cluster and say what % of data is (managed) Hive tables, HBase tables, etc.