pndaproject / platform-data-mgmnt

PNDA components that manage aspects of datasets

hdfs-cleaner for non-pnda #34

Open Raboo opened 6 years ago

Raboo commented 6 years ago

Hi,

Is it possible to get hdfs-cleaner working/packaged with a non-PNDA HDFS? We're running HDP and currently clean up old files through an NFS mount (via the HDFS NFS Gateway), using find to delete them. It's slow and buggy, and I really haven't found any good solution for deleting old files in HDFS. I would like to give your cleaner a try.

jeclarke commented 6 years ago

Hi @Raboo I believe this will work fine with generic HDFS.

Looking at the code, the part where it tries to clean up PNDA datasets should be skipped over if there is no dataset table available in HBase to read, and that's the only PNDA specific bit.

If you aren't using CDH or HDP as the Hadoop distro you will have to do a bit of work, as it wants to connect to either Cloudera Manager (CDH) or Ambari (HDP) to discover the endpoints to use. If you don't have either of those, you would have to fill out some other implementation of endpoint discovery in endpoint.py, such as supplying the values directly in the config file.
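A minimal sketch of what a "direct from config" endpoint discovery fallback could look like. The function name, config keys, and the two-endpoint shape are all assumptions for illustration; they are not the project's actual API in endpoint.py.

```python
# Hypothetical fallback for endpoint discovery: take the endpoints straight
# from the config file instead of querying Cloudera Manager or Ambari.
# Key names ("webhdfs_url", "yarn_rm_url") are illustrative assumptions.

def discover_endpoints(config):
    """Return service endpoints supplied directly in the config,
    skipping manager-based (CM/Ambari) discovery entirely."""
    wanted = ("webhdfs_url", "yarn_rm_url")
    direct = {k: config[k] for k in wanted if k in config}
    if len(direct) == len(wanted):
        return direct  # everything supplied explicitly; no manager needed
    missing = [k for k in wanted if k not in direct]
    raise ValueError("no CM/Ambari configured and endpoints missing: %s" % missing)
```

With this kind of fallback, running against a bare HDFS cluster only requires putting the WebHDFS and YARN ResourceManager URLs in the config file.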

There are a few different categories of file in HDFS it cleans up:

- spark_streaming_dirs_to_clean - checks that the files do not correspond to currently running yarn jobs before deleting them
- general_dirs_to_clean - deletes all files from here
- old_dirs_to_clean - deletes files from here if the last modified time is older than a certain age
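The three rules above can be sketched as a single selection function. This is an illustrative reconstruction of the logic described, not the cleaner's real code; the record shape (`path`, `mtime`, `category`, `app_id`) and function name are assumptions.

```python
import time

# Sketch of the three cleanup rules: yarn-aware for spark streaming
# checkpoints, unconditional for general dirs, age-based for old dirs.

def select_for_deletion(files, running_app_ids, max_age_secs, now=None):
    """files: iterable of dicts with 'path', 'mtime' (epoch seconds),
    'category', and optionally 'app_id' for spark streaming dirs.
    Returns the list of paths that are safe to delete."""
    if now is None:
        now = time.time()
    doomed = []
    for f in files:
        cat = f["category"]
        if cat == "spark_streaming_dirs_to_clean":
            # only delete if the owning yarn job is no longer running
            if f.get("app_id") not in running_app_ids:
                doomed.append(f["path"])
        elif cat == "general_dirs_to_clean":
            # everything here is fair game
            doomed.append(f["path"])
        elif cat == "old_dirs_to_clean":
            # delete only files past the configured age
            if now - f["mtime"] > max_age_secs:
                doomed.append(f["path"])
    return doomed
```

So the categories differ in the *condition* under which a file is deleted, even though all three are directories being swept.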

Let me know how you get on, and do submit a patch if you manage to extend it in a useful way.

Thanks, James

Raboo commented 6 years ago

We are using HDP. Aren't spark_streaming_dirs_to_clean, general_dirs_to_clean, and old_dirs_to_clean all just the same thing, i.e. folders where you look for old files that you can delete?

Or is this spark_streaming_dirs referring to spark history server?

Do you need to fill all fields in the properties file?
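For reference, a config supplying only the directory lists discussed in this thread might look something like the snippet below. This is an illustrative sketch, not the project's actual properties file; only the three key names come from the discussion above, and the paths and JSON layout are assumptions.

```json
{
  "spark_streaming_dirs_to_clean": ["/user/spark/checkpoints"],
  "general_dirs_to_clean": ["/tmp/staging"],
  "old_dirs_to_clean": ["/data/landing"]
}
```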