prafulbhise / BigdataExpert


handling small files in hadoop #4

Closed prafulbhise closed 5 years ago

prafulbhise commented 5 years ago

Small files are not good for Hadoop: each file adds metadata overhead on the NameNode, and when we have tonnes of small files the data-cleansing process pays that overhead on every one of them.
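A quick way to gauge the scale of the problem is to list a directory recursively and count the files below some size threshold. This is just a sketch: the directory path is hypothetical and the 1 MB threshold is arbitrary, so adjust both to your layout.

# Count files smaller than 1 MB under a hypothetical input directory
# ($1 is the permissions column, so skip directories; $5 is the file size in bytes)
hdfs dfs -ls -R /datalake-prod/tmp/gcss/data/table_name \
  | awk '$1 !~ /^d/ && $5 < 1048576 { n++ } END { print n+0, "small files" }'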

prafulbhise commented 5 years ago

Use Hadoop Streaming to compact the files: run an identity map/reduce job (both mapper and reducer are cat) that reads the many small input files and rewrites them as a handful of larger output files.

/usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx1024m \
  -Djava.net.preferIPv4Stack=true \
  -Dhdp.version=2.6.5.3003-25 \
  -Dhadoop.log.dir=/var/log/hadoop/sshuser \
  -Dhadoop.log.file=hadoop.log \
  -Dhadoop.home.dir=/usr/hdp/2.6.5.3003-25/hadoop \
  -Dhadoop.id.str=sshuser \
  -Dhadoop.root.logger=INFO,console \
  -Djava.library.path=:/usr/hdp/2.6.5.3003-25/hadoop/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.6.5.3003-25/hadoop/lib/native \
  -Dhadoop.policy.file=hadoop-policy.xml \
  -Djava.net.preferIPv4Stack=true -Xmx1024m \
  -Dhadoop.security.logger=INFO,NullAppender \
  org.apache.hadoop.util.RunJar /usr/hdp/2.6.5.3003-25/hadoop-mapreduce/hadoop-streaming.jar \
  -D mapred.reduce.tasks=10 \
  -input adl://.azuredatalakestore.net/datalake-prod/tmp/gcss/data/table_name/date_part=2018-09-15 \
  -output adl://.azuredatalakestore.net/datalake-prod/raw/gcss/data/table_name/date_part=2018-09-15 \
  -mapper cat \
  -reducer cat
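The long JVM invocation above is what the hadoop wrapper script expands to on this HDP cluster. A minimal equivalent sketch, assuming the HDP client scripts are on the PATH and using the same paths as above (note the ADLS account name is elided in the adl:// URIs as given):

hadoop jar /usr/hdp/2.6.5.3003-25/hadoop-mapreduce/hadoop-streaming.jar \
  -D mapred.reduce.tasks=10 \
  -input adl://.azuredatalakestore.net/datalake-prod/tmp/gcss/data/table_name/date_part=2018-09-15 \
  -output adl://.azuredatalakestore.net/datalake-prod/raw/gcss/data/table_name/date_part=2018-09-15 \
  -mapper cat \
  -reducer cat

Since mapper and reducer are both cat, records pass through unchanged; mapred.reduce.tasks=10 caps the output at 10 part files, so thousands of small input files are compacted into at most 10 larger ones.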