twitter / hadoop-lzo

Refactored version of code.google.com/hadoop-gpl-compression for hadoop 0.20
GNU General Public License v3.0

Compression Level is ignored. #142

Open wilcoln opened 4 years ago

wilcoln commented 4 years ago

I want to compress a file already stored in HDFS at different compression levels. To do so, I wrote the following program:

Compress.java

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.hadoop.compression.lzo.LzoCodec;

public class Compress {

  // Re-emits each input line as an output key, discarding the byte-offset keys.
  public static class VoidReducer extends Reducer<LongWritable, Text, Text, Text> {

    @Override
    public void reduce(LongWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values)
        context.write(value, new Text(""));
    }
  }

  public static void main(String[] args) throws Exception {

    Configuration conf = new Configuration();
    int level = Integer.parseInt(args[2]);
    conf.setInt("io.compression.codec.lzo.compression.level", level);

    Job job = Job.getInstance(conf);
    job.setJobName("Compressor Job");
    job.setJarByClass(Compress.class);
    job.setMapperClass(Mapper.class); // identity mapper
    job.setReducerClass(VoidReducer.class);
    job.setNumReduceTasks(1);

    TextInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, LzoCodec.class);

    // submit and wait for completion
    job.waitForCompletion(true);
  }
}

Then I run the following commands:

$ javac -classpath $(hadoop classpath) *.java
$ jar -cvf Compress.jar Compress*.class   # include the nested Compress$VoidReducer class too
$ hadoop jar Compress.jar Compress file.txt test1 1
$ hadoop jar Compress.jar Compress file.txt test7 7

The file file.txt is 1 GB in size. When I then check the sizes of test1 and test7 with
hdfs dfs -du -s -h, I get 594.6 M for each, so the compression level is evidently being ignored.
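To take MapReduce out of the picture entirely, here is a minimal standalone check (a sketch: LevelCheck is an illustrative name, it assumes the native LZO library is loadable on the local machine, and the commented-out strategy key is an unverified guess at hadoop-lzo's configuration):

import java.io.ByteArrayOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

import com.hadoop.compression.lzo.LzoCodec;

public class LevelCheck {

  // Compress the same highly compressible buffer at the given level and
  // return the size of the compressed output.
  static long compressedSize(int level) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("io.compression.codec.lzo.compression.level", level);
    // If the default LZO strategy is single-level, setting a level may be a
    // legitimate no-op; the key below is an unverified guess, left disabled:
    // conf.set("io.compression.codec.lzo.compressor", "LZO1X_999");

    // ReflectionUtils injects the Configuration into the codec.
    LzoCodec codec = ReflectionUtils.newInstance(LzoCodec.class, conf);

    byte[] chunk = "the quick brown fox jumps over the lazy dog\n".getBytes("UTF-8");
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    CompressionOutputStream out = codec.createOutputStream(sink);
    for (int i = 0; i < 1000000; i++)
      out.write(chunk);
    out.close();
    return sink.size();
  }

  public static void main(String[] args) throws Exception {
    System.out.println("level 1: " + compressedSize(1) + " bytes");
    System.out.println("level 7: " + compressedSize(7) + " bytes");
  }
}

If the two sizes match here as well, the level never reaches the native compressor (or the active strategy has no level knob); if they differ, the problem is in how the job configuration travels to the tasks.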

toddlipcon commented 4 years ago

Your code looks fine at first glance. I'm not actively maintaining this project anymore -- it's largely in maintenance mode as most people have moved on to using better file formats like Parquet along with LZ4 or Snappy. I'd suggest doing some debugging of your own -- rebuild hadoop-lzo with logging at the point where the compressor is created and see if it's getting passed through properly, and follow the breadcrumbs from there.
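One way to get that logging without rebuilding the whole library would be to subclass the codec. A sketch, assuming LzoCodec is non-final and exposes its Configuration via getConf() (LoggingLzoCodec is an illustrative name):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.compress.Compressor;

import com.hadoop.compression.lzo.LzoCodec;

// Drop-in replacement for LzoCodec that logs the compression level it sees
// whenever a compressor is created for a task.
public class LoggingLzoCodec extends LzoCodec {

  private static final Log LOG = LogFactory.getLog(LoggingLzoCodec.class);

  @Override
  public Compressor createCompressor() {
    // -1 means the key never made it into the Configuration this codec received.
    int level = getConf().getInt("io.compression.codec.lzo.compression.level", -1);
    LOG.info("createCompressor: io.compression.codec.lzo.compression.level = " + level);
    return super.createCompressor();
  }
}

Point the driver at it with FileOutputFormat.setOutputCompressorClass(job, LoggingLzoCodec.class), rerun, and grep the reduce task logs for that line; it tells you whether the level survives the trip into the task before it ever reaches native code.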

wilcoln commented 4 years ago

Ok thanks