powturbo / TurboBench

Compression Benchmark
Help artifacts compress test #18

Closed fabianomenegidio closed 4 years ago

fabianomenegidio commented 4 years ago

I would like to know whether your tool could be used to evaluate compression metrics for genomic files, such as compression ratio, speed, and especially data loss.

Could I adapt your script to parse specific genomics compression tools?

Thanks

powturbo commented 4 years ago

TurboBench is a benchmark for general-purpose compressors. In general, it can also be used to benchmark other functions or tools, whether or not they are related to compression. It also has an interface (not pushed to GitHub) for calling arbitrary turbobench functions as preprocessing steps before the main turbobench function, for example calling delta or transpose and then an LZ77 compressor. It is then invoked as "turbobench -Epre1,pre2 -ecompress".
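To illustrate the idea of a preprocessing step followed by a compressor, here is a minimal Python sketch. It is not TurboBench code: the function names are made up for this example, the delta filter is a plain byte-wise delta, and zlib stands in for "an LZ77 compressor".

```python
import zlib


def delta_encode(data: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte (mod 256)."""
    out = bytearray(len(data))
    prev = 0
    for i, b in enumerate(data):
        out[i] = (b - prev) & 0xFF
        prev = b
    return bytes(out)


def delta_decode(data: bytes) -> bytes:
    """Invert delta_encode by accumulating the differences (mod 256)."""
    out = bytearray(len(data))
    prev = 0
    for i, d in enumerate(data):
        prev = (prev + d) & 0xFF
        out[i] = prev
    return bytes(out)


def compress_with_delta(data: bytes) -> bytes:
    """Preprocess with delta, then hand the result to the main compressor."""
    return zlib.compress(delta_encode(data), 9)


def decompress_with_delta(blob: bytes) -> bytes:
    """Undo the steps in reverse order: decompress first, then undo delta."""
    return delta_decode(zlib.decompress(blob))


if __name__ == "__main__":
    # A slowly rising ramp, the kind of input a delta filter is meant for.
    sample = bytes((i // 4) & 0xFF for i in range(4096))
    plain = zlib.compress(sample, 9)
    filtered = compress_with_delta(sample)
    assert decompress_with_delta(filtered) == sample
    print(f"zlib only: {len(plain)} bytes, delta + zlib: {len(filtered)} bytes")
```

Whether the filter helps depends entirely on the data, so both sizes are printed rather than assumed.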

You can look at the MY_CODEC example in plugins.cc to see how to implement your own turbobench functions. You must also add your functions to the makefile. It would be clearer if you could give an example of what you want to do.

fabianomenegidio commented 4 years ago

I work with different bioinformatics files in area-specific formats, such as FASTQ (https://en.wikipedia.org/wiki/FASTQ_format).

I would like to do an exploratory analysis of how different compressors handle these files, among other types, and which would be the best options for use in a big data world.

In addition, we have tools developed specifically for the area, such as MZPAQ (https://link.springer.com/article/10.1186/s13029-019-0073-5). I would like to compare a list of tools created for the industry with classic tools, since we continue to use *.gz rather than other tools.

Thanks for your attention.

powturbo commented 4 years ago

For your use case, it is possible to extend the "multiblock mode" in turbobench. The current format is: length1,block1,length2,block2,...,lengthn,blockn. Currently, turbobench calls the same codec for every block. If we also save the codec id in the stream buffer, the format becomes: length1,id1,block1,length2,id2,block2,...,lengthn,idn,blockn. You can preprocess the input stream just before calling "becomp" to generate the new multiblock format. You must then modify "becomp" to extract each block's codec id and call "codcomp" with that id.
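The length,id,block layout described above can be sketched in Python as follows. The field widths are assumptions made for this example (a 4-byte little-endian length and a 1-byte codec id); the thread does not say how TurboBench encodes these fields, and the function names are hypothetical.

```python
import struct

# Assumed header layout: 4-byte little-endian block length, 1-byte codec id.
HEADER = struct.Struct("<IB")


def pack_multiblock(blocks):
    """Serialize (codec_id, payload) pairs as length,id,block,length,id,block,..."""
    out = bytearray()
    for codec_id, payload in blocks:
        out += HEADER.pack(len(payload), codec_id)
        out += payload
    return bytes(out)


def iter_multiblock(stream):
    """Yield (codec_id, payload) for each block, checking for truncation."""
    pos = 0
    while pos < len(stream):
        if pos + HEADER.size > len(stream):
            raise ValueError("truncated block header")
        length, codec_id = HEADER.unpack_from(stream, pos)
        pos += HEADER.size
        if pos + length > len(stream):
            raise ValueError("truncated block payload")
        yield codec_id, stream[pos:pos + length]
        pos += length


if __name__ == "__main__":
    stream = pack_multiblock([(1, b"ACGTACGT"), (2, b"IIIIIIII")])
    for codec_id, payload in iter_multiblock(stream):
        print(codec_id, payload)
```

A reader of this stream can pick the codec per block from the id, which is the role the modified "becomp" would play.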

It is also possible to generate a multiblock file outside of turbobench. In this case you only need to modify becomp and bedecomp to compress/decompress each block with a different codec, e.g. first block with zlib, second block with lzma, ...
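As a rough illustration of compressing each block with a different codec, the sketch below uses Python's standard zlib and lzma modules. The codec ids (1 and 2), the header layout, and the function names are assumptions for this example; they are not TurboBench's actual ids or the real becomp/bedecomp code.

```python
import lzma
import struct
import zlib

# Hypothetical codec table: id -> (compress, decompress).
CODECS = {
    1: (zlib.compress, zlib.decompress),
    2: (lzma.compress, lzma.decompress),
}

# Assumed header: 4-byte little-endian compressed length, 1-byte codec id.
HEADER = struct.Struct("<IB")


def compress_blocks(blocks):
    """Compress each (codec_id, raw_bytes) pair with the codec named by its id."""
    out = bytearray()
    for codec_id, raw in blocks:
        compress, _ = CODECS[codec_id]
        payload = compress(raw)
        out += HEADER.pack(len(payload), codec_id) + payload
    return bytes(out)


def decompress_blocks(stream):
    """Read each header, look up the codec by id, and decompress the payload."""
    blocks = []
    pos = 0
    while pos < len(stream):
        length, codec_id = HEADER.unpack_from(stream, pos)
        pos += HEADER.size
        _, decompress = CODECS[codec_id]
        blocks.append(decompress(stream[pos:pos + length]))
        pos += length
    return blocks


if __name__ == "__main__":
    # For FASTQ-like data one might route sequence lines and quality lines
    # to different codecs; here the split is done by hand.
    sequences = b"ACGTTGCA" * 200
    qualities = b"IIIIHHGG" * 200
    stream = compress_blocks([(1, sequences), (2, qualities)])
    assert decompress_blocks(stream) == [sequences, qualities]
    print(f"{len(sequences) + len(qualities)} bytes -> {len(stream)} bytes")
```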

Keep in mind that turbobench is currently an "in-memory" benchmark with a maximum file length of 1-2 GB.
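For a sense of what an in-memory benchmark measures (compression ratio, speed, and whether the round trip is lossless), here is a small Python sketch. It is not TurboBench: the FASTQ-like input is synthetic, the helper names are hypothetical, and the compressors are simply the ones in Python's standard library.

```python
import bz2
import lzma
import random
import time
import zlib


def make_fastq(n_records, read_len=100, seed=0):
    """Build synthetic FASTQ-like text: @id, sequence, '+', quality string."""
    rng = random.Random(seed)
    records = []
    for i in range(n_records):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FGHIJ") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


COMPRESSORS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data):
    """Compress and decompress entirely in memory, timing each direction."""
    results = {}
    for name, (compress, decompress) in COMPRESSORS.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        results[name] = {
            "ratio": len(data) / len(packed),
            "comp_mb_s": len(data) / 1e6 / max(t1 - t0, 1e-9),
            "decomp_mb_s": len(data) / 1e6 / max(t2 - t1, 1e-9),
            # A lossless codec must reproduce the input byte for byte.
            "lossless": restored == data,
        }
    return results


if __name__ == "__main__":
    data = make_fastq(2000)
    for name, r in benchmark(data).items():
        print(f"{name:5s} ratio={r['ratio']:.2f} "
              f"comp={r['comp_mb_s']:.1f} MB/s "
              f"decomp={r['decomp_mb_s']:.1f} MB/s lossless={r['lossless']}")
```

Single-run timings like these are noisy; a real benchmark would repeat each measurement and report the best or median time.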

Is the MZPAQ source available?