magicDGS / ReadTools

A Universal Toolkit for Handling Sequence Data from Different Sequencing Platforms
https://magicdgs.github.io/ReadTools/
MIT License

4mc support for distmap upload #403

Closed - robmaz closed this issue 6 years ago

robmaz commented 6 years ago

What would it take to support the splittable 4mc compression format for up- and downloading? It would be much faster than bzip2, and although the compression ratio is much worse, maybe it is an acceptable compromise. (The basic implementation would be to pipe the uncompressed output through the command-line utility, but since this is a Java library, there is probably a more Java-natural way to do it.)

https://github.com/carlomedas/4mc
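
For illustration, a minimal sketch of that basic implementation: shelling out to the 4mc command-line tool from Java. This is not ReadTools code; it assumes the `4mc` binary is on the PATH and uses a placeholder file name, mirroring the `./4mc <file>` invocation used in the timings later in this thread.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FourMcShellOut {

    /**
     * Compresses an already-written uncompressed file by invoking the 4mc
     * command-line utility, which writes "<input>.4mc".
     */
    public static Path compressWith4mc(final Path uncompressed)
            throws IOException, InterruptedException {
        final Process process = new ProcessBuilder("4mc", uncompressed.toString())
                .inheritIO()
                .start();
        final int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("4mc exited with code " + exitCode);
        }
        return Paths.get(uncompressed + ".4mc");
    }

    public static void main(final String[] args) throws Exception {
        // placeholder: point this at a real uncompressed FASTQ to try it
        System.out.println(compressWith4mc(Paths.get("test.fq")));
    }
}
```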

magicDGS commented 6 years ago

I guess you are referring to the distmap pipeline (upload). I changed the title to make that clear.

magicDGS commented 6 years ago

@robmaz - after investigating the 4mc compression a bit, it looks like the Java library is not released to Maven Central or to any other repository that a build system could pull it from.

Although I could use the https://jitpack.io/ automatic artifact build, are you sure that the library is under active development? The latest release (2.0.0) was on Sep 11, 2016, and although there are recent commits, it does not look like releases will come on a regular basis.

The time needed to support automatic compression with an already implemented compressor is minimal: just handling another extension in a simple function. If some configuration is required it would take a bit longer, but I guess that would only be necessary for the distmap uploader, no?
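
To make "handling another extension in a simple function" concrete, here is a purely hypothetical sketch (not the actual ReadTools code, and assuming commons-compress for bzip2); the `.4mc` branch is left unimplemented because the compressor is not bundled.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

public class OutputCompressionByExtension {

    /** Hypothetical helper: wrap a raw stream based on the output file extension. */
    public static OutputStream wrapByExtension(final String fileName, final OutputStream raw)
            throws IOException {
        if (fileName.endsWith(".gz")) {
            return new GZIPOutputStream(raw);
        } else if (fileName.endsWith(".bz2")) {
            return new BZip2CompressorOutputStream(raw);
        } else if (fileName.endsWith(".4mc")) {
            // a 4mc-capable stream (e.g. from the hadoop-4mc jar) would be plugged in here
            throw new UnsupportedOperationException("4mc compressor not bundled");
        }
        return raw; // no known extension: leave uncompressed
    }
}
```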

Anyway, let me know if this is something just for testing the compressor or something to implement for production. In the first case, I can do a PR and compile an unreleased copy of ReadTools to test things; if it works, then I can implement tests and do a point release.

robmaz commented 6 years ago

Well, there is a high chance that this was someone's PhD project or something and is left as-is. But I think it is just a very thin layer over the actual lz4 frame format

https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md

so maybe it can be kept alive in the future. For the present, anyway, it seems to work as-is both on the command line and for Hadoop 2.7.x, which we will probably be using for the foreseeable future.

I was thinking of making it part of the transition to Hadoop 2.7.5 and the new distmap that I have been itching to do for some time now. I think replacing the bzip2 compression on upload will have a huge impact on upload times, which is one of the main complaints from users. The alternative to 4mc is basically turning compression off entirely for now.

It could also be used for download, i.e., the mappers would generate parts in .sam.4mc format (or rather, Hadoop would compress the mapper output that way), which you would have to handle.

magicDGS commented 6 years ago

What about using the Hadoop-BAM compressor for block-compressed files (bgzip)? It is a nice standard for compression in bioinformatics (tabix uses it, and BAM is compressed that way).

The BGZF format produced by bgzip is based on GZIP and is implemented in Hadoop-BAM as a SplittableCompressionCodec. I think this alternative might be worth trying before going to the 4mc compressor.
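
As a quick illustration of the BGZF option outside of Hadoop, a minimal sketch using htsjdk's BlockCompressedOutputStream (assuming htsjdk is on the classpath; file names are placeholders, and this is not how ReadTools wires it up):

```java
import htsjdk.samtools.util.BlockCompressedOutputStream;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BgzfWriteExample {

    public static void main(final String[] args) throws IOException {
        // placeholder paths: plain FASTQ in, BGZF-compressed FASTQ out
        final File output = new File("test.fq.gz");

        // BGZF is a series of bounded-size gzip blocks, so splittable readers can
        // start at block boundaries while plain gunzip can still read the file.
        try (BlockCompressedOutputStream out = new BlockCompressedOutputStream(output)) {
            Files.copy(Paths.get("test.fq"), out);
        }
    }
}
```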

Regarding the compression of the parts, that is a more difficult topic: it would require a major refactoring of the downloader, and each part in BAM format is already compressed, so an extra compression layer would decrease performance even more...

robmaz commented 6 years ago

For a 413MB fastq:

```
[vetlinux02@i122pas Testing]$ time bzip2 test1.fq

real    1m2.877s
user    1m0.452s
sys     0m0.812s

[vetlinux02@i122pas Testing]$ time bgzip test2.fq

real    0m52.598s
user    0m48.787s
sys     0m1.297s

[vetlinux02@i122pas Testing]$ time ./4mc test3.fq

real    0m3.307s
user    0m2.438s
sys     0m0.869s

[vetlinux02@i122pas Testing]$ ls -lh test?.fq*
-rw-r--r--. 1 vetlinux02 users 100M Feb 14 15:46 test1.fq.bz2
-rw-r--r--. 1 vetlinux02 users 139M Feb 14 15:48 test2.fq.gz
-rw-r--r--. 1 vetlinux02 users 251M Feb 14 15:50 test3.fq.4mc
```

It's 2-3 times worse at compressing, but ~20 times faster.

Not sure I understand the second part of the last sentence. I'm pretty sure the current format of the parts is .sam.gz?

robmaz commented 6 years ago

To put these numbers in perspective: 4mc compresses at about the same rate as hadoop fs can push data over the network, so it is basically free compression, even if it is not the best.

magicDGS commented 6 years ago

I am not sure if that will be the same in the distmap pipeline for several reasons:

On the other hand, I realized that the library cannot be used unless I add the jar file to our repository. I always try to avoid that kind of dependency management because it is error-prone. Nevertheless, I can do a PR (without merging it) to check whether the performance improvement is also reflected in the upload. In the meantime, I'll wait for the author's response about the possibility of releasing to Maven Central (https://github.com/carlomedas/4mc/issues/33).

Other splittable options are (found in https://blogs.oracle.com/datawarehousing/hadoop-compression-choosing-compression-codec-part2 and https://www.cloudera.com/documentation/enterprise/5-3-x/topics/admin_data_compression_performance.html):

magicDGS commented 6 years ago

OK, maybe this will be easier than I thought. The Hadoop library provides a way to handle compression based on the file extension, using the installed codec providers (I think they should be set in the configuration file).

If this works, then you can add these providers and test different compression algorithms, even if the provider is not bundled. But that means the 4mc jar file has to be on the classpath. I will do the PR anyway, because it is needed to support other compressors... we can try tomorrow evening whether that works for you, and that way HDFS compression is independent of ReadTools and depends only on the final user!
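
A minimal sketch of that extension-based handling with the Hadoop API (assuming hadoop-common on the classpath; the exact codec class names to register for 4mc should be taken from the hadoop-4mc documentation, so they are only hinted at here):

```java
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class HadoopExtensionCompression {

    /**
     * Wraps a raw stream with whatever codec Hadoop associates with the file
     * extension (.gz, .bz2, ..., or .4mc if the hadoop-4mc jar is on the
     * classpath and its codec class is listed in "io.compression.codecs").
     * Returns the raw stream unchanged if no codec matches.
     */
    public static OutputStream maybeCompress(final Configuration conf, final Path file,
            final OutputStream raw) throws IOException {
        final CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        final CompressionCodec codec = factory.getCodec(file);
        return codec == null ? raw : codec.createOutputStream(raw);
    }
}
```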

robmaz commented 6 years ago

That is also how I understood it. HDFS should handle it on the fly if it finds the codec on the classpath, it is set in the HDFS configuration, and you tell it to.

4mc uses lz4, in the same way that bgzip uses gzip. I am a bit confused by the discussion you linked, because it claims that lz4 has been supported since 2011 or so, yet the 2.7.4 API lists BZip2Codec as the only implementation of the SplittableCompressionCodec interface (http://hadoop.apache.org/docs/r2.7.4/api/), which is why I was looking outside the Hadoop distro in the first place.
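
A small diagnostic sketch (not part of ReadTools) that prints which of the codecs registered in a Hadoop configuration actually implement SplittableCompressionCodec, which is an easy way to check this on a given installation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class ListSplittableCodecs {

    public static void main(final String[] args) {
        final Configuration conf = new Configuration();
        // getCodecClasses returns the classes from "io.compression.codecs",
        // or the built-in defaults if the property is not set
        for (final Class<? extends CompressionCodec> codecClass
                : CompressionCodecFactory.getCodecClasses(conf)) {
            final boolean splittable =
                    SplittableCompressionCodec.class.isAssignableFrom(codecClass);
            System.out.println(codecClass.getName() + " splittable=" + splittable);
        }
    }
}
```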

magicDGS commented 6 years ago

@robmaz - I think that the writing part of this issue will be solved by using the factory from Hadoop, but the reading part might be a real issue to support in ReadTools. If you are sure that the output of the new Distmap should be compressed with .4mc, can you open a different issue for reading HDFS files with that codec? We can discuss other approaches there, such as every part being a headerless BAM whose header is added on demand on download. Maybe distmap can output a header to HDFS that is added every time single or multiple parts are downloaded with ReadTools... Anyway, feel free to open a new issue to discuss that.

robmaz commented 6 years ago

Many thanks for looking into this. I will think some more about the download issue.

PS. The new upload compression approach would also work for bgzip, no? Maybe it is worth timing it against the current one as well?

magicDGS commented 6 years ago

It will work with any codec implemented for HDFS. The ones bundled in ReadTools are:

In principle they should work out of the box; we should check whether other compressors can be used by providing their distribution on the classpath when running ReadTools (that is blocked by #406, which delegates the compression of HDFS files to the Hadoop library).

magicDGS commented 6 years ago

The latest master branch has the PR for the Hadoop compressors merged. Maybe you can try running it with a custom classpath after installing with the HEAD option through our brew formula. @robmaz - can you tell me if that works?

magicDGS commented 6 years ago

For running with a custom classpath (assuming you only want the 4mc support):

```
java -cp ReadTools.jar:hadoop-4mc-2.0.0.jar org.magicdgs.readtools.Main
```

This is because the classpath is ignored when the -jar option is used. I should probably add a documentation section for advanced users: how to set a custom java.nio.Path provider and/or a Hadoop compressor, etc.

I think that if I want to support this behavior, I will definitely need a wrapper script sooner or later to provide an easier way to run it. Thanks for pointing out ways to improve ReadTools.

magicDGS commented 6 years ago

Have you tested this, @robmaz? If so, and it works, please close the issue.

magicDGS commented 6 years ago

@robmaz - can you test that the upload with the custom classpath is working? I just released v1.3.0 with the changes included. Follow these instructions to run it: http://magicdgs.github.io/ReadTools/custom_java_classpath.html#example-usage-4mc-compression-for-distmap

magicDGS commented 6 years ago

Closing this issue - it should work, although I haven't tested it.