Open robmaz opened 6 years ago
If we are going to use ReadTools, the improvement may not be that large. As we discussed in the ReadTools repository, on-the-fly compression might not be the bottleneck, and it should be profiled properly (there are Java tools for that, such as https://www.ej-technologies.com/products/jprofiler/overview.html, which can help locate the slow hotspots).
Another option is to profile some uploads/downloads using ReadTools with different compression settings (several runs, taking the average, maximum and minimum). I am still not sure that compression is the major bottleneck: before on-the-fly upload existed, the pipeline took even longer because it compressed locally (adding IO overhead and disk usage on the local filesystem) and then uploaded using hdfs (network bottleneck plus IO in HDFS). The improvement was huge, but it might be that compression is the limiting factor now (at some point there is a limit to how much can be improved).
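A toy harness for the kind of profiling suggested above might look like the sketch below. It times DEFLATE (via the JDK's `java.util.zip.Deflater`, standing in for whatever codec is under test) at several compression levels over the same buffer, repeating each run and reporting min/avg/max. The class name, buffer contents, and repetition count are illustrative assumptions; a real run would feed actual FASTQ/BAM chunks through the actual ReadTools/Hadoop codecs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Hypothetical micro-benchmark sketch: repeat each compression run and
// report min/avg/max, as suggested in the comment above.
public class CompressionBench {

    // Compress a buffer with DEFLATE at the given level.
    static byte[] compress(byte[] data, int level) throws IOException {
        Deflater def = new Deflater(level);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bos, def)) {
            out.write(data);
        }
        def.end();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Synthetic, semi-repetitive 4 MiB input (stands in for read data).
        byte[] data = new byte[4 << 20];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) "ACGT".charAt(i % 4);
        }

        for (int level : new int[]{1, 6, 9}) {
            int reps = 5;
            long min = Long.MAX_VALUE, max = 0, total = 0, size = 0;
            for (int r = 0; r < reps; r++) {
                long t0 = System.nanoTime();
                size = compress(data, level).length;
                long dt = System.nanoTime() - t0;
                min = Math.min(min, dt);
                max = Math.max(max, dt);
                total += dt;
            }
            System.out.printf("level=%d size=%d min=%dms avg=%dms max=%dms%n",
                    level, size, min / 1_000_000, total / reps / 1_000_000,
                    max / 1_000_000);
        }
    }
}
```

This only measures CPU-side compression, of course; to see whether compression or the network dominates, the same min/avg/max treatment would need to be applied to the full upload as well.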
If people are complaining about speed, they should have been at the institute 3 years ago! That's one of the reasons ReadTools has Distmap support! Hahaha
I added the ReadTools label because it is somewhat related (unless you remove the upload/download dependency on it).
ReadTools already supports Hadoop compression plugins on the classpath, so this should be ready to test.
Compression is one of the biggest bottlenecks in the pipeline right now. 4mc is nearly 20x faster than bzip2 or bgzip and may offer a reasonable trade-off between compression time and transfer speed.
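If ReadTools resolves codecs through the standard Hadoop `io.compression.codecs` mechanism, registering 4mc might look like the following `core-site.xml` fragment. This is a sketch under assumptions: the `com.hadoop.compression.fourmc.FourMcCodec` class name is taken from the 4mc project and not verified here, and the 4mc jar would also need to be on the classpath.

```xml
<!-- Hypothetical core-site.xml fragment: register the 4mc codec alongside
     the Hadoop defaults so it can be resolved by file extension.
     FourMcCodec class name is an assumption; check the 4mc docs. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.fourmc.FourMcCodec</value>
</property>
```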