pycompression / xopen

Open compressed files in Python
MIT License
68 stars 15 forks source link

Feature request - Is there potential to support bgzip? #92

Open pettyalex opened 2 years ago

pettyalex commented 2 years ago

This is a clever solution to get the decompression / compression into a separate process, very helpful little tool. This pattern would work well for a couple additional compression tools that my group uses. I'm going to take a look at adding support for these formats, which should be pretty straightforward.

BGZF, or "blocked gzip" is a format that's used pretty widely in bioinformatics, it's basically a lot of gzipped files concatenated together, with some extra info in the headers and an index in a separate file saying where to seek. It's decompressible by normal gzip, so we actually see bgzf files as .gz more often than .bgz. It'd be really great to be able to compress bgz files with xopen as well. The blocked gzip reference implementation is distributed with htslib as a binary called bgzip, and is available both from conda and most linux distros native packages (tabix on Ubuntu, for example).

Also, it'd be great to see this support zstd as well, which is just an excellent general purpose compression tool that I expect to rapidly grow in usage in the next few years.

Edit: To be clear, both of these tools are already usable from Python, there's a bgzip implementation here, and zstd has excellent Python bindings available, but getting the compression into another process like xopen does makes for much better performance.

marcelm commented 2 years ago

There’s a draft pull request for zstd support in #91, please also see the discussion over there.

Regarding bgzip support, I think compression support is fine.

Regarding decompression (which I realize you didn’t ask for), I think the support we already have for gzip is sufficient (we have tests that ensure we can decompress concatenated gzip files). I would argue that exposing the seeking capabilities of bgzip wouldn’t be in scope for xopen because the idea is to provide an interface that lets one abstract away the used compression algorithm.

rhpvorderman commented 2 years ago

I would like to add to that, that BGZF compression is only advantageous when the input data is aligned properly with the BGZF blocks. In the case that you not care for that alignment, it is faster to compress with normal gzip compression. It is impossible for xopen to know how the input data should be aligned. (The interface handles pure bytes objects, not HTS records). EDIT: 2024-02-16, I have since found out that this statement is false. The data does not need to be properly aligned and can simply be streamed. It is still advantegeous as it can still be indexed.

If you know beforehand that you do want BGZIP, there is also no need for xopen's dynamic choosing of compression algorithm. So you might be better of using subprocess and bgzip that way.

As regards to:

Python bindings available, but getting the compression into another process like xopen does makes for much better performance.

Pysam has a threads option when opening its files. That will already probably give more performance, since it does not have to manage open processes like xopen does. It also has the advantage of knowing exactly how the output blocks should be aligned, so the resulting compressed files can be indexed.

pettyalex commented 2 years ago

For our specific usage case that I want this right now, it doesn't seem like Pysam has the interface for what we're wanting to do: bgzip output to create a .vcf.bgz file and associated tabix index, because downstream tools are expecting bgzipped vcfs with indexes.

Pysam has a threads option when opening its files. That will already probably give more performance, since it does not have to manage open processes like xopen does. It also has the advantage of knowing exactly how the output blocks should be aligned, so the resulting compressed files can be indexed.

I'm going to do some quick performance testing to see how using threads inside of the main Python process does vs opening another process and piping uncompressed data to it like xopen does. My gut would actually say that xopen's approach is faster, just because in the 1st approach, the process will either be copying data in from Python or compressing it in the pysam threadpool, but not both at the same time because of the GIL, while xopen allows Python to send data via pipe without blocking unless the pipe fills up.

I'm taking zstd support discussion to that open PR.

rhpvorderman commented 2 years ago

Have you tried using PipedCompressionWriter with program_args = ["bgzip", "-i"]. Wouldn't that already solve your problem?

pettyalex commented 2 years ago

That looks like it would, I'll validate this.

Would the xopen project be willing to make that support native, so that if you open a .bgz for writing, or a .gz and specify a bgzf or bgz format it will call bgzip?

rhpvorderman commented 2 years ago

For xz, bz2 and `gz`` xopen can fall back on modules from Python's standard library if command line applications are not present. There is no such thing for bgzip. Which means xopen can't guarantee that it can open bgzip files. Unless we would add a dependency that can read bgzip from python. At that point I think it is already quite a lot of work to get some support going and we can only support the features that bgzip has in common with the other formats. (so indexing can not be supported). The reason we can not do this, is that it would make the API more complicated with some features that only apply to bgzip. That would be a loss for the project, in my opinion.

Also the whole problem xopen tries to solve is that you don't know what output the user expects: stdout? gzip? uncompressed? Xopen infers that for you. If you already know what format you want (indexed bgzip in your case), then it is much less overhead to simply let the program output that format, rather than rely on xopen.

@marcelm What do you think?

marcelm commented 2 years ago

@pettyalex Is it correct that you want an indexed bgz file, that is, that you would want to run bgzip with -i?

For me, the question is whether we can add support for bgzip without adding another parameter to the xopen() function that only applies to bgzip. As Ruben said, we shouldn’t make the API more complicated.

The easiest case is that we just never create an index (run bgzip without -i).

Always generating an index is the other option, but I wonder whether we wouldn’t get requests to enable setting the index name (bgzip option -I).

I’m not too worried about bgzip not being available in the standard library. If bgzip isn’t installed, we just print an error and it’s the responsibility of whoever uses xopen to ensure that bgzip is in their list of dependencies.

Whatever we end up doing, wo should at least document the PipedCompressionWriter and PipedCompressionReader classes better.

pettyalex commented 2 years ago

Is it correct that you want an indexed bgz file, that is, that you would want to run bgzip with -i?

Yes, absolutely. For our workflows, we need indexes everywhere that we'd be using bgzip, as bgzip itself performs worse both speed and compression-wise vs other options. The index is its only advantage. The default index name would also be appropriate in every situation we'd want to use this with.

ghuls commented 1 year ago

Yes, absolutely. For our workflows, we need indexes everywhere that we'd be using bgzip, as bgzip itself performs worse both speed and compression-wise vs other options. The index is its only advantage. The default index name would also be appropriate in every situation we'd want to use this with.

Do you use with the threading option? bgzip -@ 4 .... Also if you compile bgzip, you need to make sure you have libdeflate installed as it is much faster than standard zlib. Some old zlib benchmarks by HTSlib (bgzip people): http://www.htslib.org/benchmarks/zlib.html

BGZF can be used without index. BCF2FASTQ uses it to compress FASTQ files, where multiple chunks can be compressed at the same time in different threads.