pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License

How can I set the compression level of a BAM file when writing? #1259

Open clintval opened 8 months ago

clintval commented 8 months ago

Most samtools CLI tools let you set the compression level (e.g. 1 to 9) or even let the tool write uncompressed BAM.

Is this possible with pysam.AlignmentFile? If not, what do you think it would take to make this possible?

jmarshall commented 8 months ago

Underlying all those samtools commands' implementation is hts_open() and its mode parameter. Opening an AlignmentFile uses the same underlying routine, so in principle setting mode appropriately when opening an AlignmentFile would have the desired effect.

However, for some reason AlignmentFile enforces the values listed as “valid modes” in its documentation and rejects anything else. So at the moment the only real control you have over this is to use "wb0" to get uncompressed BAM.

In Python, we could add optional arguments like compression_level to set these options conveniently, allow specifying an optional htsFormat à la HTSlib's hts_open_format(), and/or relax the validation of the mode argument. (Probably taking advantage of Python's expressiveness is a better approach than making pysam users build C-style mode strings.)
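As a rough illustration of the difference between the two approaches, here is a minimal sketch of a helper that turns a `compression_level` argument into an HTSlib-style mode string, in which a digit from 0 to 9 after `wb` selects the compression level. The function name `build_bam_write_mode` is hypothetical and not part of pysam. As noted above, AlignmentFile currently validates its mode argument, so of the strings this produces only "wb" and "wb0" would be accepted today.

```python
from typing import Optional


def build_bam_write_mode(compression_level: Optional[int] = None) -> str:
    """Return an HTSlib-style mode string for writing BAM.

    Hypothetical helper for illustration only; not part of pysam.
    """
    if compression_level is None:
        # No digit: leave the choice of compression level to HTSlib.
        return "wb"
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(compression_level, bool) or not isinstance(compression_level, int):
        raise TypeError("compression_level must be an int or None")
    if not 0 <= compression_level <= 9:
        raise ValueError("compression_level must be between 0 and 9")
    # A trailing digit selects the compression level, e.g. "wb0" or "wb9".
    return f"wb{compression_level}"
```

A keyword argument like this would let callers write `compression_level=6` instead of having to know that the C-level spelling is "wb6".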

clintval commented 6 months ago

Thank you @jmarshall! For now I write BAMs like:

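from pathlib import Path

from pysam import AlignmentFile

# Write uncompressed BAM to stdout ("-") and compressed BAM otherwise.
with (
    AlignmentFile(
        str(output),
        mode="wbu" if output == Path("-") else "wb",
        template=reader,
        threads=compression_threads,
    )
) as writer:
    ...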

This should let me write uncompressed BAM when the output is part of a pipeline and compressed BAM otherwise. I'll leave the issue open unless you want to close it, since it still doesn't seem possible to set the actual compression level.
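The same mode selection could be factored into a small helper, sketched below. The names `select_bam_write_mode` and `open_bam_writer` are hypothetical and not part of pysam. This version also assumes that an output given as the plain string "-" should be treated as stdout, which the comparison against `Path("-")` alone would not catch.

```python
from pathlib import Path
from typing import Union


def select_bam_write_mode(output: Union[str, Path]) -> str:
    """Return "wbu" when writing to stdout ("-"), otherwise "wb"."""
    # Normalise to Path so both "-" and Path("-") are recognised.
    return "wbu" if Path(output) == Path("-") else "wb"


def open_bam_writer(output: Union[str, Path], template, threads: int = 1):
    """Open a pysam.AlignmentFile for writing with the selected mode.

    Hypothetical wrapper; `template` is an already-open AlignmentFile
    whose header is copied to the output.
    """
    # Imported lazily so select_bam_write_mode has no pysam dependency.
    from pysam import AlignmentFile

    return AlignmentFile(
        str(output),
        mode=select_bam_write_mode(output),
        template=template,
        threads=threads,
    )
```

Keeping the mode choice in its own function makes it easy to check without opening any files.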

jmarshall commented 6 months ago

Let's leave it open as I'd like to implement something as per the comment above to make setting all this more flexible.