pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks source link

Pysam .csi index support & how to deal with large references #1278

Open RNieuwenhuis opened 4 months ago

RNieuwenhuis commented 4 months ago

Hi,

I use CrossMap as my favourite liftover tool because of its versatility and lately I've tried to use it to do liftover on some bam files with a large reference genome. Largest chromosome being 2446554542 in length.

I got this error:

Traceback (most recent call last):
  File "/home/WUR/nieuw133/miniconda3/envs/Crossmap/bin/CrossMap.py", line 281, in <module>
    crossmap_bam_file(mapping = mapTree, chainfile = chain_file, infile = in_file, outfile_prefix = out_file, chrom_size = targetChromSizes, IS_size = args.insert_size, IS_std = args.insert_size_stdev, fold = args.insert_size_fold, addtag = args.add_tags, cstyle = args.cstyle)
  File "/home/WUR/nieuw133/miniconda3/envs/Crossmap/lib/python3.10/site-packages/cmmodule/mapbam.py", line 406, in crossmap_bam_file
    new_alignment.next_reference_start =  read2_maps[1][1]
  File "pysam/libcalignedsegment.pyx", line 1346, in pysam.libcalignedsegment.AlignedSegment.next_reference_start.__set__
OverflowError: value too large to convert to int32_t

Now searching pysam history it seems reading of .csi indexed bam files was implemented at some point, but writing was not?

In general, I am confused with the different limits present. The default .bai index has a max reference SQ length of 2^29, which can be omitted by using a .csi index for which I could not find a specified max. However, the largest chromosome I have will also overflow int32_t max value of 2,147,483,647. This also seems to go over the sam/bam limit specified. Many tools will be stuck with this limit because it would break backwards compatibility it seems? Is there a specific reason why not unsigned int32 was used?

What is in general the way to deal with such large genomes? Is there a standardized way to deal with it? Are there ongoing discussions about how to deal with this? Any references are much appreciated.