Support for long reference sequences

jayoung commented 7 years ago

Hi there,

I'd like to add another voice for providing support for longer reference sequences: it already seems useful for some genomes, and as better assembled and larger genomes come out it seems like the need will get more frequent.

I've been working a little with the opossum monDom5 assembly where two chromosomes too long to be handled by IGV (chr1 is 748 Mb and chr2 is 541 Mb). I've been told it's an underlying limitation of htsjdk that creates this issue for IGV with longer reference sequences, so I'm posting the request here too.

Thanks for considering it,

Janet Young

Dr. Janet Young

Malik lab http://research.fhcrc.org/malik/en.html

Division of Basic Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Avenue N., A2-025, P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512 email: jayoung ...at... fredhutch.org

lindenb commented 7 years ago

as far as I know the max length of a chromosome is max(int32)= 2,147,483,647 so there shouldn't have any problem on this side.

There may be a problem when you're using a binning index (like bam.bai or tabix.tbi) : where the max size is only 512Mb (http://genomewiki.ucsc.edu/index.php/Bin_indexing_system ). But you can always use a csi index: https://www.biostars.org/p/111984/

jrobinso commented 7 years ago

Yes, the problem is the index. Does htsjdk support csi indexes? My understanding is it does not. I'm looking at the source code but don't see anything.

lbergelson commented 7 years ago

It doesn't support CSI as far as I know.

yfarjoun commented 7 years ago

How much of a burden would it be to have support for csi indices?

On Fri, Jun 16, 2017 at 10:02 AM, Louis Bergelson notifications@github.com wrote:

It doesn't support CSI as far as I know.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/htsjdk/issues/900#issuecomment-309034123, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0pfnDdfi3komGZOeL1zaw8ATPB2lks5sEor2gaJpZM4N7i9G .

lbergelson commented 7 years ago

I'm not really sure, but we haven't always done a great job of handling index related things in a clean way, and I suspect we have client code that relies on the details of the bai index since it's been the only one we support for bam. I'd guess that it would be a fair amount of work to implement.

yfarjoun commented 7 years ago

from what I saw online, bai is a specialization of csi....so it might not be so bad...it would be good for someone knowledgable about the format to step in though....

jayoung commented 7 years ago

Thanks all for looking into this: hoping someone capable (I am not capable) is interested in taking this on at some point.

Janet

nathanhaigh commented 6 years ago

There is another issue open regarding CSI index support in htsjdk: #447

FredericBGA commented 4 years ago

Hi,

Picard has a PR that allows to work with CSI index (from https://github.com/samtools/htsjdk/pull/1040) https://github.com/broadinstitute/picard/pull/998

But I need htsjdk to be able to write CSI as well (example of Wheat genome):

Exception in thread "main" htsjdk.samtools.SAMException: Exception when processing alignment for BAM index M01322:139:000000000-BVL2R:1:2108:5384:21249 2/2 301b aligned to chr1A:537177243-537177543. Caused by: htsjdk.samtools.SAMException: Exception creating BAM index for record M01322:139:000000000-BVL2R:1:2108:5384:21249 2/2 301b aligned to chr1A:537177243-537177543. Caused by: java.lang.IllegalStateException: Read position too high for BAI bin indexing.

How can I help?

lbergelson commented 4 years ago

I think GATK SHOULD work with bams with a CSI index already as well. There's no VCF support at the moment which will probably be problematic though.

I don't think we have any plan to implement it on our own in the near future. We just don't have the bandwidth to look into it, especially with the current pandemic craziness which is significantly cutting into people's coding time.

However, if you're interested in contributing and have a fairly significant chunk of time on your hands we'd be able to work with you on getting a CSI writer into htsjdk. In theory it shouldn't be too hard, it's basically just a tiny modification to the bai writer. I suspect there will be a lot of pain points though with how the code is structured and (poorly) tested.
The index code is unfortunately, really gross. It wasn't really designed to be extensible. There's also a weird divide between the bam index code and the vcf index code, the CSI could in theory be used for VCF to support long references, but currently we only support tabix. As far as I understand vcf CSI and bam CSI have to be slightly different due to how extra metadata is encoded in the bai/ bam csi.

@cmnbroad IS possibly going to be trying to modernize some of the index code it in the near future. He might be able to offer some guidance as to where to start if you're interested.

We'd love the help, but it will probably take a while.

jrobinso commented 4 years ago

Hey @lbergelson @cmnbroad hang in there, must be crazy where you are.

RE index modernization, ping me if you do that, I'm interested in modernizing the igv.js code as well, might be some synergy. Currently both bam and tabix indexes are handled by the same code, with a few switches, this in turn was adapted from even older code. At least its relatively small https://github.com/igvteam/igv.js/blob/master/js/bam/bamIndex.js

nathanhaigh commented 4 years ago

Perhaps the code developed for use by JBrowse could help?

BAM Index

Tabix Index

FredericBGA commented 4 years ago

Thank you for all these comments. I'm not sure that my skills (java and computing) will be enough, but I'm still ready to help. Of course I'm stuck at home right now like half of the people on Earth, so this is not the best timing but we could get back in touch when you managed to find time to make progress on this issue.

aschaetz commented 2 years ago

Hi Guys, First of all, thank you for the read support for .csi indices.

We start seeing more and more genomes where the contigs exceed the limits of the .bai index. While it's good that htsjdk can read these indices, we need write support, as well.

Not being able to write a .csi index is a significant limitation when working with large, well assembled, plant genomes. And we believe that this will become more important as well assembled genomes become more frequent, due to the rise of long read technologies.

Adding this functionality would be very much appreciated.

lindenb commented 2 years ago

there was a project samtools/htsjdk-next-beta which aim was to implement long REFs (eg.: https://github.com/samtools/htsjdk-next-beta/pull/6 ) . But the project looks inactive (dead ?) now.

aschaetz commented 2 years ago

Yes, sadly no activity there.

lynnjo commented 1 year ago

My group (Buckler Lab at Cornell University) also has a need for VCF CSI support for software we write/maintain. Our open source code (Buckler Lab PHG) uses gvcf files to store variant information for plant genomes and has run into problems working with the larger genomes .e.g wheat.

Has anything changed in terms of scheduling CSI support for VCF? We would be interested in working on this in the htsjdk code base if there were someone with whom we could consult. @lbergelson @cmnbroad (or anyone else) will you comment/ update on where this stands? Thanks

samtools / htsjdk

Support for long reference sequences #900