Support for loading CRAM + CRAI on large reference genomes. Instead of IndexOutOfBound

WimSpee commented 1 year ago

Dear IGV developers,

Thank you for the very nice IGV software.

We would like to visualize the content of CRAM files on large reference genomes in IGV. For example, Wheat and Onion both have genomes of ca. 16 Gb, with chromosomes much larger than 500 Mb.

See below for the chromosome lengths of the public wheat genome assembly. Onion has only 8 chromosomes, but they are even larger, in the range of 1 Gb to 2.5 Gb.

Since it is now possible to create reference genomes and re-sequencing data for relevant large-genome plant (and animal) species, it would be very nice if these could also be loaded in IGV using CRAM files and CRAI indexes.

For both the Onion and Wheat CRAM files, we get this exact error when trying to load them into IGV:

Error loading large_genome.cram: An error occurred while accessing: large_genome.cram
Error loading BAM file: java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770

This error also has been reported here: https://github.com/broadinstitute/gatk/issues/8192

The CRAM file is valid according to samtools quickcheck.

The CRAI index was created via:
samtools index -c large_genome.cram

samtools can use the CRAI index to read slices from the CRAM:
samtools view -T large_genome.fa large_genome.cram large_chr1B

Splitting the chromosomes into pieces of ca. 500 Mb is not really a workaround. For Onion, some chromosomes would need to be split into 5 pieces.

Some other downstream analyses (e.g. on the VCF related to the CRAM) also require the chromosomes to represent the real chromosomes, e.g. to match the genetic map.

Thank you for your thoughts and help on this.

Wheat IWGSC V1 reference genome https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Assemblies/v1.0/iwgsc_refseqv1.0_all_chromosomes.zip

chromosome	size
chr1A	594.102.056
chr1B	689.851.870
chr1D	495.453.186
chr2A	780.798.557
chr2B	801.256.715
chr2D	651.852.609
chr3A	750.843.639
chr3B	830.829.764
chr3D	615.552.423
chr4A	744.588.157
chr4B	673.617.499
chr4D	509.857.067
chr5A	709.773.743
chr5B	713.149.757
chr5D	566.080.677
chr6A	618.079.260
chr6B	720.988.478
chr6D	473.592.718
chr7A	736.706.236
chr7B	750.620.385
chr7D	638.686.055
chrUn	480.980.714
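
To see which chromosomes cross the roughly 500 Mb threshold mentioned above, the table can be checked with a short script. This is only a sketch: it assumes the relevant ceiling is 2^29 bp (536,870,912), the per-chromosome limit of BAI-style binning, which this thread does not confirm as the cause of the exception. The names parse_sizes and over_limit are illustrative, and only a few rows of the table are included.

```python
# Assumption: 2**29 bp is the ceiling of BAI-style binning. The thread itself
# only says "much larger than 500 Mb" and does not name the exact limit.
BAI_LIMIT = 2**29  # 536,870,912 bp

# A few rows copied from the table (dots are thousands separators).
TABLE = """\
chr1A\t594.102.056
chr1D\t495.453.186
chr3B\t830.829.764
chr6D\t473.592.718
"""


def parse_sizes(text):
    """Parse 'name<TAB>size' lines into a dict of name -> length in bp."""
    sizes = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, size = line.split("\t")
        sizes[name] = int(size.replace(".", ""))
    return sizes


def over_limit(sizes, limit=BAI_LIMIT):
    """Return the chromosome names whose length exceeds the limit."""
    return [name for name, size in sizes.items() if size > limit]


sizes = parse_sizes(TABLE)
print(over_limit(sizes))  # ['chr1A', 'chr3B']
```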

jrobinso commented 1 year ago

IGV uses the same library as the GATK to read CRAM files, so it's not going to be addressed until the underlying issue is addressed in htsjdk. I see there is already an open ticket for that, which you referenced. I can't really think of a workaround, other than to try BAM + CSI indexes, but that won't help with VCFs.
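
As a rough illustration of why a CSI index can cover longer chromosomes: a CSI index spans 2^(min_shift + 3*depth) bases per reference, so the depth can grow with the chromosome length. The sketch below computes the smallest depth for a given length, assuming the samtools default min_shift of 14. The function name csi_depth is made up for this example and is not part of any library.

```python
def csi_depth(max_len, min_shift=14):
    """Smallest depth such that 2**(min_shift + 3*depth) >= max_len.

    Assumption: min_shift=14 is the samtools default for CSI indexes.
    """
    depth = 0
    while (1 << (min_shift + 3 * depth)) < max_len:
        depth += 1
    return depth


# Depth 5 with min_shift 14 covers 2**29 bp, the same span as BAI.
print(csi_depth(2**29))          # 5
# The longest wheat chromosome in the table (chr3B) needs one more level.
print(csi_depth(830_829_764))    # 6
# An onion-sized chromosome of about 2.5 Gb still fits in depth 6 (2**32 bp).
print(csi_depth(2_500_000_000))  # 6
```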

As an experiment, have you tried using igv-web (https://igv.org/app)? It does not have any restrictions on chromosome length that I am aware of.

WimSpee commented 1 year ago

Thank you for the information. We will give IGV web app a try. That might be a good workaround for us. It will take a few days to get firewall access to it or get it installed on a computer close to the data. Will let you know if it works.

Does IGV web app use htslib instead of htsjdk to read in the CRAM and CRAI files?

jrobinso commented 1 year ago

IGV web app uses the jbrowse CRAM library. https://github.com/GMOD/cram-js

There is no need to install igv-webapp "close to the data". It is entirely a client program, so everything runs in your web browser. There is no server component. https://igv.org just hosts the static HTML and JavaScript pages, which are downloaded to your computer for running.

WimSpee commented 1 year ago

Local install is just a backup option if I can't get firewall access to https://igv.org/ on a machine that is close to the data. Cool that there is a pure JS CRAM reader.

jrobinso commented 1 year ago

We have zipped archives for local install, see the readme at https://github.com/igvteam/igv-webapp. Or you can just build it.

WimSpee commented 1 year ago

I managed to get firewall access to igv-webapp on https://igv.org/ on a Linux machine that is close to the data. I am using Firefox 102.8.0esr.

However, I am unable to load any reference genome FASTA in igv-webapp via Genome -> Local File.

Even the tomato genome, at just 1 Gb, fails to load. https://solgenomics.net/ftp//tomato_genome/assembly/build_4.00/S_lycopersicum_chromosomes.4.00.fa.gz

ERROR
Genome requires either a single JSON file or a FASTA file & index file
OK

fai and dict index files are present next to the fasta.

Java/HTSJDK IGV is able to open these fasta reference genomes on the same machine, using the same paths.

Does IGV web-app require anything special to open local reference genome files? Do you have advice on how to troubleshoot this error?

Thank you.

helgathorv commented 1 year ago

Due to security considerations for web browsers, IGV-Web is not able to automatically locate the index file the way the desktop application does. You must load both the FASTA file and the index file at the same time. So if you are loading a local file, you need to select both in the file chooser.
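
The error message quoted earlier also mentions a single JSON file as the alternative to selecting the FASTA and index together. A minimal sketch of generating such a file is below, assuming the igv.js reference fields id, name, fastaURL and indexURL, and assuming the files are reachable by URL. The URLs and the helper name genome_json are hypothetical placeholders, not paths from this thread.

```python
import json


def genome_json(genome_id, name, fasta_url, index_url):
    """Build a genome description as a JSON string.

    Assumption: the field names follow the igv.js reference object
    (id, name, fastaURL, indexURL).
    """
    return json.dumps(
        {
            "id": genome_id,
            "name": name,
            "fastaURL": fasta_url,
            "indexURL": index_url,
        },
        indent=2,
    )


# Hypothetical URLs, for illustration only.
text = genome_json(
    "onion",
    "Onion (example)",
    "https://example.org/genomes/onion.fa",
    "https://example.org/genomes/onion.fa.fai",
)
print(text)
```

For purely local files, selecting the FASTA and its .fai together in the file chooser, as described above, remains the simpler route.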

WimSpee commented 1 year ago

Thank you for the information. I can confirm that by selecting both the FASTA+FAI and CRAM+CRAI I could now load and display the Onion genome and sequencing reads in IGV web-app.

Thanks again for the help and the nice and useful IGV software.

helgathorv commented 1 year ago

Glad it worked out. Thanks for letting us know.

jrobinso commented 1 year ago

That's good to know. Kudos to the JBrowse team, @rbuels and @cmdcolin especially, for the JS CRAM library. You are right, this is cool. It still amazes me they pulled this off.

igvteam / igv

Support for loading CRAM + CRAI on large reference genomes. Instead of IndexOutOfBound #1300