igvteam / igv.js

Embeddable genomic visualization component based on the Integrative Genomics Viewer
MIT License

support bgzip reference file #864

Open odoublewen opened 5 years ago

odoublewen commented 5 years ago

Feature request:
It would be nice if igv.js supported bgzipped fasta files for the "reference" object. (I assume it is not supported, as it is not mentioned here and also I've tried it without success.)

Background: samtools faidx will happily create an index on block-gzipped fasta files. It actually creates two: the standard .fai one, and another, .gzi, which is only marginally documented.

You have bgzip + tabix working, so seems like this might be feasible too.

jrobinso commented 5 years ago

Yes, that's supported in IGV desktop. It's moderately complex, I don't know how many people use it, and you are the first to request it. I would consider a pull request for this, but it won't get prioritized anytime soon. I agree it would be nice.

cmdcolin commented 5 years ago

just for ref https://github.com/GMOD/indexedfasta-js supports bgzip fasta :)

jrobinso commented 5 years ago

Thanks @cmdcolin I might just use that. You've got pretty much everything there, I would contribute some modules if there's anything you are missing. VCF maybe?

odoublewen commented 5 years ago

@jrobinso Thanks - I agree it needn't be high priority, just a nice to have.

@cmdcolin Thanks for the reference. I'd used jbrowse many years ago, it's looking good these days.

cmdcolin commented 5 years ago

Grazi :) we started these separate npm modules for the data types as part of a big refactoring. We have a vcf module already but thanks. I liked the .hic adapter igv has though!

jrobinso commented 5 years ago

@cmdcolin You're just a little ahead of me, I have been planning to do the same thing with igv.js this spring. I'm on the fence on whether to use Node or ES6 type imports. I would like to use ES6 if Node had better support for it. RE hic adapter, are you referring to hic-straw or something else?

cmdcolin commented 5 years ago

Yep, the hic-straw module https://github.com/igvteam/hic-straw

cmdcolin commented 5 years ago

I think it's probably worth just using the ES6 type import/exports, but it's probably fine to use require too. If you have some level of babel and/or webpack it tends to just get smoothed over one way or the other.

Some people are opposed to the "export default" syntax in the ES6 modules for various reasons but you can use ES6 modules without export default too.

jrobinso commented 5 years ago

@cmdcolin We're way off topic now, but open issues in hic-straw for anything you need, I'm working on some optimizations now. I created that format (.hic) back in 2010, it's due a revision, not optimized for the web.

SchwarzMarek commented 5 years ago

Hi everybody, I also could use the zipped fasta for input :). Best regards

jrobinso commented 5 years ago

@SchwarzMarek it's on my short term list.

PSB-mibel commented 2 years ago

As this feature request is still open, I assume it's still not working. I tried myself, and either invalid range requests were produced, or random characters were displayed (indicating wrong offset when reading/unzipping data). Is there still a plan to implement bgzip support?

The main benefit is that it would save a massive amount of disk space (having to host multiple unzipped 20GB genomes seems a bit wasteful).

jrobinso commented 2 years ago

@PSB-mibel no doubt it would be a good thing, I would welcome a pull request here.

graymauser commented 2 years ago

@PSB-mibel Hi, I've made a server-side workaround to be able to use bgzipped fasta files. The basic idea is to use bgzip to read the required portion of the file and then serve it to igv.js.

I have Python code for this if you are interested. Best regards

PSB-mibel commented 2 years ago

@graymauser Thanks for the offer, but I'm currently working on a fix in IGV.js itself (seems a more long-term solution to the issue).

@jrobinso How do you see the configuration options for this extension? I've currently chosen 'compressedIndexURL', as in order to make the BGZIP reference genome work with GMOD/indexedfasta I'd need 2 'indexURL' options:

1) indexURL: fasta.gz.fai
2) compressedIndexURL: fasta.gz.gzi

I could also just append the '.gzi' suffix to the fastaURL, but this goes a bit against the philosophy of the existing configuration option for the indexURL (where you could just have appended the .fai extension).

In addition, do you think it would be necessary to add a 'compression' option to the JSON configuration options for tracks/reference? While we can deduce that files should be compressed based on the compressedIndexURL option, the absence of this option does not mean that the reference isn't compressed (but it's also not indexed). Again, relying on the file extensions here might also be dangerous, and making the code future proof by indicating the type of compression used might be a good idea as well. So the idea would be to add a 'compression: FALSE|"bgzip" ' option to the configuration, but I'm not sure what your take is on this.

jrobinso commented 2 years ago

Hi @PSB-mibel that's great. Yes add a new property "compressedIndexURL", so a bgzip fasta will need both indexes. As you know, the bgzip index (gzi) is needed to find the blocks, the indexURL gives the virtual file offsets.
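To illustrate how the two indexes fit together, here is a minimal sketch. It assumes the usual .fai columns (offset, bases per line, bytes per line) and the .gzi layout written by bgzip: a little-endian uint64 entry count followed by pairs of (compressed offset, uncompressed offset), with the first block at (0, 0) left implicit. The function names `parseGzi`, `faiByteOffset` and `findBlock` are hypothetical, not igv.js or indexedfasta-js API.

```js
// Sketch only: hypothetical helpers, not taken from igv.js or GMOD/indexedfasta-js.

// Parse a .gzi file into a list of block start offsets.
// Assumed layout: uint64 count, then `count` pairs of uint64
// (compressedOffset, uncompressedOffset), all little-endian.
// The first block (0, 0) is not stored in the file, so it is added here.
function parseGzi(arrayBuffer) {
    const view = new DataView(arrayBuffer)
    const count = Number(view.getBigUint64(0, true))
    const blocks = [{compressed: 0, uncompressed: 0}]
    for (let i = 0; i < count; i++) {
        const p = 8 + i * 16
        blocks.push({
            compressed: Number(view.getBigUint64(p, true)),
            uncompressed: Number(view.getBigUint64(p + 8, true))
        })
    }
    return blocks
}

// Byte offset of a 0-based base position in the uncompressed fasta,
// computed from one .fai record {offset, lineBases, lineBytes}.
function faiByteOffset(fai, pos) {
    return fai.offset + Math.floor(pos / fai.lineBases) * fai.lineBytes + (pos % fai.lineBases)
}

// Binary search for the last block whose uncompressed start is <= offset.
function findBlock(blocks, offset) {
    let lo = 0
    let hi = blocks.length - 1
    while (lo < hi) {
        const mid = (lo + hi + 1) >> 1
        if (blocks[mid].uncompressed <= offset) {
            lo = mid
        } else {
            hi = mid - 1
        }
    }
    return blocks[lo]
}
```

A reader would then fetch the compressed bytes starting at `findBlock(...).compressed`, inflate them, and skip `offset - block.uncompressed` bytes to reach the requested base.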

You might find this implementation helpful, it's Java, from the IGV desktop application: https://github.com/igvteam/igv/blob/master/src/main/java/org/broad/igv/feature/genome/fasta/FastaBlockCompressedSequence.java

I don't understand the question wrt tracks. BGZIP is already supported for tracks ("tabix"). Tracks and reference sequence are independent. Either or both can be compressed or not, or even indexed or not. If you are asking about a gzipped reference that is not block compressed (and not indexed), we can allow that. I think it should work automatically, but if it doesn't that's something I can fix. The only combination not allowed would be a bgzip compressed fasta with an indexURL but no compressedIndexURL: either both are present or both are missing (which would be a rare edge case).
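As a rough illustration of the combinations described above, the sketch below shows a reference entry carrying both indexes and a small helper that classifies a config by which index URLs are present. The URLs are placeholders, and `referenceIndexMode` is a hypothetical helper, not something igv.js provides. Presence of the URLs alone cannot reveal whether a fasta is bgzipped, so the one case it can reject outright is a .gzi with no .fai.

```js
// Example reference entry for a bgzipped fasta; URLs are placeholders.
const reference = {
    id: "myGenome",
    fastaURL: "https://example.org/genome.fa.gz",
    indexURL: "https://example.org/genome.fa.gz.fai",
    compressedIndexURL: "https://example.org/genome.fa.gz.gzi"
}

// Hypothetical helper (not igv.js API): decide how a loader might treat
// the reference based on which index URLs were supplied.
function referenceIndexMode(ref) {
    const hasFai = Boolean(ref.indexURL)
    const hasGzi = Boolean(ref.compressedIndexURL)
    if (hasFai && hasGzi) {
        return "bgzip-indexed"   // block-compressed fasta, range requests via .gzi + .fai
    }
    if (hasFai) {
        return "indexed"         // assumed to be a plain fasta with a .fai index
    }
    if (hasGzi) {
        throw new Error("compressedIndexURL requires indexURL")
    }
    return "unindexed"           // whole file is loaded, gzipped or not
}
```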

jrobinso commented 2 years ago

@PSB-mibel some more tips you'll probably need. The following code snippet from bamReader.js queries for compressed blocks from a bam index, then decompresses them. You will need to do something similar, if not identical. The variable "ba" at the end contains the bytes as a Uint8Array.

It's important to use the igv code igvxhr and buildOptions in order to maintain support for OAuth.


```js
import {BGZip, igvxhr} from "../../node_modules/igv-utils/src/index.js"
import {buildOptions} from "../util/igvUtils.js"

...

const bamIndex = await this.getIndex()
const chunks = bamIndex.blocksForRange(chrId, bpStart, bpEnd)

if (!chunks || chunks.length === 0) {
    return alignmentContainer
}

let counter = 1
for (let c of chunks) {

    let lastBlockSize
    if (c.maxv.offset === 0) {
        lastBlockSize = 0    // Don't need to read the last block.
    } else {
        const bsizeOptions = buildOptions(this.config, {range: {start: c.maxv.block, size: 26}})
        const abuffer = await igvxhr.loadArrayBuffer(this.bamPath, bsizeOptions)
        lastBlockSize = BGZip.bgzBlockSize(abuffer)
    }
    const fetchMin = c.minv.block
    const fetchMax = c.maxv.block + lastBlockSize
    const range = {start: fetchMin, size: fetchMax - fetchMin + 1}

    const compressed = await igvxhr.loadArrayBuffer(this.bamPath, buildOptions(this.config, {range: range}))

    var ba = BGZip.unbgzf(compressed) //new Uint8Array(BGZip.unbgzf(compressed)); //, c.maxv.block - c.minv.block + 1));
```
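For readers unfamiliar with the block-size lookup in the snippet above, the following is a minimal sketch of how the size of a BGZF block can be read from its header, based on the BGZF layout in the SAM/BAM specification (gzip magic bytes with the FEXTRA flag set, then a "BC" extra subfield holding the total block size minus one). It is not the igv-utils implementation of `BGZip.bgzBlockSize`, and the name `bgzfBlockSize` is hypothetical.

```js
// Sketch only: reads the total size of a BGZF block from its header bytes.
// `bytes` is a Uint8Array starting at the first byte of the block.
function bgzfBlockSize(bytes) {
    // gzip magic (1f 8b), deflate method (8), FEXTRA flag (bit 2) must be set
    if (bytes.length < 18 || bytes[0] !== 0x1f || bytes[1] !== 0x8b ||
        bytes[2] !== 8 || (bytes[3] & 4) === 0) {
        throw new Error("Not a BGZF block header")
    }
    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
    const xlen = view.getUint16(10, true)          // length of the extra field
    const end = Math.min(12 + xlen, bytes.length)
    let p = 12
    // Walk the extra subfields: SI1, SI2, SLEN (uint16), then SLEN bytes of data
    while (p + 4 <= end) {
        const slen = view.getUint16(p + 2, true)
        if (bytes[p] === 66 && bytes[p + 1] === 67 && slen === 2 && p + 6 <= end) {
            // 'B','C' subfield: BSIZE is the total block size minus 1
            return view.getUint16(p + 4, true) + 1
        }
        p += 4 + slen
    }
    throw new Error("BC subfield not found")
}
```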
PSB-mibel commented 2 years ago

Forgot to mention this in the pull-request: I've done quite some testing using a variety of reference genomes (Arabidopsis, human, maize) and it all seems to work fine. The retrieved sequences also seemed to be correct, as the MD5 checksum for CRAM files didn't report any problems (which was the case during development due to some offset issues). As such, I think the results can be trusted.

odoublewen commented 4 months ago

@jrobinso Looks like this issue could be closed, thank you (and thanks to @PSB-mibel too!).

You may want to add a note to https://github.com/igvteam/igv.js/wiki/File-formats to indicate that bgzipped fasta files are now accepted.

Thanks again!