igvteam / igv-notebook

Module for embedding igv.js in an IPython notebook
MIT License

Hosted file moved? #19

Closed yonghaoy closed 1 year ago

yonghaoy commented 1 year ago

Hello, IGV.js recently broke for us because https://igv.genepattern.org/genomes/seq/hg38/hg38.fa.fai is blocked by our Content Security Policy (CSP). That is expected: our CSP uses an allowlist, and the allowlist does not contain that URL. The IGV URLs currently in our allowlist are:

    "https://s3.amazonaws.com/igv.broadinstitute.org/",
    "https://s3.amazonaws.com/igv.org.genomes/",
    "https://portals.broadinstitute.org/webservices/igv/",
    "https://igv.org/genomes/",

Did igv.js recently move its hosted files to igv.genepattern.org? I saw that @jrobinso mentioned a move here: https://github.com/igvteam/igv.js/issues/1570#issuecomment-1357951675, but I am not sure whether that is the same thing.

Thanks!

jrobinso commented 1 year ago

Yes, we will be slowly moving all our data to that host. If you want to fully protect yourself against data moves you could host genomes on your own server, but we don't move data often.

yonghaoy commented 1 year ago

Thanks @jrobinso. To confirm, has hg38.fa.fai already moved to igv.genepattern.org? I am investigating why IGV recently broke for us, and I want to confirm that adding https://igv.genepattern.org to our CSP allowlist will fix the issue. Thanks, Yonghao

jrobinso commented 1 year ago

Yes. But again, if this is an issue for your organization, you might consider hosting the files you need locally within your organization. We host data for IGV as a convenience, but you are not required to use our hosted data. Costs are becoming an issue, and files could move again in the future. Some instructions are here: https://github.com/igvteam/igv/wiki/Hosting-Genomes

Another important host (I'm surprised IGV works at all without this one allowlisted) is

https://data.broadinstitute.org/igvdata

Also, the following might well be used

https://igv-genepattern-org.s3.amazonaws.com
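
A rough way to see which of the hosts mentioned so far would be covered is sketched below, using only the four allowlist entries quoted at the top of this thread. It is an approximation: real CSP source matching follows the browser's own rules, whereas this uses a plain string-prefix test, and the helper name `is_allowed` is made up for illustration.

```python
# Allowlist entries exactly as quoted earlier in this thread.
ALLOWLIST = [
    "https://s3.amazonaws.com/igv.broadinstitute.org/",
    "https://s3.amazonaws.com/igv.org.genomes/",
    "https://portals.broadinstitute.org/webservices/igv/",
    "https://igv.org/genomes/",
]


def is_allowed(url, allowlist=ALLOWLIST):
    """Hypothetical helper: True if url starts with any allowlist entry.

    Assumption: a prefix test stands in for real CSP matching.
    """
    return any(url.startswith(prefix) for prefix in allowlist)


# URLs and hosts mentioned in this thread so far.
CANDIDATES = [
    "https://igv.genepattern.org/genomes/seq/hg38/hg38.fa.fai",
    "https://data.broadinstitute.org/igvdata",
    "https://igv-genepattern-org.s3.amazonaws.com",
]

blocked = [url for url in CANDIDATES if not is_allowed(url)]
for url in blocked:
    print("not covered by allowlist:", url)
```

With only those four entries, none of the three candidates is covered, so each would need its own allowlist entry unless another rule already permits it.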

yonghaoy commented 1 year ago

Thanks for getting back to me. I am not familiar with IGV or how to use it; we are developing tools for scientists who use IGV heavily. The Python snippet that stopped working (and I don't know how or where to set the host for the data) is

import igv

b = igv.Browser({"genome": "hg38"})
b.load_track(
    {
        "name": "wgs_1000004",
        "url": "wgs_1000004.cram",
        "format": "cram",
        "type": "alignment",
        "indexURL": "wgs_1000004.cram.crai",
        "indexed": True
    })

b.show()

RE: https://data.broadinstitute.org/igvdata

The server we are developing (https://app.terra.bio/) is hosted by Broad, and I think all Broad hostnames are allowed.

RE: https://igv-genepattern-org.s3.amazonaws.com/

Thanks! I will also add that into our allowlist...

jrobinso commented 1 year ago

Ahh, you are actually using https://github.com/igvteam/igv-notebook then. OK, sorry: my suggestion to consider hosting data on your own servers still applies, but the instructions I pointed to were for IGV desktop.

What version of igv-notebook are you using? The most recent is 0.4.4, although for all practical purposes this is ready to release as 1.0.0. https://pypi.org/project/igv-notebook/

The configuration looks suspect; in particular, the url and indexURL are not qualified. I'm not sure how this is working, but if it was working before, then the host name change might be the problem.
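
To illustrate what "not qualified" means here, a minimal sketch using Python's standard `urllib.parse` follows. The helper name `is_qualified` is hypothetical, and the check (a URL counts as qualified when it has both a scheme and a host) is an assumption about what a fully qualified URL should look like, not something igv-notebook enforces.

```python
from urllib.parse import urlparse


def is_qualified(url):
    """Hypothetical check: True if url has both a scheme and a host."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


# Track configuration as posted above.
track = {
    "name": "wgs_1000004",
    "url": "wgs_1000004.cram",
    "format": "cram",
    "type": "alignment",
    "indexURL": "wgs_1000004.cram.crai",
    "indexed": True,
}

# Collect the URL-valued keys that are bare file names rather than full URLs.
unqualified = [key for key in ("url", "indexURL") if not is_qualified(track[key])]
print("unqualified keys:", unqualified)
```

Both `url` and `indexURL` are flagged, because bare file names carry neither a scheme nor a host.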

I'm going to transfer this issue to igv-notebook.

yonghaoy commented 1 year ago

Actually we are using igv-jupyter https://github.com/g2nb/igv-jupyter which wraps igv.js by way of igv-notebook.

jrobinso commented 1 year ago

OK. igv-jupyter is focused on the needs of the g2nb project; it's not something I personally have any involvement in. I don't understand how those url property values are working, but maybe it uses magic of some kind. Anyway, the igv-jupyter project would be the place to discuss that.

yonghaoy commented 1 year ago

Hey @jrobinso, I need to reopen this issue because we are broken by a host move again (I suspect it is caused by this genomes.json update). Our Content Security Policy now complains that (1) https://igv-genepattern-org.s3.amazonaws.com/ is not in our CSP allowlist, and (2) https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ncbiRefSeq.txt.gz is not in our CSP allowlist.

I checked https://s3.amazonaws.com/igv.org.genomes/genomes.json and saw that hg38 now uses those two URLs. You also mentioned in this issue that you changed to referring to UCSC URLs directly. I want to confirm: (1) is the breakage expected? (2) Do we just need to add the two URLs to our CSP allowlist? (3) How can we check the changelog for genome references?
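
A check like the one described above could be scripted roughly as follows. This is only a sketch: the function name `collect_origins` is hypothetical, and `sample_entry` is an illustrative stand-in built from URLs quoted in this thread, not the actual contents of genomes.json.

```python
from urllib.parse import urlparse


def collect_origins(node, found=None):
    """Recursively gather scheme://host for every http(s) string in node."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for value in node.values():
            collect_origins(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_origins(item, found)
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        parts = urlparse(node)
        found.add(f"{parts.scheme}://{parts.netloc}")
    return found


# Illustrative entry only (an assumption, not copied from genomes.json).
sample_entry = {
    "id": "hg38",
    "indexURL": "https://igv.genepattern.org/genomes/seq/hg38/hg38.fa.fai",
    "tracks": [
        {
            "name": "Refseq Genes",
            "url": "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ncbiRefSeq.txt.gz",
        }
    ],
}

origins = sorted(collect_origins(sample_entry))
for origin in origins:
    print(origin)
```

Because the function walks any nesting of dicts and lists, it could also be applied to the parsed contents of the real genomes.json to list every host a CSP allowlist would need to cover.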

Thanks

jrobinso commented 1 year ago

I don't understand question (1), but URLs to data will change from time to time. These changes are not noted in change logs, as the data are not part of the application. As I suggested above, you should consider hosting these files yourselves if you want absolute control. We provide them as a service, but it is not mandatory to use our (or UCSC's) hosted files. That said, these do not change often.

In the future we will be making more direct references to UCSC-hosted files, which will include the host you reference and possibly others in the UCSC domains. I had already listed https://igv-genepattern-org.s3.amazonaws.com/ earlier.

yonghaoy commented 1 year ago

Thanks @jrobinso. For (1), we were just trying to understand why IGV broke this time, and I want to confirm that the breakage was caused by IGV moving its hosted files (the ones specified in that genomes.json).

As for hosting our own reference, we are considering it. For now, I got it working by using the old reference URLs (the next step seems to be hosting those reference files ourselves and then replacing these URLs):

import igv_notebook

igv_notebook.init()
igv_browser = igv_notebook.Browser(
    {
        "reference": {
            "id": "custom_hg38",
            "name": "Custom HG38 reference that works in Terra",
            "fastaURL": "https://s3.amazonaws.com/igv.broadinstitute.org/genomes/seq/hg38/hg38.fa",
            "indexURL": "https://s3.amazonaws.com/igv.broadinstitute.org/genomes/seq/hg38/hg38.fa.fai",
            "aliasURL": "https://s3.amazonaws.com/igv.org.genomes/hg38/hg38_alias.tab",
            "tracks": [
                {
                    "name": "Refseq Genes",
                    "format": "refgene",
                    "url": "https://s3.amazonaws.com/igv.org.genomes/hg38/ncbiRefSeq.sorted.txt.gz",
                    "indexed": False,
                    "removable": False,
                    "order": 1000000,
                    "infoURL": "https://www.ncbi.nlm.nih.gov/gene/?term=$$"
                }
            ]
        },
        "locus": "chr22:24,376,277-24,376,350"
    })

igv_browser.load_track(
    {
        "name": "1000004 CRAM",
        "url": "wgs_1000004.cram",
        "format": "cram",
        "type": "alignment",
        "indexURL": "wgs_1000004.cram.crai",
        "indexed": True
    })
igv_browser.show()

jrobinso commented 1 year ago

Yes that is what I recommend if you want full control over the URLs. Specifically do not use the "genome: id" shortcut, but fully specify everything.
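
If the files are later moved to a server you control, one possible way to keep every URL fully specified is to derive the whole reference object from a single base URL, as in the sketch below. The host `https://genomes.example.org/hg38/` and the helper `build_reference` are hypothetical, and the file names simply mirror those in the snippet above on the assumption that the files are copied over unchanged.

```python
def build_reference(base_url, genome_id="custom_hg38", name="Custom hg38 reference"):
    """Hypothetical helper: fully specified reference config under base_url."""
    base = base_url.rstrip("/")
    return {
        "id": genome_id,
        "name": name,
        "fastaURL": f"{base}/hg38.fa",
        "indexURL": f"{base}/hg38.fa.fai",
        "aliasURL": f"{base}/hg38_alias.tab",
        "tracks": [
            {
                "name": "Refseq Genes",
                "format": "refgene",
                "url": f"{base}/ncbiRefSeq.sorted.txt.gz",
                # Python booleans, which serialize to JSON true/false.
                "indexed": False,
                "removable": False,
                "order": 1000000,
            }
        ],
    }


# Placeholder host (assumption): replace with your own server.
config = {
    "reference": build_reference("https://genomes.example.org/hg38/"),
    "locus": "chr22:24,376,277-24,376,350",
}
print(config["reference"]["fastaURL"])
```

The resulting `config` has the same shape as the dictionary passed to `igv_notebook.Browser` in the snippet above, so only the base URL would need to change if the files move again.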