opencb / cellbase

High-performance NoSQL database and RESTful web services to access the most relevant biological data
Apache License 2.0

Add ExAC pLI and pLoF (oe) scores from gnomAD #386

Open javild opened 5 years ago

javild commented 5 years ago

ExAC pLI scores have been replaced by pLoF (oe) scores in gnomAD. You can find more info at https://macarthurlab.org/2018/10/17/gnomad-v2-1/

With gnomAD, we have shifted from using the probability of being loss-of-function intolerant (pLI) score developed with ExAC and now recommend using the observed / expected (oe) score. ... The change from pLI to oe was motivated mainly by its easier interpretation and its continuity across the spectrum of selection.

Tasks:

julie-sullivan commented 4 years ago

https://macarthurlab.org/2018/10/17/gnomad-v2-1/

julie-sullivan commented 4 years ago

Download looks okay.

2019-10-08 13:01:21 [main] INFO  DownloadCommandExecutor:542 - Downloading gnomAD data...
2019-10-08 13:01:21 [main] DEBUG EtlCommons:98 - Executing command: wget --tries=10 https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_transcript.txt.bgz -O /tmp/homo_sapiens_grch37/gnomad.v2.1.1.lof_metrics.by_transcript.txt.bgz -o /tmp/homo_sapiens_grch37/gnomad.v2.1.1.lof_metrics.by_transcript.txt.bgz.log
2019-10-08 13:01:23 [main] INFO  DownloadCommandExecutor:1192 - /tmp/homo_sapiens_grch37/gnomad.v2.1.1.lof_metrics.by_transcript.txt.bgz created OK

Will the build be able to handle the weirdo bgz file extension?

imedina commented 4 years ago

In Java you typically read BGZ files using the HTSJDK library:

https://static.javadoc.io/com.github.samtools/htsjdk/2.20.0/htsjdk/samtools/util/BlockCompressedInputStream.html
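As a complement to the HTSJDK class linked above, here is a minimal sketch that avoids the extra dependency. It relies on the fact that a BGZF (`.bgz`) file is a series of concatenated gzip members, which the JDK's `GZIPInputStream` can read sequentially. This is not the HTSJDK approach and gives no random access. The class name `BgzLineReader` and the helper `gzipMember` are hypothetical and exist only for this illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of CellBase or HTSJDK.
public class BgzLineReader {

    /** Reads all text lines from a gzip-compatible stream (BGZF is a series of gzip members). */
    public static List<String> readLines(InputStream compressed) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Demo helper: compresses one chunk of text as a single gzip member. */
    public static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Two concatenated gzip members stand in for two BGZF blocks.
        ByteArrayOutputStream blocks = new ByteArrayOutputStream();
        blocks.write(gzipMember("gene\toe_lof\n"));
        blocks.write(gzipMember("GENE1\t0.5\n"));
        for (String line : readLines(new ByteArrayInputStream(blocks.toByteArray()))) {
            System.out.println(line);
        }
    }
}
```

For the downloaded file, the same method should work with `Files.newInputStream(Paths.get(...))` as the input. If block-level seeking is ever needed, HTSJDK's `BlockCompressedInputStream` is the better fit.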

julie-sullivan commented 4 years ago

https://gnomad.broadinstitute.org/downloads

pLoF Metrics by Transcript TSV
pLoF Metrics by Gene TSV

julie-sullivan commented 4 years ago

data looks to be present in gene.json.gz:

"pvalue":2.4564667E-7},{"geneName":"ENSG00000168032","experimentalFactor":"organism_part","factorValue":"trachea","experimentId":"E-MTAB-25","technologyPlatform":"A-AFFY-33","expression":"UP","pvalue":0.036223516}],"constraints":
[{"source":"gnomAD","method":"pLoF","name":"oe_mis","value":0.98866},
{"source":"gnomAD","method":"pLoF","name":"oe_syn","value":0.88526},
{"source":"gnomAD","method":"pLoF","name":"oe_lof","value":1.1728}]}}

julie-sullivan commented 4 years ago

from mongo:

for the non-canonical transcript

            "annotation" : {
                "constraints" : [
                    {
                        "source" : "gnomAD",
                        "method" : "pLoF",
                        "name" : "oe_mis",
                        "value" : 1.0187
                    },
                    {
                        "source" : "gnomAD",
                        "method" : "pLoF",
                        "name" : "oe_syn",
                        "value" : 1.0252
                    },
                    {
                        "source" : "gnomAD",
                        "method" : "pLoF",
                        "name" : "oe_lof",
                        "value" : 0.73739
                    }
                ]

for the gene (and canonical transcript):

        "constraints" : [
            {
                "source" : "gnomAD",
                "method" : "pLoF",
                "name" : "oe_mis",
                "value" : 1.0141
            },
            {
                "source" : "gnomAD",
                "method" : "pLoF",
                "name" : "oe_syn",
                "value" : 1.0299
            },
            {
                "source" : "gnomAD",
                "method" : "pLoF",
                "name" : "oe_lof",
                "value" : 0.78457
            }

These values match the numbers in the text file.
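For illustration, here is a rough sketch of how one row of the tab-separated metrics file could be turned into constraint entries of the shape shown above. Columns are looked up by header name, so their order does not matter. The class `GnomadConstraintRow` and the use of plain maps are assumptions made for this example, not the CellBase builder code, and the header used in `main` is a made-up subset of columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the CellBase implementation.
public class GnomadConstraintRow {

    // Score names taken from the constraint entries shown in this thread.
    private static final List<String> SCORE_COLUMNS = Arrays.asList("oe_mis", "oe_syn", "oe_lof");

    /** Builds one constraint entry per score column that is present and numeric. */
    public static List<Map<String, Object>> toConstraints(String headerLine, String dataLine) {
        List<String> header = Arrays.asList(headerLine.split("\t", -1));
        String[] fields = dataLine.split("\t", -1);
        List<Map<String, Object>> constraints = new ArrayList<>();
        for (String column : SCORE_COLUMNS) {
            int index = header.indexOf(column);
            if (index < 0 || index >= fields.length) {
                continue; // column missing from this file or row
            }
            double value;
            try {
                value = Double.parseDouble(fields[index]);
            } catch (NumberFormatException e) {
                continue; // skip non-numeric placeholders such as "NA"
            }
            Map<String, Object> constraint = new LinkedHashMap<>();
            constraint.put("source", "gnomAD");
            constraint.put("method", "pLoF");
            constraint.put("name", column);
            constraint.put("value", value);
            constraints.add(constraint);
        }
        return constraints;
    }

    public static void main(String[] args) {
        // Assumed, simplified header; the real file has many more columns.
        String header = "gene\ttranscript\toe_lof\toe_mis\toe_syn";
        String row = "GENE1\tTRANSCRIPT1\t0.78457\t1.0141\t1.0299";
        System.out.println(toConstraints(header, row));
    }
}
```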

julie-sullivan commented 4 years ago

@imedina for the ExAC scores this is what's available:


exac_pLI | exac_obs_lof | exac_exp_lof | exac_oe_lof

I will load exac_oe_lof, do you want exac_pLI too?

julie-sullivan commented 4 years ago

And do we want them in the same Constraints array?

julie-sullivan commented 4 years ago

I added a unit test that compares the list of Constraints created by gnomAD using JUnit's assertEquals. This works fine if I add Constraint.equals() to biodata. I noticed that we overrode toString but not equals.

Would it be okay to add equals?

Otherwise, I can add a comparator in the unit test that iterates through the lists and compares the fields in each object.

See: https://github.com/opencb/biodata/pull/173
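To make the question concrete, here is a sketch of what value-based `equals` and `hashCode` could look like. `ConstraintSketch` is a stand-in class whose four fields mirror the JSON shown earlier in this thread. It is not the actual biodata `Constraint` class, whose field types and constructors may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Stand-in for illustration only; not the biodata Constraint class.
public class ConstraintSketch {
    private final String source;
    private final String method;
    private final String name;
    private final double value;

    public ConstraintSketch(String source, String method, String name, double value) {
        this.source = source;
        this.method = method;
        this.name = name;
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        ConstraintSketch that = (ConstraintSketch) o;
        // Double.compare treats NaN as equal to NaN, which keeps equals reflexive.
        return Double.compare(that.value, value) == 0
                && Objects.equals(source, that.source)
                && Objects.equals(method, that.method)
                && Objects.equals(name, that.name);
    }

    @Override
    public int hashCode() {
        // Must be overridden together with equals so hash-based collections behave.
        return Objects.hash(source, method, name, value);
    }

    public static void main(String[] args) {
        List<ConstraintSketch> expected = Arrays.asList(
                new ConstraintSketch("gnomAD", "pLoF", "oe_lof", 0.78457));
        List<ConstraintSketch> actual = Arrays.asList(
                new ConstraintSketch("gnomAD", "pLoF", "oe_lof", 0.78457));
        // List.equals compares elements pairwise with equals, which is what assertEquals relies on.
        System.out.println(expected.equals(actual));
    }
}
```

With `equals` defined this way, comparing two lists with `assertEquals` works without a custom comparator in the test.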

julie-sullivan commented 4 years ago

3.0 is released

https://macarthurlab.org/2019/10/16/gnomad-v3-0/

(don't have the new scores yet)

julie-sullivan commented 4 years ago

Download no longer works:

2019-10-28 14:32:03 [main] WARN  DownloadCommandExecutor:1013 - https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint//tmp/homo_sapiens_grch37/gene/gnomad.v2.1.1.lof_metrics.by_transcript.txt.bgz cannot be downloaded
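The URL in that warning has the local output path appended to the remote base URL. One way to avoid that kind of mix-up is to derive the remote URL and the local path separately from the bare file name. This is only a sketch of that idea. `DownloadTarget` and its methods are hypothetical, and I have not checked how `DownloadCommandExecutor` actually builds these strings.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not the CellBase download code.
public class DownloadTarget {

    /** Resolves the bare file name against the remote base URL. */
    public static String remoteUrl(String baseUrl, String fileName) {
        // A trailing slash makes URI.resolve append to the last path segment instead of replacing it.
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        return URI.create(base).resolve(fileName).toString();
    }

    /** Resolves the same bare file name against the local output directory. */
    public static Path localPath(Path outputDir, String fileName) {
        return outputDir.resolve(fileName);
    }

    public static void main(String[] args) {
        String base = "https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/";
        String name = "gnomad.v2.1.1.lof_metrics.by_transcript.txt.bgz";
        System.out.println(remoteUrl(base, name));
        System.out.println(localPath(Paths.get("/tmp/homo_sapiens_grch37/gene"), name));
    }
}
```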