Closed DSuveges closed 2 weeks ago
The redundancy in Finngen studies can probably be explained by redundant processing of the input data. It has been fixed in orchestration; see PR#44.
Redundancy in the eQTL Catalogue dataset is also quite prevalent. For these credible sets the explanation cannot be the same as for Finngen. Let's look at the discrepancies (diff is locusSize divided by uniqueSize):
+--------------------------------+---------+----------+------------------+
|studyLocusId |locusSize|uniqueSize|diff |
+--------------------------------+---------+----------+------------------+
|a1a4f2ad30bd99b8d99b407c44282546|11 |2 |5.5 |
|ed46940a38948e1eec9f12147c4e7cb7|9 |2 |4.5 |
|6d86b145f3d5ba93f24af97454a107b0|9 |2 |4.5 |
|64523fa64e5b58742cc62782da91bb27|8 |2 |4.0 |
|473f790ab5bdc6ce8d1913cee373d103|8 |2 |4.0 |
|f357d44cc77ffa813eb9899f4be19630|7 |2 |3.5 |
|181e3f70e717711d86285bd65f74115f|7 |2 |3.5 |
|e7ce0b657a05bf138c7551bbf5ea5a5a|7 |2 |3.5 |
|642fe8a8c00a1120ec284b05517f24ee|7 |2 |3.5 |
|9d5afce72c791362a86cfd449346cc90|7 |2 |3.5 |
|98e4c31d803b389167f0859f6cc3c69d|17 |5 |3.4 |
|145e9749836a82e79ddfa6b29289c8ee|10 |3 |3.3333333333333335|
|73a32407a6460322019e5eda121615cc|19 |6 |3.1666666666666665|
|a5beb5e085c5911b9002994cc8be5bdc|6 |2 |3.0 |
|3565367ff63d7b5b3f43bc7b45ee3c67|6 |2 |3.0 |
|9552de8cbd44edec4f9570038a4f9d00|6 |2 |3.0 |
|07dabd281b999684a7ce70152ffb141f|6 |2 |3.0 |
|c9dedea9f3c93abab00798ab844b24d1|6 |2 |3.0 |
|9797c42224764bcb58fbdb4c52965cb8|6 |2 |3.0 |
|37b6db3dc373ef3aba7b51530db7158c|6 |2 |3.0 |
+--------------------------------+---------+----------+------------------+
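For reference, the three numeric columns can be reproduced per credible set roughly as follows. This is a plain-Python sketch, not the Spark query that produced the table (that query is not shown in this thread). `locus_stats` is a hypothetical helper, and `diff` is assumed to be locusSize divided by uniqueSize, which matches the values above (for example 11 / 2 = 5.5).

```python
import json


def locus_stats(locus):
    """Return (locusSize, uniqueSize, diff) for one credible set's locus.

    `locus` is a list of dicts, one per tagging-variant entry.
    Hypothetical helper, for illustration only.
    """
    # Serialise each entry with sorted keys so identical entries compare equal.
    unique_entries = {json.dumps(entry, sort_keys=True) for entry in locus}
    locus_size = len(locus)
    unique_size = len(unique_entries)
    return locus_size, unique_size, locus_size / unique_size


# Same shape as credible set a1a4f2ad30bd99b8d99b407c44282546:
# one flagged entry followed by ten identical unflagged copies.
flagged = {"variantId": "17_47829273_A_AGAAG", "is95CredibleSet": True}
unflagged = {"variantId": "17_47829273_A_AGAAG", "is95CredibleSet": False}
example_locus = [flagged] + [dict(unflagged) for _ in range(10)]

print(locus_stats(example_locus))  # (11, 2, 5.5)
```

Note that uniqueSize counts distinct locus entries, not distinct variants: the eleven entries here all refer to one variant, but they differ in their credible-set flags, so two distinct entries remain.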
For testing purposes I used credible set id a1a4f2ad30bd99b8d99b407c44282546, where the locus contains 11 entries, of which only 2 are unique. I tracked the credible set down to these source files:
gs://eqtl_catalogue_data/ebi_ftp/susie/QTS000034/QTD000580/QTD000580.lbf_variable.txt.gz
gs://eqtl_catalogue_data/ebi_ftp/susie/QTS000034/QTD000580/QTD000580.credible_sets.tsv.gz
When the data is pushed through manually, the resulting credible set is already faulty:
{
"studyType": "eqtl",
"variantId": "17_47829273_A_AGAAG",
"chromosome": "17",
"position": 47829273,
"region": "chr17:46827139-48827139",
"studyId": "Walker_2019_exon_Neocortex_ENSG00000159111.13_17_47827074_47827204",
"beta": -0.38678,
"pValueMantissa": 2.5420000553131104,
"pValueExponent": -11,
"standardError": 0.0546454,
"finemappingMethod": "SuSie",
"credibleSetIndex": 1,
"locus": [
{
"variantId": "17_47829273_A_AGAAG",
"posteriorProbability": 0.998347615745736,
"pValueMantissa": 2.5420000553131104,
"pValueExponent": -11,
"logBF": 24.224072096733,
"beta": -0.38678,
"standardError": 0.0546454,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "17_47829273_A_AGAAG",
"posteriorProbability": 0.998347615745736,
"pValueMantissa": 2.5420000553131104,
"pValueExponent": -11,
"logBF": 24.224072096733,
"beta": -0.38678,
"standardError": 0.0546454,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "17_47829273_A_AGAAG",
"posteriorProbability": 0.998347615745736,
"pValueMantissa": 2.5420000553131104,
"pValueExponent": -11,
"logBF": 24.224072096733,
"beta": -0.38678,
"standardError": 0.0546454,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "17_47829273_A_AGAAG",
"posteriorProbability": 0.998347615745736,
"pValueMantissa": 2.5420000553131104,
"pValueExponent": -11,
"logBF": 24.224072096733,
"beta": -0.38678,
"standardError": 0.0546454,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "17_47829273_A_AGAAG",
"posteriorProbability": 0.998347615745736,
"pValueMantissa": 2.5420000553131104,
"pValueExponent": -11,
"logBF": 24.224072096733,
"beta": -0.38678,
"standardError": 0.0546454,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "17_47829273_A_AGAAG",
"posteriorProbability": 0.998347615745736,
"pValueMantissa": 2.5420000553131104,
"pValueExponent": -11,
"logBF": 24.224072096733,
"beta": -0.38678,
"standardError": 0.0546454,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "17_47829273_A_AGAAG",
"posteriorProbability": 0.998347615745736,
"pValueMantissa": 2.5420000553131104,
"pValueExponent": -11,
"logBF": 24.224072096733,
"beta": -0.38678,
"standardError": 0.0546454,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "17_47829273_A_AGAAG",
"posteriorProbability": 0.998347615745736,
"pValueMantissa": 2.5420000553131104,
"pValueExponent": -11,
"logBF": 24.224072096733,
"beta": -0.38678,
"standardError": 0.0546454,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "17_47829273_A_AGAAG",
"posteriorProbability": 0.998347615745736,
"pValueMantissa": 2.5420000553131104,
"pValueExponent": -11,
"logBF": 24.224072096733,
"beta": -0.38678,
"standardError": 0.0546454,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "17_47829273_A_AGAAG",
"posteriorProbability": 0.998347615745736,
"pValueMantissa": 2.5420000553131104,
"pValueExponent": -11,
"logBF": 24.224072096733,
"beta": -0.38678,
"standardError": 0.0546454,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "17_47829273_A_AGAAG",
"posteriorProbability": 0.998347615745736,
"pValueMantissa": 2.5420000553131104,
"pValueExponent": -11,
"logBF": 24.224072096733,
"beta": -0.38678,
"standardError": 0.0546454,
"is95CredibleSet": false,
"is99CredibleSet": false
}
],
"studyLocusId": "a1a4f2ad30bd99b8d99b407c44282546",
"credibleSetlog10BF": 26.621967315673828
}
Essentially, this is the same variant repeated 11 times. Looking at the data, the discrepancy comes from the credible set file, where the individual rows differ only in the rsid of the variant:
+--------------------+---------------+--------------------+--------------------+-----------+-------+-----------------+-----------+--------+---------+-----------------+---------+--------------------+
| molecular_trait_id| gene_id| cs_id| variant| rsid|cs_size| pip| pvalue| beta| se| z|cs_min_r2| region|
+--------------------+---------------+--------------------+--------------------+-----------+-------+-----------------+-----------+--------+---------+-----------------+---------+--------------------+
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs377178742| 1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989| 1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs770565799| 1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989| 1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs776267482| 1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989| 1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs759373809| 1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989| 1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs765271788| 1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989| 1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs762770771| 1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989| 1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs764253842| 1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989| 1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs751795102| 1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989| 1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs757261331| 1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989| 1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs781245758| 1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989| 1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs750564669| 1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989| 1.0|chr17:46827139-48...|
+--------------------+---------------+--------------------+--------------------+-----------+-------+-----------------+-----------+--------+---------+-----------------+---------+--------------------+
The credible set size is telling, though: it says the size of the credible set is 1, yet it is exploded into 11 rows.
This is a hypothesis, but I assume that if a variant has multiple synonymous rsids, those are exploded into multiple rows that still represent the same variant. Therefore I would suggest dropping the rsid column from the credible set file and then de-duplicating the dataset.
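The proposed fix can be sketched in plain Python as below. This is an illustration of the idea, not the code that went into the pipeline: `drop_rsid_and_deduplicate` is a hypothetical helper, and the sample rows carry only a few of the file's columns.

```python
def drop_rsid_and_deduplicate(rows):
    """Drop the `rsid` key from every row, then keep the first copy of each row."""
    seen = set()
    deduplicated = []
    for row in rows:
        stripped = {key: value for key, value in row.items() if key != "rsid"}
        # A hashable fingerprint of the remaining columns identifies duplicates.
        fingerprint = tuple(sorted(stripped.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            deduplicated.append(stripped)
    return deduplicated


cs_id = "ENSG00000233184.8_1_101080999_101083460_L1"
rows = [
    {"cs_id": cs_id, "variant": "chr1_101116249_ATT_A", "rsid": "rs200745387"},
    {"cs_id": cs_id, "variant": "chr1_101116249_ATT_A", "rsid": "rs773683442"},
    {"cs_id": cs_id, "variant": "chr1_101116249_ATT_A", "rsid": "rs10599112"},
    {"cs_id": cs_id, "variant": "chr1_101063462_A_G", "rsid": "rs17123647"},
]

fixed_rows = drop_rsid_and_deduplicate(rows)
print(len(rows), "->", len(fixed_rows))  # 4 -> 2
```

On a Spark DataFrame the same idea would typically be written as `credible_sets_df.drop("rsid").distinct()`.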
Verifying the assumption on another credible set:
credible_sets_df.select('region', 'cs_id','variant', 'rsid', 'cs_size').show(truncate=False)
giving:
+------------------------+-------------------------------------------+--------------------+-----------+-------+
|region |cs_id |variant |rsid |cs_size|
+------------------------+-------------------------------------------+--------------------+-----------+-------+
|chr1:99262333-101262333 |ENSG00000224616.4_1_100261863_100262804_L1 |chr1_100107320_T_A |rs143305704|6 |
|chr1:99262333-101262333 |ENSG00000224616.4_1_100261863_100262804_L1 |chr1_100107327_C_G |rs146122041|6 |
|chr1:99262333-101262333 |ENSG00000224616.4_1_100261863_100262804_L1 |chr1_100113425_G_A |rs55646662 |6 |
|chr1:-2992-1997008 |ENSG00000272512.1_1_995966_998051_L1 |chr1_1001870_C_G |rs28615823 |4 |
|chr1:99262333-101262333 |ENSG00000224616.4_1_100261863_100262804_L1 |chr1_100288534_G_A |rs116180293|6 |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101063462_A_G |rs17123647 |10 |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101064635_C_A |rs3861735 |10 |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101067101_T_C |rs3861736 |10 |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101077649_C_A |rs4617425 |10 |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101080883_T_A |rs1335745 |10 |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101092965_G_A |rs6681915 |10 |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101097445_G_A |rs12734559 |10 |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101106332_G_A |rs12754497 |10 |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101109649_T_C |rs36023027 |10 |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101116249_ATT_A|rs200745387|10 |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101116249_ATT_A|rs773683442|10 |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101116249_ATT_A|rs10599112 |10 |
|chr1:100996916-102996916|ENSG00000118733.17_1_101996906_101996926_L1|chr1_101960612_C_T |rs17487709 |9 |
|chr1:100996916-102996916|ENSG00000118733.17_1_101996906_101996926_L1|chr1_101972448_G_C |rs4908208 |9 |
|chr1:100996916-102996916|ENSG00000118733.17_1_101996906_101996926_L1|chr1_101991150_TA_T |rs11329059 |9 |
+------------------------+-------------------------------------------+--------------------+-----------+-------+
only showing top 20 rows
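One way to spot affected credible sets directly in the source file is to compare the number of rows per `cs_id` with the reported `cs_size`. The sketch below is plain Python with a hypothetical helper, `find_exploded_credible_sets`. It only makes sense on the complete file, since a truncated view like the 20 rows above can under-count rows for some credible sets.

```python
from collections import Counter


def find_exploded_credible_sets(rows):
    """Return {cs_id: (row_count, cs_size)} for sets with more rows than cs_size."""
    row_counts = Counter(row["cs_id"] for row in rows)
    reported_sizes = {row["cs_id"]: row["cs_size"] for row in rows}
    return {
        cs_id: (count, reported_sizes[cs_id])
        for cs_id, count in row_counts.items()
        if count > reported_sizes[cs_id]
    }


exploded_id = "ENSG00000233184.8_1_101080999_101083460_L1"
variants = [
    "chr1_101063462_A_G", "chr1_101064635_C_A", "chr1_101067101_T_C",
    "chr1_101077649_C_A", "chr1_101080883_T_A", "chr1_101092965_G_A",
    "chr1_101097445_G_A", "chr1_101106332_G_A", "chr1_101109649_T_C",
    # The same variant listed once per synonymous rsid:
    "chr1_101116249_ATT_A", "chr1_101116249_ATT_A", "chr1_101116249_ATT_A",
]
rows = [{"cs_id": exploded_id, "variant": v, "cs_size": 10} for v in variants]
# A credible set with fewer rows than cs_size is not reported.
rows.append(
    {"cs_id": "ENSG00000272512.1_1_995966_998051_L1", "variant": "chr1_1001870_C_G", "cs_size": 4}
)

print(find_exploded_credible_sets(rows))
```

For the rows shown in the table, this reports 12 rows against a `cs_size` of 10 for the chr1:100082229-102082229 credible set.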
Selecting region: chr1:100082229-102082229
The resulting credible set:
{
"studyType": "eqtl",
"variantId": "1_101116249_ATT_A",
"chromosome": "1",
"position": 101116249,
"region": "chr1:100082229-102082229",
"studyId": "Walker_2019_exon_Neocortex_ENSG00000233184.8_1_101080999_101083460",
"beta": 1.10259,
"pValueMantissa": 1.1399999856948853,
"pValueExponent": -29,
"standardError": 0.081925,
"finemappingMethod": "SuSie",
"credibleSetIndex": 1,
"locus": [
{
"variantId": "1_101116249_ATT_A",
"posteriorProbability": 0.396304744880038,
"pValueMantissa": 1.1399999856948853,
"pValueExponent": -29,
"logBF": 92.9045948948245,
"beta": 1.10259,
"standardError": 0.081925,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "1_101116249_ATT_A",
"posteriorProbability": 0.396304744880038,
"pValueMantissa": 1.1399999856948853,
"pValueExponent": -29,
"logBF": 92.9045948948245,
"beta": 1.10259,
"standardError": 0.081925,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "1_101116249_ATT_A",
"posteriorProbability": 0.396304744880038,
"pValueMantissa": 1.1399999856948853,
"pValueExponent": -29,
"logBF": 92.9045948948245,
"beta": 1.10259,
"standardError": 0.081925,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "1_101064635_C_A",
"posteriorProbability": 0.122261723696096,
"pValueMantissa": 2.805000066757202,
"pValueExponent": -29,
"logBF": 91.7236399455199,
"beta": 1.054,
"standardError": 0.0790674,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "1_101067101_T_C",
"posteriorProbability": 0.111936706683957,
"pValueMantissa": 3.0929999351501465,
"pValueExponent": -29,
"logBF": 91.6347496215131,
"beta": 1.03552,
"standardError": 0.077762,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "1_101077649_C_A",
"posteriorProbability": 0.0777797001677824,
"pValueMantissa": 4.349999904632568,
"pValueExponent": -29,
"logBF": 91.26725359487,
"beta": 1.03997,
"standardError": 0.0783816,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "1_101092965_G_A",
"posteriorProbability": 0.0658009295016921,
"pValueMantissa": 5.13100004196167,
"pValueExponent": -29,
"logBF": 91.0979558810145,
"beta": 1.03681,
"standardError": 0.0782822,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "1_101106332_G_A",
"posteriorProbability": 0.065119220913797,
"pValueMantissa": 5.236999988555908,
"pValueExponent": -29,
"logBF": 91.0873996688308,
"beta": 1.0324,
"standardError": 0.0779658,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "1_101097445_G_A",
"posteriorProbability": 0.0537703720198748,
"pValueMantissa": 6.247000217437744,
"pValueExponent": -29,
"logBF": 90.8930452720012,
"beta": 1.03327,
"standardError": 0.0781798,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "1_101109649_T_C",
"posteriorProbability": 0.0438198628785832,
"pValueMantissa": 7.86299991607666,
"pValueExponent": -29,
"logBF": 90.6846625471859,
"beta": 1.03619,
"standardError": 0.0785956,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "1_101080883_T_A",
"posteriorProbability": 0.0131421670787694,
"pValueMantissa": 2.361999988555908,
"pValueExponent": -28,
"logBF": 89.4320172326054,
"beta": 1.02774,
"standardError": 0.0788918,
"is95CredibleSet": false,
"is99CredibleSet": false
},
{
"variantId": "1_101063462_A_G",
"posteriorProbability": 0.0119581296165722,
"pValueMantissa": 2.690000057220459,
"pValueExponent": -28,
"logBF": 89.3305703016833,
"beta": 1.03036,
"standardError": 0.0792053,
"is95CredibleSet": false,
"is99CredibleSet": false
}
],
"studyLocusId": "2dab627c229e104cd400bde0a8c419ac",
"credibleSetlog10BF": 94.38858795166016
}
As expected, 1_101116249_ATT_A is repeated three times. Let's apply the proposed fix:
{
"studyType": "eqtl",
"variantId": "1_101116249_ATT_A",
"chromosome": "1",
"position": 101116249,
"region": "chr1:100082229-102082229",
"studyId": "Walker_2019_exon_Neocortex_ENSG00000233184.8_1_101080999_101083460",
"beta": 1.10259,
"pValueMantissa": 1.1399999856948853,
"pValueExponent": -29,
"standardError": 0.081925,
"finemappingMethod": "SuSie",
"credibleSetIndex": 1,
"locus": [
{
"variantId": "1_101116249_ATT_A",
"posteriorProbability": 0.396304744880038,
"pValueMantissa": 1.1399999856948853,
"pValueExponent": -29,
"logBF": 92.9045948948245,
"beta": 1.10259,
"standardError": 0.081925,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "1_101064635_C_A",
"posteriorProbability": 0.122261723696096,
"pValueMantissa": 2.805000066757202,
"pValueExponent": -29,
"logBF": 91.7236399455199,
"beta": 1.054,
"standardError": 0.0790674,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "1_101067101_T_C",
"posteriorProbability": 0.111936706683957,
"pValueMantissa": 3.0929999351501465,
"pValueExponent": -29,
"logBF": 91.6347496215131,
"beta": 1.03552,
"standardError": 0.077762,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "1_101077649_C_A",
"posteriorProbability": 0.0777797001677824,
"pValueMantissa": 4.349999904632568,
"pValueExponent": -29,
"logBF": 91.26725359487,
"beta": 1.03997,
"standardError": 0.0783816,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "1_101092965_G_A",
"posteriorProbability": 0.0658009295016921,
"pValueMantissa": 5.13100004196167,
"pValueExponent": -29,
"logBF": 91.0979558810145,
"beta": 1.03681,
"standardError": 0.0782822,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "1_101106332_G_A",
"posteriorProbability": 0.065119220913797,
"pValueMantissa": 5.236999988555908,
"pValueExponent": -29,
"logBF": 91.0873996688308,
"beta": 1.0324,
"standardError": 0.0779658,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "1_101097445_G_A",
"posteriorProbability": 0.0537703720198748,
"pValueMantissa": 6.247000217437744,
"pValueExponent": -29,
"logBF": 90.8930452720012,
"beta": 1.03327,
"standardError": 0.0781798,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "1_101109649_T_C",
"posteriorProbability": 0.0438198628785832,
"pValueMantissa": 7.86299991607666,
"pValueExponent": -29,
"logBF": 90.6846625471859,
"beta": 1.03619,
"standardError": 0.0785956,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "1_101080883_T_A",
"posteriorProbability": 0.0131421670787694,
"pValueMantissa": 2.361999988555908,
"pValueExponent": -28,
"logBF": 89.4320172326054,
"beta": 1.02774,
"standardError": 0.0788918,
"is95CredibleSet": true,
"is99CredibleSet": true
},
{
"variantId": "1_101063462_A_G",
"posteriorProbability": 0.0119581296165722,
"pValueMantissa": 2.690000057220459,
"pValueExponent": -28,
"logBF": 89.3305703016833,
"beta": 1.03036,
"standardError": 0.0792053,
"is95CredibleSet": true,
"is99CredibleSet": true
}
],
"studyLocusId": "2dab627c229e104cd400bde0a8c419ac",
"credibleSetlog10BF": 93.78443908691406
}
The variant 1_101116249_ATT_A appears only once in the locus!
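The fix also changes the `is95CredibleSet` and `is99CredibleSet` flags: before it, only the three duplicated entries were flagged; after it, all ten are. The ETL's flagging logic is not shown in this thread, so the rule below is an assumption: an entry is flagged while the posterior probability accumulated before it, in descending order, is still below the threshold. `flag_credible_set` is a hypothetical helper. Under that assumption the duplicates alone exceed the threshold, because 3 × 0.396 is already above 0.99.

```python
def flag_credible_set(pips, threshold):
    """Flag entries in descending-PIP order until the cumulative PIP reaches `threshold`.

    Assumed rule: an entry is flagged when the PIP mass accumulated *before* it
    is still below the threshold.
    """
    flags = []
    cumulative = 0.0
    for pip in sorted(pips, reverse=True):
        flags.append(cumulative < threshold)
        cumulative += pip
    return flags


# Posterior probabilities of the ten distinct variants in the locus above.
unique_pips = [
    0.396304744880038, 0.122261723696096, 0.111936706683957,
    0.0777797001677824, 0.0658009295016921, 0.065119220913797,
    0.0537703720198748, 0.0438198628785832, 0.0131421670787694,
    0.0119581296165722,
]
# Before the fix the lead variant was present three times.
duplicated_pips = [unique_pips[0]] * 2 + unique_pips

print(flag_credible_set(duplicated_pips, 0.95))  # three True, then nine False
print(flag_credible_set(unique_pips, 0.95))      # ten True
```

Both results match the flags in the two JSON outputs above, which is consistent with the assumed rule but does not confirm it.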
Updating the eQTL Catalogue tests to cover the problematic credible sets. This means tinkering with the sample datasets as well.
(Pdb) self.study_locus.df.select(f.size("locus").alias("locus_size"), f.size(f.array_distinct("locus")).alias("locus_distinct_size")).show()
+----------+-------------------+
|locus_size|locus_distinct_size|
+----------+-------------------+
| 45| 15|
| 34| 6|
| 25| 19|
+----------+-------------------+
This is then tested by an assertion:
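# `f` is pyspark.sql.functions (from pyspark.sql import functions as f).
# Keep only credible sets whose locus contains duplicated entries; none should remain.
find_discrepancies = self.study_locus.df.select(
    f.size("locus").alias("locus_size"),
    f.size(f.array_distinct("locus")).alias("locus_distinct_size"),
).filter(f.col("locus_size") != f.col("locus_distinct_size"))
assert find_discrepancies.count() == 0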
This test then passes when the new logic is introduced.
Finngen's issues are still not fixed - reopening
The Finngen DAG was rerun successfully today.
The results are in the staging bucket. Using the code above, I ran a short analysis to check whether the issue with duplicated loci still exists:
finngen_post_dag_data_checks.pdf
Two checks passed:
FINNGEN_R11_ exists in both the studyIndex and StudyLocus files
I think we can consider this issue closed.
In the post-ETL credible set dataset (gs://ot_orchestration/releases/24.10_freeze1) a large number (~780k) of credible sets contain a non-unique list of tagging variants [1]:
One specific example from Finngen: 0841316cd6b1a6106a686ffdc9e83ff9. This is what the post-ETL credible set looks like:
This data is already filtered for the 99% credible interval upon running the ETL. The source data looks like this:
[1]: To get the distribution of non-unique loci: