opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Finngen and QTL catalogue credible sets have redundant tags in loci object #3570

Closed DSuveges closed 2 weeks ago

DSuveges commented 3 weeks ago

In the post-ETL credible set dataset (gs://ot_orchestration/releases/24.10_freeze1), a large number (~780k) of credible sets contain a non-unique list of tagging variants [1]:

+---------------------+------+
|projectId            |count |
+---------------------+------+
|Braineac2            |2050  |
|Kasela_2017          |2237  |
|BLUEPRINT            |35692 |
|OneK1K               |10080 |
|Schmiedel_2018       |39713 |
|Steinberg_2020       |4642  |
|Naranbhai_2015       |775   |
|CommonMind           |17299 |
|GENCORD              |8946  |
|CEDAR                |9661  |
|Walker_2019          |5167  |
|Nedelec_2016         |8697  |
|Aygun_2021           |2952  |
|TwinsUK              |28682 |
|Lepik_2017           |10159 |
|Cytoimmgen           |12529 |
|FINNGEN_R11_         |15503 |
|Alasoo_2018          |15570 |
|Randolph_2021        |609   |
|Jerber_2021          |3628  |
|ROSMAP               |23325 |
|Nathan_2022          |7533  |
|Sun_2018             |533   |
|Kim-Hellmuth_2017    |3685  |
|Young_2019           |893   |
|Perez_2022           |1126  |
|PhLiPS               |1892  |
|Bossini-Castillo_2019|8621  |
|Fairfax_2014         |11088 |
|van_de_Bunt_2015     |3203  |
|BrainSeq             |14785 |
|GTEx                 |378658|
|FUSION               |39510 |
|Fairfax_2012         |2019  |
|Quach_2016           |26332 |
|Gilchrist_2021       |2101  |
|Schwartzentruber_2018|3624  |
|PISA                 |3066  |
|Peng_2018            |2597  |
+---------------------+------+

One specific example from FinnGen: 0841316cd6b1a6106a686ffdc9e83ff9. This is what the post-ETL credible set looks like:

{
   "studyLocusId": "0841316cd6b1a6106a686ffdc9e83ff9",
   "studyId": "K11_CHOLELITH",
   "variantId": "11_1355653_C_T",
   "locus": [
      {
         "is95CredibleSet": true,
         "is99CredibleSet": true,
         "logBF": 16.9630462340828,
         "posteriorProbability": 0.294295702799749,
         "variantId": "11_1355653_C_T",
         "pValueMantissa": 9.866999626159668,
         "pValueExponent": -11,
         "beta": 0.0657921,
         "standardError": 0.0101704,
         "r2Overall": null
      },
      {
         "is95CredibleSet": true,
         "is99CredibleSet": true,
         "logBF": 16.9630462340828,
         "posteriorProbability": 0.294295702799749,
         "variantId": "11_1355653_C_T",
         "pValueMantissa": 9.866999626159668,
         "pValueExponent": -11,
         "beta": 0.0657921,
         "standardError": 0.0101704,
         "r2Overall": null
      },
      {
         "is95CredibleSet": true,
         "is99CredibleSet": true,
         "logBF": 16.9630462340828,
         "posteriorProbability": 0.294295702799749,
         "variantId": "11_1355653_C_T",
         "pValueMantissa": 9.866999626159668,
         "pValueExponent": -11,
         "beta": 0.0657921,
         "standardError": 0.0101704,
         "r2Overall": null
      },
      {
         "is95CredibleSet": true,
         "is99CredibleSet": true,
         "logBF": 16.9092476618534,
         "posteriorProbability": 0.278881366432721,
         "variantId": "11_1360830_C_A",
         "pValueMantissa": 1.0470000505447388,
         "pValueExponent": -10,
         "beta": 0.0656745,
         "standardError": 0.0101663,
         "r2Overall": null
      }
   ],
   "confidence": "Unknown confidence",
   "studyType": "gwas"
}

This data was already filtered to the 99% credible set when the ETL ran. The source data looks like this:

{
   "studyId": "K11_CHOLELITH",
   "variantId": "11_1355653_C_T",
   "locus": [
      {
         "variantId": "11_1355653_C_T",
         "posteriorProbability": 0.294295702799749,
         "logBF": 16.9630462340828,
         "pValueMantissa": 9.866999626159668,
         "pValueExponent": -11,
         "beta": 0.0657921,
         "standardError": 0.0101704,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "11_1355653_C_T",
         "posteriorProbability": 0.294295702799749,
         "logBF": 16.9630462340828,
         "pValueMantissa": 9.866999626159668,
         "pValueExponent": -11,
         "beta": 0.0657921,
         "standardError": 0.0101704,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "11_1355653_C_T",
         "posteriorProbability": 0.294295702799749,
         "logBF": 16.9630462340828,
         "pValueMantissa": 9.866999626159668,
         "pValueExponent": -11,
         "beta": 0.0657921,
         "standardError": 0.0101704,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "11_1360830_C_A",
         "posteriorProbability": 0.278881366432721,
         "logBF": 16.9092476618534,
         "pValueMantissa": 1.0470000505447388,
         "pValueExponent": -10,
         "beta": 0.0656745,
         "standardError": 0.0101663,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "11_1360830_C_A",
         "posteriorProbability": 0.278881366432721,
         "logBF": 16.9092476618534,
         "pValueMantissa": 1.0470000505447388,
         "pValueExponent": -10,
         "beta": 0.0656745,
         "standardError": 0.0101663,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "11_1360830_C_A",
         "posteriorProbability": 0.278881366432721,
         "logBF": 16.9092476618534,
         "pValueMantissa": 1.0470000505447388,
         "pValueExponent": -10,
         "beta": 0.0656745,
         "standardError": 0.0101663,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "11_1351380_C_T",
         "posteriorProbability": 0.215522856851774,
         "logBF": 16.6515281485747,
         "pValueMantissa": 1.3630000352859497,
         "pValueExponent": -10,
         "beta": 0.065255,
         "standardError": 0.0101644,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "11_1351380_C_T",
         "posteriorProbability": 0.215522856851774,
         "logBF": 16.6515281485747,
         "pValueMantissa": 1.3630000352859497,
         "pValueExponent": -10,
         "beta": 0.065255,
         "standardError": 0.0101644,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "11_1351380_C_T",
         "posteriorProbability": 0.215522856851774,
         "logBF": 16.6515281485747,
         "pValueMantissa": 1.3630000352859497,
         "pValueExponent": -10,
         "beta": 0.065255,
         "standardError": 0.0101644,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "11_1373027_A_T",
         "posteriorProbability": 0.196970019594571,
         "logBF": 16.5615127127355,
         "pValueMantissa": 1.3580000400543213,
         "pValueExponent": -10,
         "beta": 0.0648917,
         "standardError": 0.010107,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "11_1373027_A_T",
         "posteriorProbability": 0.196970019594571,
         "logBF": 16.5615127127355,
         "pValueMantissa": 1.3580000400543213,
         "pValueExponent": -10,
         "beta": 0.0648917,
         "standardError": 0.010107,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "11_1373027_A_T",
         "posteriorProbability": 0.196970019594571,
         "logBF": 16.5615127127355,
         "pValueMantissa": 1.3580000400543213,
         "pValueExponent": -10,
         "beta": 0.0648917,
         "standardError": 0.010107,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      }
   ]
}
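
One way to summarise the source record above is to count its locus entries three ways: in total, as distinct whole entries, and as distinct variantId values. The sketch below is plain Python rather than Spark; the `summarise_locus` helper and the trimmed-down entries are assumptions made for illustration and are not part of the pipeline. Comparing whole entries (which is what `array_distinct` does on an array of structs) treats two entries with the same variantId but different credible set flags as different.

```python
def summarise_locus(locus):
    """Return (total entries, distinct whole entries, distinct variantIds)."""
    # Compare whole entries by turning each dict into a sorted tuple of items.
    distinct_entries = {tuple(sorted(entry.items())) for entry in locus}
    distinct_variants = {entry["variantId"] for entry in locus}
    return len(locus), len(distinct_entries), len(distinct_variants)


# Trimmed-down copy of the source locus above: only variantId and the 95% flag
# are kept, which is enough to reproduce the duplication pattern.
source_locus = (
    [{"variantId": "11_1355653_C_T", "is95CredibleSet": True}] * 3
    + [{"variantId": "11_1360830_C_A", "is95CredibleSet": True}]
    + [{"variantId": "11_1360830_C_A", "is95CredibleSet": False}] * 2
    + [{"variantId": "11_1351380_C_T", "is95CredibleSet": False}] * 3
    + [{"variantId": "11_1373027_A_T", "is95CredibleSet": False}] * 3
)

total, distinct_entries, distinct_variants = summarise_locus(source_locus)
print(total, distinct_entries, distinct_variants)
```

For this record the counts are 12, 5 and 4, so deduplicating on whole entries alone would still leave 11_1360830_C_A in the locus twice.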

[1]: To get the distribution of non-unique loci:

from pyspark.sql import functions as f

# `spark` is assumed to be an active SparkSession.
(
    spark.read.parquet('gs://ot_orchestration/releases/24.10_freeze1/credible_set/')
    # Attach the projectId of each study to its credible sets.
    .join(
        spark.read.parquet('gs://ot_orchestration/releases/24.10_freeze1/study_index').select('studyId', 'projectId'),
        on='studyId', how='inner'
    )
    .select(
        'projectId',
        f.size(f.array_distinct(f.col('locus'))).alias('uniqueTagCount'),
        f.size('locus').alias('locusSize')
    )
    # Keep only credible sets whose locus contains repeated entries.
    .filter(f.col('uniqueTagCount') != f.col('locusSize'))
    .groupby('projectId')
    .count()
    .show(1000, truncate=False)
)

DSuveges commented 3 weeks ago

The redundancy in the FinnGen studies can probably be explained by redundant processing of the input data. It has been fixed in orchestration; see PR#44.

DSuveges commented 3 weeks ago

Redundancy is also quite prevalent in the eQTL Catalogue dataset. For these credible sets the explanation cannot be the same as for FinnGen. Let's look at the discrepancies:

+--------------------------------+---------+----------+------------------+
|studyLocusId                    |locusSize|uniqueSize|diff              |
+--------------------------------+---------+----------+------------------+
|a1a4f2ad30bd99b8d99b407c44282546|11       |2         |5.5               |
|ed46940a38948e1eec9f12147c4e7cb7|9        |2         |4.5               |
|6d86b145f3d5ba93f24af97454a107b0|9        |2         |4.5               |
|64523fa64e5b58742cc62782da91bb27|8        |2         |4.0               |
|473f790ab5bdc6ce8d1913cee373d103|8        |2         |4.0               |
|f357d44cc77ffa813eb9899f4be19630|7        |2         |3.5               |
|181e3f70e717711d86285bd65f74115f|7        |2         |3.5               |
|e7ce0b657a05bf138c7551bbf5ea5a5a|7        |2         |3.5               |
|642fe8a8c00a1120ec284b05517f24ee|7        |2         |3.5               |
|9d5afce72c791362a86cfd449346cc90|7        |2         |3.5               |
|98e4c31d803b389167f0859f6cc3c69d|17       |5         |3.4               |
|145e9749836a82e79ddfa6b29289c8ee|10       |3         |3.3333333333333335|
|73a32407a6460322019e5eda121615cc|19       |6         |3.1666666666666665|
|a5beb5e085c5911b9002994cc8be5bdc|6        |2         |3.0               |
|3565367ff63d7b5b3f43bc7b45ee3c67|6        |2         |3.0               |
|9552de8cbd44edec4f9570038a4f9d00|6        |2         |3.0               |
|07dabd281b999684a7ce70152ffb141f|6        |2         |3.0               |
|c9dedea9f3c93abab00798ab844b24d1|6        |2         |3.0               |
|9797c42224764bcb58fbdb4c52965cb8|6        |2         |3.0               |
|37b6db3dc373ef3aba7b51530db7158c|6        |2         |3.0               |
+--------------------------------+---------+----------+------------------+

For testing purposes I used credible set id a1a4f2ad30bd99b8d99b407c44282546, where the locus contains 11 variants, of which only 2 are unique. Tracking down the credible set to this file from source:

When manually pushing through the data, the resulting credible set is already faulty:

{
   "studyType": "eqtl",
   "variantId": "17_47829273_A_AGAAG",
   "chromosome": "17",
   "position": 47829273,
   "region": "chr17:46827139-48827139",
   "studyId": "Walker_2019_exon_Neocortex_ENSG00000159111.13_17_47827074_47827204",
   "beta": -0.38678,
   "pValueMantissa": 2.5420000553131104,
   "pValueExponent": -11,
   "standardError": 0.0546454,
   "finemappingMethod": "SuSie",
   "credibleSetIndex": 1,
   "locus": [
      {
         "variantId": "17_47829273_A_AGAAG",
         "posteriorProbability": 0.998347615745736,
         "pValueMantissa": 2.5420000553131104,
         "pValueExponent": -11,
         "logBF": 24.224072096733,
         "beta": -0.38678,
         "standardError": 0.0546454,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "17_47829273_A_AGAAG",
         "posteriorProbability": 0.998347615745736,
         "pValueMantissa": 2.5420000553131104,
         "pValueExponent": -11,
         "logBF": 24.224072096733,
         "beta": -0.38678,
         "standardError": 0.0546454,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "17_47829273_A_AGAAG",
         "posteriorProbability": 0.998347615745736,
         "pValueMantissa": 2.5420000553131104,
         "pValueExponent": -11,
         "logBF": 24.224072096733,
         "beta": -0.38678,
         "standardError": 0.0546454,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "17_47829273_A_AGAAG",
         "posteriorProbability": 0.998347615745736,
         "pValueMantissa": 2.5420000553131104,
         "pValueExponent": -11,
         "logBF": 24.224072096733,
         "beta": -0.38678,
         "standardError": 0.0546454,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "17_47829273_A_AGAAG",
         "posteriorProbability": 0.998347615745736,
         "pValueMantissa": 2.5420000553131104,
         "pValueExponent": -11,
         "logBF": 24.224072096733,
         "beta": -0.38678,
         "standardError": 0.0546454,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "17_47829273_A_AGAAG",
         "posteriorProbability": 0.998347615745736,
         "pValueMantissa": 2.5420000553131104,
         "pValueExponent": -11,
         "logBF": 24.224072096733,
         "beta": -0.38678,
         "standardError": 0.0546454,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "17_47829273_A_AGAAG",
         "posteriorProbability": 0.998347615745736,
         "pValueMantissa": 2.5420000553131104,
         "pValueExponent": -11,
         "logBF": 24.224072096733,
         "beta": -0.38678,
         "standardError": 0.0546454,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "17_47829273_A_AGAAG",
         "posteriorProbability": 0.998347615745736,
         "pValueMantissa": 2.5420000553131104,
         "pValueExponent": -11,
         "logBF": 24.224072096733,
         "beta": -0.38678,
         "standardError": 0.0546454,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "17_47829273_A_AGAAG",
         "posteriorProbability": 0.998347615745736,
         "pValueMantissa": 2.5420000553131104,
         "pValueExponent": -11,
         "logBF": 24.224072096733,
         "beta": -0.38678,
         "standardError": 0.0546454,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "17_47829273_A_AGAAG",
         "posteriorProbability": 0.998347615745736,
         "pValueMantissa": 2.5420000553131104,
         "pValueExponent": -11,
         "logBF": 24.224072096733,
         "beta": -0.38678,
         "standardError": 0.0546454,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "17_47829273_A_AGAAG",
         "posteriorProbability": 0.998347615745736,
         "pValueMantissa": 2.5420000553131104,
         "pValueExponent": -11,
         "logBF": 24.224072096733,
         "beta": -0.38678,
         "standardError": 0.0546454,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      }
   ],
   "studyLocusId": "a1a4f2ad30bd99b8d99b407c44282546",
   "credibleSetlog10BF": 26.621967315673828
}

Essentially, this is the same variant repeated 11 times. Looking at the data, the discrepancy comes from the credible set file, where the individual rows differ only in the rsid of the variant:

+--------------------+---------------+--------------------+--------------------+-----------+-------+-----------------+-----------+--------+---------+-----------------+---------+--------------------+
|  molecular_trait_id|        gene_id|               cs_id|             variant|       rsid|cs_size|              pip|     pvalue|    beta|       se|                z|cs_min_r2|              region|
+--------------------+---------------+--------------------+--------------------+-----------+-------+-----------------+-----------+--------+---------+-----------------+---------+--------------------+
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs377178742|      1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989|      1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs770565799|      1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989|      1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs776267482|      1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989|      1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs759373809|      1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989|      1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs765271788|      1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989|      1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs762770771|      1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989|      1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs764253842|      1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989|      1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs751795102|      1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989|      1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs757261331|      1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989|      1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs781245758|      1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989|      1.0|chr17:46827139-48...|
|ENSG00000159111.1...|ENSG00000159111|ENSG00000159111.1...|chr17_47829273_A_...|rs750564669|      1|0.998347615745736|2.54207E-11|-0.38678|0.0546454|-7.29249726981989|      1.0|chr17:46827139-48...|
+--------------------+---------------+--------------------+--------------------+-----------+-------+-----------------+-----------+--------+---------+-----------------+---------+--------------------+

The credible set size is telling, though: it says the size of the credible set is 1! However, it is exploded into 11 rows.
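
The mismatch between cs_size and the number of rows can be checked mechanically. The following is a minimal plain-Python sketch, assuming the credible set file has been loaded as a list of dicts with the column names shown above; `find_inflated_credible_sets` and the placeholder cs_id are hypothetical and do not come from the pipeline.

```python
from collections import defaultdict


def find_inflated_credible_sets(rows):
    """Return {cs_id: (cs_size, row_count, distinct_variant_count)} for every
    credible set whose row count differs from the reported cs_size."""
    row_counts = defaultdict(int)
    variants = defaultdict(set)
    reported = {}
    for row in rows:
        cs_id = row["cs_id"]
        row_counts[cs_id] += 1
        variants[cs_id].add(row["variant"])
        reported[cs_id] = row["cs_size"]
    return {
        cs_id: (reported[cs_id], row_counts[cs_id], len(variants[cs_id]))
        for cs_id in reported
        if row_counts[cs_id] != reported[cs_id]
    }


# Three of the eleven rows from the table above. The cs_id is a placeholder and
# the variant string is rebuilt from the variantId shown in the JSON record.
rows = [
    {"cs_id": "cs_example_L1", "variant": "chr17_47829273_A_AGAAG", "rsid": rsid, "cs_size": 1}
    for rsid in ("rs377178742", "rs770565799", "rs776267482")
]
print(find_inflated_credible_sets(rows))
```

A credible set is reported when it has more (or fewer) rows than cs_size claims; the third number in each tuple shows how many distinct variants are actually behind those rows.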

DSuveges commented 3 weeks ago

This is a hypothesis, but I assume that if a variant has multiple synonymous rsids, it might be exploded into multiple rows that still represent the same variant. Therefore I would suggest dropping the rsid column from the credible set file, then deduplicating the dataset.
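
The suggested fix can be sketched in a few lines. This is plain Python over a list of dicts, used only to keep the example dependency-free; in PySpark the analogous calls would be `DataFrame.drop` followed by `DataFrame.distinct`. The `drop_column_and_deduplicate` helper and the sample rows are assumptions for illustration, not the actual change made in the codebase.

```python
def drop_column_and_deduplicate(rows, column="rsid"):
    """Remove one column from every row, then keep the first copy of each
    remaining row, preserving the original order."""
    seen = set()
    unique_rows = []
    for row in rows:
        trimmed = {key: value for key, value in row.items() if key != column}
        fingerprint = tuple(sorted(trimmed.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique_rows.append(trimmed)
    return unique_rows


# Rows modelled on the credible set file shown earlier: one variant, three rsids.
rows = [
    {"variant": "chr17_47829273_A_AGAAG", "rsid": rsid, "pip": 0.998347615745736, "cs_size": 1}
    for rsid in ("rs377178742", "rs770565799", "rs776267482")
]
deduplicated = drop_column_and_deduplicate(rows)
print(len(rows), len(deduplicated))
```

One caveat of this approach: rows that differ in any remaining column are kept, so it only collapses rows that were split purely by rsid.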

DSuveges commented 3 weeks ago

Verifying the assumption on another credible set:

credible_sets_df.select('region', 'cs_id','variant', 'rsid', 'cs_size').show(truncate=False)

giving:

+------------------------+-------------------------------------------+--------------------+-----------+-------+
|region                  |cs_id                                      |variant             |rsid       |cs_size|
+------------------------+-------------------------------------------+--------------------+-----------+-------+
|chr1:99262333-101262333 |ENSG00000224616.4_1_100261863_100262804_L1 |chr1_100107320_T_A  |rs143305704|6      |
|chr1:99262333-101262333 |ENSG00000224616.4_1_100261863_100262804_L1 |chr1_100107327_C_G  |rs146122041|6      |
|chr1:99262333-101262333 |ENSG00000224616.4_1_100261863_100262804_L1 |chr1_100113425_G_A  |rs55646662 |6      |
|chr1:-2992-1997008      |ENSG00000272512.1_1_995966_998051_L1       |chr1_1001870_C_G    |rs28615823 |4      |
|chr1:99262333-101262333 |ENSG00000224616.4_1_100261863_100262804_L1 |chr1_100288534_G_A  |rs116180293|6      |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101063462_A_G  |rs17123647 |10     |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101064635_C_A  |rs3861735  |10     |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101067101_T_C  |rs3861736  |10     |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101077649_C_A  |rs4617425  |10     |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101080883_T_A  |rs1335745  |10     |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101092965_G_A  |rs6681915  |10     |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101097445_G_A  |rs12734559 |10     |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101106332_G_A  |rs12754497 |10     |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101109649_T_C  |rs36023027 |10     |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101116249_ATT_A|rs200745387|10     |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101116249_ATT_A|rs773683442|10     |
|chr1:100082229-102082229|ENSG00000233184.8_1_101080999_101083460_L1 |chr1_101116249_ATT_A|rs10599112 |10     |
|chr1:100996916-102996916|ENSG00000118733.17_1_101996906_101996926_L1|chr1_101960612_C_T  |rs17487709 |9      |
|chr1:100996916-102996916|ENSG00000118733.17_1_101996906_101996926_L1|chr1_101972448_G_C  |rs4908208  |9      |
|chr1:100996916-102996916|ENSG00000118733.17_1_101996906_101996926_L1|chr1_101991150_TA_T |rs11329059 |9      |
+------------------------+-------------------------------------------+--------------------+-----------+-------+
only showing top 20 rows

Selecting region: chr1:100082229-102082229

The resulting credible set:

{
   "studyType": "eqtl",
   "variantId": "1_101116249_ATT_A",
   "chromosome": "1",
   "position": 101116249,
   "region": "chr1:100082229-102082229",
   "studyId": "Walker_2019_exon_Neocortex_ENSG00000233184.8_1_101080999_101083460",
   "beta": 1.10259,
   "pValueMantissa": 1.1399999856948853,
   "pValueExponent": -29,
   "standardError": 0.081925,
   "finemappingMethod": "SuSie",
   "credibleSetIndex": 1,
   "locus": [
      {
         "variantId": "1_101116249_ATT_A",
         "posteriorProbability": 0.396304744880038,
         "pValueMantissa": 1.1399999856948853,
         "pValueExponent": -29,
         "logBF": 92.9045948948245,
         "beta": 1.10259,
         "standardError": 0.081925,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "1_101116249_ATT_A",
         "posteriorProbability": 0.396304744880038,
         "pValueMantissa": 1.1399999856948853,
         "pValueExponent": -29,
         "logBF": 92.9045948948245,
         "beta": 1.10259,
         "standardError": 0.081925,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "1_101116249_ATT_A",
         "posteriorProbability": 0.396304744880038,
         "pValueMantissa": 1.1399999856948853,
         "pValueExponent": -29,
         "logBF": 92.9045948948245,
         "beta": 1.10259,
         "standardError": 0.081925,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "1_101064635_C_A",
         "posteriorProbability": 0.122261723696096,
         "pValueMantissa": 2.805000066757202,
         "pValueExponent": -29,
         "logBF": 91.7236399455199,
         "beta": 1.054,
         "standardError": 0.0790674,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "1_101067101_T_C",
         "posteriorProbability": 0.111936706683957,
         "pValueMantissa": 3.0929999351501465,
         "pValueExponent": -29,
         "logBF": 91.6347496215131,
         "beta": 1.03552,
         "standardError": 0.077762,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "1_101077649_C_A",
         "posteriorProbability": 0.0777797001677824,
         "pValueMantissa": 4.349999904632568,
         "pValueExponent": -29,
         "logBF": 91.26725359487,
         "beta": 1.03997,
         "standardError": 0.0783816,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "1_101092965_G_A",
         "posteriorProbability": 0.0658009295016921,
         "pValueMantissa": 5.13100004196167,
         "pValueExponent": -29,
         "logBF": 91.0979558810145,
         "beta": 1.03681,
         "standardError": 0.0782822,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "1_101106332_G_A",
         "posteriorProbability": 0.065119220913797,
         "pValueMantissa": 5.236999988555908,
         "pValueExponent": -29,
         "logBF": 91.0873996688308,
         "beta": 1.0324,
         "standardError": 0.0779658,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "1_101097445_G_A",
         "posteriorProbability": 0.0537703720198748,
         "pValueMantissa": 6.247000217437744,
         "pValueExponent": -29,
         "logBF": 90.8930452720012,
         "beta": 1.03327,
         "standardError": 0.0781798,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "1_101109649_T_C",
         "posteriorProbability": 0.0438198628785832,
         "pValueMantissa": 7.86299991607666,
         "pValueExponent": -29,
         "logBF": 90.6846625471859,
         "beta": 1.03619,
         "standardError": 0.0785956,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "1_101080883_T_A",
         "posteriorProbability": 0.0131421670787694,
         "pValueMantissa": 2.361999988555908,
         "pValueExponent": -28,
         "logBF": 89.4320172326054,
         "beta": 1.02774,
         "standardError": 0.0788918,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      },
      {
         "variantId": "1_101063462_A_G",
         "posteriorProbability": 0.0119581296165722,
         "pValueMantissa": 2.690000057220459,
         "pValueExponent": -28,
         "logBF": 89.3305703016833,
         "beta": 1.03036,
         "standardError": 0.0792053,
         "is95CredibleSet": false,
         "is99CredibleSet": false
      }
   ],
   "studyLocusId": "2dab627c229e104cd400bde0a8c419ac",
   "credibleSetlog10BF": 94.38858795166016
}

As expected, the variant 1_101116249_ATT_A appears three times in the locus, once per rsid.

Let's apply the proposed fix:

{
   "studyType": "eqtl",
   "variantId": "1_101116249_ATT_A",
   "chromosome": "1",
   "position": 101116249,
   "region": "chr1:100082229-102082229",
   "studyId": "Walker_2019_exon_Neocortex_ENSG00000233184.8_1_101080999_101083460",
   "beta": 1.10259,
   "pValueMantissa": 1.1399999856948853,
   "pValueExponent": -29,
   "standardError": 0.081925,
   "finemappingMethod": "SuSie",
   "credibleSetIndex": 1,
   "locus": [
      {
         "variantId": "1_101116249_ATT_A",
         "posteriorProbability": 0.396304744880038,
         "pValueMantissa": 1.1399999856948853,
         "pValueExponent": -29,
         "logBF": 92.9045948948245,
         "beta": 1.10259,
         "standardError": 0.081925,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "1_101064635_C_A",
         "posteriorProbability": 0.122261723696096,
         "pValueMantissa": 2.805000066757202,
         "pValueExponent": -29,
         "logBF": 91.7236399455199,
         "beta": 1.054,
         "standardError": 0.0790674,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "1_101067101_T_C",
         "posteriorProbability": 0.111936706683957,
         "pValueMantissa": 3.0929999351501465,
         "pValueExponent": -29,
         "logBF": 91.6347496215131,
         "beta": 1.03552,
         "standardError": 0.077762,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "1_101077649_C_A",
         "posteriorProbability": 0.0777797001677824,
         "pValueMantissa": 4.349999904632568,
         "pValueExponent": -29,
         "logBF": 91.26725359487,
         "beta": 1.03997,
         "standardError": 0.0783816,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "1_101092965_G_A",
         "posteriorProbability": 0.0658009295016921,
         "pValueMantissa": 5.13100004196167,
         "pValueExponent": -29,
         "logBF": 91.0979558810145,
         "beta": 1.03681,
         "standardError": 0.0782822,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "1_101106332_G_A",
         "posteriorProbability": 0.065119220913797,
         "pValueMantissa": 5.236999988555908,
         "pValueExponent": -29,
         "logBF": 91.0873996688308,
         "beta": 1.0324,
         "standardError": 0.0779658,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "1_101097445_G_A",
         "posteriorProbability": 0.0537703720198748,
         "pValueMantissa": 6.247000217437744,
         "pValueExponent": -29,
         "logBF": 90.8930452720012,
         "beta": 1.03327,
         "standardError": 0.0781798,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "1_101109649_T_C",
         "posteriorProbability": 0.0438198628785832,
         "pValueMantissa": 7.86299991607666,
         "pValueExponent": -29,
         "logBF": 90.6846625471859,
         "beta": 1.03619,
         "standardError": 0.0785956,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "1_101080883_T_A",
         "posteriorProbability": 0.0131421670787694,
         "pValueMantissa": 2.361999988555908,
         "pValueExponent": -28,
         "logBF": 89.4320172326054,
         "beta": 1.02774,
         "standardError": 0.0788918,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      },
      {
         "variantId": "1_101063462_A_G",
         "posteriorProbability": 0.0119581296165722,
         "pValueMantissa": 2.690000057220459,
         "pValueExponent": -28,
         "logBF": 89.3305703016833,
         "beta": 1.03036,
         "standardError": 0.0792053,
         "is95CredibleSet": true,
         "is99CredibleSet": true
      }
   ],
   "studyLocusId": "2dab627c229e104cd400bde0a8c419ac",
   "credibleSetlog10BF": 93.78443908691406
}

The variant 1_101116249_ATT_A appears only once in the locus!!

DSuveges commented 3 weeks ago

Updating the QTL Catalogue tests to cover problematic credible sets. This means tinkering with the sample datasets as well.

(Pdb) self.study_locus.df.select(f.size("locus").alias("locus_size"), f.size(f.array_distinct("locus")).alias("locus_distinct_size")).show()
+----------+-------------------+
|locus_size|locus_distinct_size|
+----------+-------------------+
|        45|                 15|
|        34|                  6|
|        25|                 19|
+----------+-------------------+

This is then tested by an assertion:

        find_discrepancies = self.study_locus.df.select(
            f.size("locus").alias("locus_size"),
            f.size(f.array_distinct("locus")).alias("locus_distinct_size"),
        ).filter(f.col("locus_size") != f.col("locus_distinct_size"))
        # No credible set may contain duplicated locus entries.
        assert find_discrepancies.count() == 0

This test then passes when the new logic is introduced.

ireneisdoomed commented 3 weeks ago

Finngen's issues are still not fixed - reopening

project-defiant commented 3 weeks ago

The FinnGen DAG was rerun successfully today:

The results are in the staging bucket. Using the code above, I ran a short analysis to check whether the issue with duplicated loci still exists:

finngen_post_dag_data_checks.pdf

Two checks passed:

  1. Ensure that the prefix FINNGEN_R11_ exists in both studyIndex and StudyLocus files
  2. Ensure that no credible sets have duplicated loci
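
The two checks can be expressed roughly as follows. This is a plain-Python sketch over lists of dicts, not the code from the attached PDF; `check_prefix_present`, `credible_sets_with_duplicates`, and the sample records are hypothetical, and treating "exists" as "at least one matching record" is an assumption.

```python
def check_prefix_present(records, prefix, field="studyId"):
    """True if at least one record has a `field` value starting with `prefix`."""
    return any(str(record.get(field, "")).startswith(prefix) for record in records)


def credible_sets_with_duplicates(study_loci):
    """Return the studyLocusId of every credible set whose locus lists the
    same variantId more than once."""
    flagged = []
    for study_locus in study_loci:
        variant_ids = [entry["variantId"] for entry in study_locus["locus"]]
        if len(variant_ids) != len(set(variant_ids)):
            flagged.append(study_locus["studyLocusId"])
    return flagged


# Made-up sample records, only to show how the checks are called.
study_index = [{"studyId": "FINNGEN_R11_EXAMPLE"}]
study_locus = [
    {
        "studyLocusId": "example-locus-id",
        "studyId": "FINNGEN_R11_EXAMPLE",
        "locus": [{"variantId": "11_1355653_C_T"}, {"variantId": "11_1360830_C_A"}],
    }
]

print(check_prefix_present(study_index, "FINNGEN_R11_"))
print(check_prefix_present(study_locus, "FINNGEN_R11_"))
print(credible_sets_with_duplicates(study_locus))
```

Checking duplicates by variantId is stricter than comparing whole locus entries, since it also catches entries that share a variant but differ in another field.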

project-defiant commented 3 weeks ago

I think we can consider this issue as closed.