Closed JarrodBaker closed 1 year ago
For the time being the credible set (pre ETL) has been patched in the same way as reported in #2670 for coloc.
New output: gs://genetics-portal-dev-staging/finemapping/220224_merged/credset
Old output: gs://genetics-portal-dev-staging/finemapping/220224_merged/credset_obsolete
The snippet:
credset_path = 'gs://genetics-portal-dev-staging/finemapping/220224_merged/credset_obsolete'
phenotype_gene_path = 'gs://ot-team/dochoa/phenotype_id_gene_luts/'
credset = spark.read.json(credset_path)
phenotype_gene_lut = spark.read.csv(phenotype_gene_path, sep='\t', header=True).select(F.col('phenotype_id').alias('right_phenotype_id'), F.col('gene_id').alias('right_gene_id'))
credset_fix = (
credset.join(
phenotype_gene_lut,
credset['phenotype_id'].eqNullSafe(phenotype_gene_lut['right_phenotype_id']),
how='left',
)
.drop('right_phenotype_id')
.withColumn(
'gene_id',
F.when(F.col('right_gene_id').isNull(), F.regexp_extract('phenotype_id', r'(ENSG\d+)', 1)).otherwise(
F.col('right_gene_id')
),
)
.drop('right_gene_id', 'phenotype_id')
.distinct()
)
credset_fix.repartition(200).write.option('compression', 'gzip').json('gs://genetics-portal-dev-staging/finemapping/220224_merged/credset')
Some checks:
credset.count()
>> 60090436
credset_fix.count() # Again, slight increase due to the 1:M phenotype/gene relationship
>> 60095556
credset_fix.filter(F.col('type') != 'gwas').filter(F.col('gene_id').isNull()).count()
>> 0
credset_fix.filter(F.col('type') == 'gwas').filter(F.col('gene_id').isNull()).count()
>> 5077645
credset_fix.schema == credset.schema
>> False
In this case the fix results in a different schema because we are dropping phenotype_id
in favour of gene_id
.
credset_fix.printSchema()
>> root
|-- bio_feature: string (nullable = true)
|-- is95_credset: boolean (nullable = true)
|-- is99_credset: boolean (nullable = true)
|-- lead_alt: string (nullable = true)
|-- lead_chrom: string (nullable = true)
|-- lead_pos: long (nullable = true)
|-- lead_ref: string (nullable = true)
|-- lead_variant_id: string (nullable = true)
|-- logABF: double (nullable = true)
|-- multisignal_method: string (nullable = true)
|-- postprob: double (nullable = true)
|-- postprob_cumsum: double (nullable = true)
|-- study_id: string (nullable = true)
|-- tag_alt: string (nullable = true)
|-- tag_beta: double (nullable = true)
|-- tag_beta_cond: double (nullable = true)
|-- tag_chrom: string (nullable = true)
|-- tag_pos: long (nullable = true)
|-- tag_pval: double (nullable = true)
|-- tag_pval_cond: double (nullable = true)
|-- tag_ref: string (nullable = true)
|-- tag_se: double (nullable = true)
|-- tag_se_cond: double (nullable = true)
|-- tag_variant_id: string (nullable = true)
|-- type: string (nullable = true)
|-- gene_id: string (nullable = true)
we are dropping phenotype_id in favour of gene_id.
@ireneisdoomed we want to keep the phenotype_id
field, and add the gene_id
field. We use the phenotype_id
field in the FE.
@JarrodBaker File is fixed now. gene_id
and phenotype_id
are complete. Thanks ^^
@xyg123 Since the credible set is one of L2G's inputs, this means that we will have to rerun L2G, right?
We will work on this in the new pipeline
Of the ~60m rows in the
v2d_credset
data, an estimated ~21m records are not able to be displayed in the front-end due to an API/data bug.The
v2d_credset
contains a fieldphenotype_id
which sometimes contains an ENSG ID value. The API assumes that this column always contains an ENSG ID and uses it to map to variants. When this field is not an ENSG ID the mapping cannot occur.Estimated number of records effected by data type:
The
sqtl
records will be resolved in opentargets/platform#2688 .As an example, consider:
Navigating to the web UI (https://genetics.opentargets.org/variant/4_270769_G_A) we can see that there is no listed eQTL data for this study.
Ticket #2688 will add a column
gene_id
to the database.This only solves the issues of
sQTL
s, however all molecular traits should have an ENSG ID to allow these to be displayed on the FE.Currently the
credset
data is used as an input to thecoloc
data pipeline.coloc
uses thephenotype_id
information to annotate the record with agene_id
. This annotation should be moved up-stream tocredset
(and potentially further) so that we can reliably linkcredset
data to genes.No change should be required on the FE or API. The requested changes in opentargets/platform#2688 should automatically apply to this change.
This issue effects production as well as the development environment.