Unify gene index of the Genetics Portal and the Platform

DSuveges commented 2 years ago

As a developer I want to have a unified gene model for the Genetics Portal and Platform.

Background

The gene index used by the Genetics Portal and Platform is very similar, but not quite the same. It makes sense to unify these models to avoid duplication of effort to generate, duplication of code, accumulation of discrepancies.

Tasks

[ ] Get all gene annotation used by Platform and the Genetics Portal
[ ] As we can expect the Platform gene model is a superset of the Genetics with a few exception, moving the missing bits of the pipeline from genetics to platform.
[ ] Map all API endpoints of the Genetics Portal to the new, unified data model

Acceptance tests

The new pipeline generates the gene index.
The gene index has all required fields needed by both the portal and the genetics platform.
The Genetics Portal API picks up fields from the new index.

d0choa commented 2 years ago

The script create_genes_dictionary.py from the genetics pipeline is relevant to understand the logic in the genetics portal side.

Implementation-wise it's using a query in the Ensembl MySQL database. We can probably recover this information from the Ensembl JSON directly similar to how the platform ETL processes Ensembl information

JarrodBaker commented 2 years ago

@DSuveges, please have a look at platform#1929. The gene index in Genetics can be recreated using the existing Platform gene index with a select statement.

d0choa commented 2 years ago

Pasting the gene-index schema for reference based on @JarrodBaker comment here:

root
 |-- biotype: string (nullable = true)
 |-- chr: string (nullable = true)
 |-- description: string (nullable = true)
 |-- end: long (nullable = true)
 |-- exons: string (nullable = true)
 |-- fwdstrand: long (nullable = true)
 |-- gene_id: string (nullable = true)
 |-- gene_name: string (nullable = true)
 |-- start: long (nullable = true)
 |-- tss: long (nullable = true)

d0choa commented 2 years ago

Both the tss and exons (not available in the Platform target dataset) are required for the purpose of drawing the locus-plot in the genetics-app.

fwdstrand instead is not used but is there just for the purpose of deriving the tss as seen in the create_genes_dictionary.py:L111

As for the rest, I believe it's all in the Platform target dataset but we need to document which fields change name etc.

JarrodBaker commented 2 years ago

Thanks for the pickup @d0choa. I can regenerate tss easily in genetics using the target index. Exons requires a new fields to come out of the ETL. When I look at the Ensembl input file that we use (taken from ftp://ftp.ensembl.org/pub/release-105/json/homo_sapiens/homo_sapiens.json) and look at the structure I see (among other fields):

root                                                                                                                                                                                                                                        
 |-- ArrayExpress: array (nullable = true)                                                                                                                                                                                                  
 |    |-- element: string (containsNull = true)                                                                                                                                                                                             
 |-- CCDS: array (nullable = true)                                                                                                                                                                                                          
 |    |-- element: string (containsNull = true)                                                                                                                                                                                             
...
|-- taxon_id: long (nullable = true)                                                                                                                                                                                                       
 |-- transcripts: array (nullable = true)                                                                                                                                                                                                   
 |    |-- element: struct (containsNull = true)                                                                                                                                                                                             
 |    |    |-- exons: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end: string (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- seq_region_name: string (nullable = true)
 |    |    |    |    |-- start: string (nullable = true)
 |    |    |    |    |-- strand: string (nullable = true)

Would those be the exons we're after?

d0choa commented 2 years ago

Yes they are. To do like-for-like we just need a flattened version of them

d0choa commented 2 years ago

A random example from the genetics portal gene index:

>>> gene.select("exons").printSchema()
root
 |-- exons: string (nullable = true)

>>> gene.select("exons").show(1, False, True)
-RECORD 0----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 exons | [169893788,169893896,169888676,169888890,169878634,169878819,169875978,169876091,169873696,169873752,169870255,169870357,169868928,169869039,169866896,169866973,169864369,169864508,169862613,169862797,169859041,169859212,169854270,169854964,169849631,169853772]

We should probably add it properly to the Target index. Not as a string but using an array of objects with the same structure and all fields in Ensembl (e.g start, end, etc.).

We are going to need an intermediate step that filters the target index, renames fields, filters by biotype etc. so it's backwards compatible with the genetics portal. Eventually, we should aim to remove this intermediate step.

JarrodBaker commented 2 years ago

I'm having trouble identifying which is the correct transcript to extract the exons from. I download the json data from Ensembl and convert it to jsonl: wget ftp://ftp.ensembl.org/pub/release-105/json/homo_sapiens/homo_sapiens.json && jq -c '.genes[]' homo_sapiens.json >> homo_f.jsonl

When I read the file homo_f.jsonl into Spark I get the following schema:

jarrod-jarrod@ df.select('id, 'transcripts).printSchema            
root                                                     
 |-- id: string (nullable = true)                                  
 |-- transcripts: array (nullable = true)                              
... other fields                                  
 |    |    |-- exons: array (nullable = true)                      
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end: string (nullable = true)             
 |    |    |    |    |-- id: string (nullable = true)                  
 |    |    |    |    |-- seq_region_name: string (nullable = true) 
 |    |    |    |    |-- start: string (nullable = true)                
 |    |    |    |    |-- strand: string (nullable = true)    
...

transcripts is an array, and each transcript has an array of exons. I need to find out which transcript is the canonical transcript. Using a random gene as example you can see that one transcript is flagged as 'Ensembl canonical'.

This was previously handled in this script where there is field on the gene entry telling us the canonical transcript ID.

@DSuveges , @d0choa , please help!

PS. I've copied the raw file to gs://ot-team/jarrod/exons/homo_f.jsonl. It's ~2.8gb so best to use a VM.

d0choa commented 2 years ago

I was not very successful finding the info in the JSON, so I opened a ticket in the Ensembl helpdesk

d0choa commented 2 years ago

Ensembl's response:

Currently, there is no 'canonical transcript' flag in the JSON dumps, but we plan to add this to the files from Ensembl 108, which is due in September/October 2022.

ireneisdoomed commented 2 years ago

I just found out about the MANE Project and I think it is relevant: http://tark.ensembl.org/web/mane_project/

Among all the transcripts given for say COL1A1, the one tagged with MANE Select (ENST00000225964.10) indicates the most representative transcript for COL1A1. The project is particularly interesting because it harmonises Ensembl and RefSeq transcripts while ensuring perfect alignment to the GRCh38. They report in their publication 97% coverage of protein coding genes, which they will extend.

Although shown on the web, I don't see any flag on the JSON file. I'll keep an eye on their next one.

d0choa commented 2 years ago

Yes this is the project that feeds the definition of the canonical transcripts in Ensembl (documentation) since last year (blog).

This info won't make it to the Ensembl JSON until Sept/Oct 2022

DSuveges commented 2 years ago

If we can use the gene annotation file from GENCODE: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gff3.gz

This script provides the list of transcripts with a column flagging if a transcript is ENSEMBL canonical:

@f.udf(StructType([
    StructField('gene_id', StringType(), False),
    StructField('transcript_id', StringType(), False),
    StructField('gene_name', StringType(), True),
    StructField('gene_type', StringType(), True),
    StructField('tag', StringType(), True)
]))
def parse_annotation(annotation: str) -> StructType:
    """Parsing gff3 sequence annotation to get a dictionary of values"""

    columns = ['transcript_id', 'gene_id', 'gene_name', 'gene_type', 'tag']
    return {
        feature.split('=')[0]: feature.split('=')[1] for feature in annotation.split(';') if feature.split('=')[0] in columns
    }

gff3_columns = ['strand', 'phase', 'annotation']

gff3_schema = StructType([
    StructField("chr", StringType(), False),
    StructField("source", StringType(), False),
    StructField("featureType", StringType(), False),
    StructField("start", IntegerType(), False),
    StructField("end", IntegerType(), False),
    StructField("score", StringType(), False),
    StructField("strand", StringType(), False),
    StructField("phase", StringType(), False),
    StructField("annotation", StringType(), False),
])
(
    spark.read.option("comment", "#")
    .csv('/Users/dsuveges/Downloads/gencode.v40.annotation.gff3.gz', sep='\t', schema=gff3_schema)

    # Filter for coding sequences only:
    .filter(f.col('featureType') == 'transcript')

    # Parsing GFF3 annotation:
    .withColumn('parsed_annotation', parse_annotation(f.col('annotation')))

    # Selecting columns + extract fields:
    .select('chr', 'start', 'end', 'strand', 'parsed_annotation.*')
    .withColumn('tag', f.split('tag', ','))

    # Annotate transcripts with canonical flag. If no tag is found for a transcript, we assume it is not canonical:
    .withColumn(
        'isCanonical',
         f.when(
             f.array_contains(f.col('tag'), 'Ensembl_canonical') == True, True
         ).otherwise(False)
    )

    # Saving parquet:
    .write.parquet(outFile)
)

It gives this for HER3:

+-----+--------+--------+------+------------------+-----------------+---------+--------------+-----------------------------------------------------------------+-----------+
|chr  |start   |end     |strand|gene_id           |transcript_id    |gene_name|gene_type     |tag                                                              |isCanonical|
+-----+--------+--------+------+------------------+-----------------+---------+--------------+-----------------------------------------------------------------+-----------+
|chr12|56076799|56083902|+     |ENSG00000065361.17|ENST00000643266.1|ERBB3    |protein_coding|[mRNA_end_NF, cds_end_NF, RNA_Seq_supported_only]                |false      |
|chr12|56076912|56078561|+     |ENSG00000065361.17|ENST00000643518.1|ERBB3    |protein_coding|null                                                             |false      |
|chr12|56079857|56085139|+     |ENSG00000065361.17|ENST00000549282.5|ERBB3    |protein_coding|[alternative_5_UTR, mRNA_end_NF, cds_end_NF]                     |false      |
|chr12|56079857|56086635|+     |ENSG00000065361.17|ENST00000549061.5|ERBB3    |protein_coding|[mRNA_end_NF, cds_end_NF]                                        |false      |
|chr12|56079874|56103486|+     |ENSG00000065361.17|ENST00000683059.1|ERBB3    |protein_coding|[RNA_Seq_supported_only, basic]                                  |false      |
|chr12|56080137|56085623|+     |ENSG00000065361.17|ENST00000411731.6|ERBB3    |protein_coding|[basic, CCDS]                                                    |false      |
|chr12|56080139|56087154|+     |ENSG00000065361.17|ENST00000549672.6|ERBB3    |protein_coding|[RNA_Seq_supported_only]                                         |false      |
|chr12|56080142|56102804|+     |ENSG00000065361.17|ENST00000682431.1|ERBB3    |protein_coding|[RNA_Seq_supported_only]                                         |false      |
|chr12|56080165|56103505|+     |ENSG00000065361.17|ENST00000267101.8|ERBB3    |protein_coding|[basic, Ensembl_canonical, MANE_Select, appris_principal_1, CCDS]|true       |
|chr12|56080167|56096769|+     |ENSG00000065361.17|ENST00000550869.5|ERBB3    |protein_coding|null                                                             |false      |
|chr12|56080168|56102092|+     |ENSG00000065361.17|ENST00000551085.5|ERBB3    |protein_coding|null                                                             |false      |
|chr12|56080172|56102804|+     |ENSG00000065361.17|ENST00000684500.1|ERBB3    |protein_coding|[RNA_Seq_supported_only]                                         |false      |
|chr12|56080204|56086672|+     |ENSG00000065361.17|ENST00000546884.1|ERBB3    |protein_coding|null                                                             |false      |
|chr12|56080205|56102834|+     |ENSG00000065361.17|ENST00000551242.5|ERBB3    |protein_coding|null                                                             |false      |
|chr12|56080347|56102804|+     |ENSG00000065361.17|ENST00000683653.1|ERBB3    |protein_coding|[RNA_Seq_supported_only]                                         |false      |
|chr12|56083313|56103486|+     |ENSG00000065361.17|ENST00000683164.1|ERBB3    |protein_coding|[RNA_Seq_supported_partial, basic]                               |false      |
|chr12|56083319|56102336|+     |ENSG00000065361.17|ENST00000415288.6|ERBB3    |protein_coding|[basic]                                                          |false      |
|chr12|56083323|56103486|+     |ENSG00000065361.17|ENST00000683018.1|ERBB3    |protein_coding|[upstream_ATG, RNA_Seq_supported_only, basic]                    |false      |
|chr12|56086519|56088064|+     |ENSG00000065361.17|ENST00000549472.1|ERBB3    |protein_coding|null                                                             |false      |
|chr12|56087824|56088868|+     |ENSG00000065361.17|ENST00000546748.1|ERBB3    |protein_coding|null                                                             |false      |
|chr12|56088629|56088996|+     |ENSG00000065361.17|ENST00000551176.1|ERBB3    |protein_coding|null                                                             |false      |
|chr12|56093483|56096063|+     |ENSG00000065361.17|ENST00000549205.1|ERBB3    |protein_coding|null                                                             |false      |
|chr12|56093514|56093992|+     |ENSG00000065361.17|ENST00000550828.1|ERBB3    |protein_coding|[mRNA_start_NF, cds_start_NF]                                    |false      |
|chr12|56093832|56102834|+     |ENSG00000065361.17|ENST00000550070.6|ERBB3    |protein_coding|[mRNA_start_NF, cds_start_NF, dotter_confirmed]                  |false      |
|chr12|56094522|56095527|+     |ENSG00000065361.17|ENST00000549644.1|ERBB3    |protein_coding|null                                                             |false      |
|chr12|56095991|56102806|+     |ENSG00000065361.17|ENST00000553131.5|ERBB3    |protein_coding|null                                                             |false      |
|chr12|56096205|56102834|+     |ENSG00000065361.17|ENST00000684766.1|ERBB3    |protein_coding|[RNA_Seq_supported_only]                                         |false      |
|chr12|56097091|56102804|+     |ENSG00000065361.17|ENST00000683142.1|ERBB3    |protein_coding|[RNA_Seq_supported_only]                                         |false      |
|chr12|56097134|56098980|+     |ENSG00000065361.17|ENST00000548709.1|ERBB3    |protein_coding|null                                                             |false      |
|chr12|56097726|56102804|+     |ENSG00000065361.17|ENST00000682512.1|ERBB3    |protein_coding|[RNA_Seq_supported_only]                                         |false      |
|chr12|56098251|56103499|+     |ENSG00000065361.17|ENST00000549832.1|ERBB3    |protein_coding|[basic]                                                          |false      |
|chr12|56099139|56102802|+     |ENSG00000065361.17|ENST00000682873.1|ERBB3    |protein_coding|[RNA_Seq_supported_only]                                         |false      |
|chr12|56099866|56100300|+     |ENSG00000065361.17|ENST00000552691.1|ERBB3    |protein_coding|null                                                             |false      |
+-----+--------+--------+------+------------------+-----------------+---------+--------------+-----------------------------------------------------------------+-----------+

Important to note, not all transcripts has tag annotation. When such annotation is missing, I think it is safe to assume that transcript cannot be a canonical transcript.

ireneisdoomed commented 2 years ago

For the record, I have checked the latest Ensembl version (Ensembl 107) and this annotation is still not available under transcripts.

JarrodBaker commented 2 years ago

Thanks @ireneisdoomed: I'm glad I was sitting down for that. For the time being we have the logic in place to get the information for elsewhere. Once it's available from Ensembl we'll use their data.

d0choa commented 2 years ago

Still within the timeline they provided. We'll see in their next release

d0choa commented 2 years ago

In the latest release of Ensembl (108) we can find a LUT with the canonical transcripts: https://ftp.ensembl.org/pub/release-108/tsv/homo_sapiens/Homo_sapiens.GRCh38.108.canonical.tsv.gz

❯ head Homo_sapiens.GRCh38.108.canonical.tsv
ENSG00000210049.1   ENST00000387314.1   Ensembl Canonical
ENSG00000211459.2   ENST00000389680.2   Ensembl Canonical
ENSG00000210077.1   ENST00000387342.1   Ensembl Canonical
ENSG00000210082.2   ENST00000387347.2   Ensembl Canonical
ENSG00000209082.1   ENST00000386347.1   Ensembl Canonical
ENSG00000198888.2   ENST00000361390.2   Ensembl Canonical
ENSG00000210100.1   ENST00000387365.1   Ensembl Canonical
ENSG00000210107.1   ENST00000387372.1   Ensembl Canonical
ENSG00000210112.1   ENST00000387377.1   Ensembl Canonical
ENSG00000198763.3   ENST00000361453.3   Ensembl Canonical

❯ wc -l Homo_sapiens.GRCh38.108.canonical.tsv
   88410 Homo_sapiens.GRCh38.108.canonical.tsv

We would like to add to the target object 2 new fields:

A canonicalTranscript StringType field containing the ENST id without the versioning appendix (.X). E.g. ENST00000387314. This field should not contain nulls
A canonicalExons column that mimic the original exons schema from Ensembl but only contains the exons in the canonical transcript. This field can contain nulls.

 |-- canonicalExons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- end: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- seq_region_name: string (nullable = true)
 |    |    |-- start: string (nullable = true)
 |    |    |-- strand: string (nullable = true)

We don't need the flattened version of the exons in the target object.

ireneisdoomed commented 2 years ago

Adding up to what @d0choa commented above, we want to extract an additional column tss that will be based on the canonical transcript coordinates made available in Ensembl's target json file.

The previous proposal was to add 2 new fields: canonicalTranscript and canonicalExons.

Since we are interested in new information, we now want to encapsulate all transcript annotation under the same struct so that the schema is as follows:

target
 |-- id: string (nullable = true)
... other fields
 |-- canonicalTranscript: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- start: long (nullable = true)
 |    |-- end: long (nullable = true)
 |    |-- exons: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- end: string (nullable = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- seq_region_name: string (nullable = true)
 |    |    |    |-- start: string (nullable = true)
 |    |    |    |-- strand: string (nullable = true)
 |    |-- tss: long (nullable = true)

The way to get to this is by:

Joining the Ensembl target JSON with the LUT to get the canonical transcripts.
- Note: we won't always have a canonical transcript
- Note/Question: MANE Plus Clinical is a small set of additional transcripts to also account for clinically relevant variants that fall outside the MANE transcript. If we keep this source, we will have 2 canonical transcripts for ~50 genes. Do we want to filter it out? @DSuveges @d0choa
Filter the transcripts array to only include annotation from the canonical transcript. In the end you always end up with an array of size 1 that you can explode.
Extract the TSS of the canonical transcript.
Filter the transcript struct to only include the fields we are interested in. We are not doing any transformation here other than selecting and adding the tss to the struct.

This is a prototype of how I would do it in Spark. I hope it helps.

schema = t.StructType(
    [
        t.StructField("id", t.StringType(), True),
        t.StructField("canonicalTranscriptId", t.StringType(), True),
        t.StructField("source", t.StringType(), True),
    ]
)
canonical_lut = (
    spark.read.csv("gs://ot-team/irene/human_canonical.tsv", sep="\t", schema=schema)
    .repartition(20)
    .select(
        f.split("id", "\.")[0].alias("id"),
        f.split("canonicalTranscriptId", "\.")[0].alias("canonicalTranscriptId"),
        "source",
    )
    .filter(f.col("source") != "MANE Plus Clinical")
)
targets = (
    spark.read.json("gs://open-targets-data-releases/22.09/input/target-inputs/ensembl/homo_sapiens.jsonl")
    .select("id", "strand", "transcripts")
    .repartition(20)
)

targets_with_canonical = (
    targets.join(canonical_lut, on="id", how="left")
    .select(
        "*",
        # Filter the transcripts array to keep only the canonical transcript
        f.explode(f.expr("filter(transcripts, array -> (array.id == canonicalTranscriptId))")).alias(
            "transcript_transformed"
        ),
    )
    .withColumn(
        "tss",
        f.when(f.col("strand") == 1, f.col("transcript_transformed.start")).otherwise(
            f.col("transcript_transformed.end")
        ),
    )
    .select(
        "id",
        "strand",
        f.struct(
            "transcript_transformed.id",
            "transcript_transformed.start",
            "transcript_transformed.end",
            "transcript_transformed.exons",
            "tss",
        ).alias("canonicalTranscript"),
    )
    .distinct()
)

targets_with_canonical.select("id", "canonicalTranscript.*").show(5)
+---------------+---------------+---------+---------+--------------------+---------+
|             id|             id|    start|      end|               exons|      tss|
+---------------+---------------+---------+---------+--------------------+---------+
|ENSG00000257231|ENST00000549314|113789542|113789807|[{113789807, ENSE...|113789807|
|ENSG00000211900|ENST00000390560|105863198|105863258|[{105863258, ENSE...|105863258|
|ENSG00000200681|ENST00000363811| 18306973| 18307079|[{18307079, ENSE0...| 18307079|
|ENSG00000233885|ENST00000425008|183806457|183809515|[{183808112, ENSE...|183809515|
|ENSG00000266279|ENST00000585275|  8337697|  8338758|[{8338758, ENSE00...|  8337697|
+---------------+---------------+---------+---------+--------------------+---------+

d0choa commented 2 years ago

I would keep it simple and store only one canonical per gene.

The rationale is that I randomly browsed some of the genes that would have more than one canonical transcript and the clinical information that I could extract mostly points to rare diseases, which is not our main use case for the canonical transcripts at the moment.

ireneisdoomed commented 2 years ago

I agree. In that case, the logic explained above is what we want to integrate.

ireneisdoomed commented 1 year ago

Just a note. I've checked the overlap between Ensembl's LUT and GENCODE's. The overlap is practically exact, however, there are some discrepancies (that might be due to differences in the version of the data):

GENCODE reports a different canonical transcript for 643 genes.
1263 genes are lacking this annotation from GENCODE at the moment.

All genes are covered by the Ensembl's LUT. A version of the latest target dataset (Ensembl 108) with these manual changes is available at gs://genetics_etl_python_playground/input/v2g_input/targets_correct_tss

d0choa commented 1 year ago

completed

opentargets / issues