Closed DSuveges closed 1 year ago
The script create_genes_dictionary.py from the genetics pipeline is relevant to understand the logic in the genetics portal side.
Implementation-wise it's using a query in the Ensembl MySQL database. We can probably recover this information from the Ensembl JSON directly similar to how the platform ETL processes Ensembl information
@DSuveges, please have a look at platform#1929. The gene index in Genetics can be recreated using the existing Platform gene index with a select statement.
Pasting the gene-index schema for reference based on @JarrodBaker comment here:
root
|-- biotype: string (nullable = true)
|-- chr: string (nullable = true)
|-- description: string (nullable = true)
|-- end: long (nullable = true)
|-- exons: string (nullable = true)
|-- fwdstrand: long (nullable = true)
|-- gene_id: string (nullable = true)
|-- gene_name: string (nullable = true)
|-- start: long (nullable = true)
|-- tss: long (nullable = true)
Both the tss
and exons
(not available in the Platform target
dataset) are required for the purpose of drawing the locus-plot in the genetics-app.
fwdstrand
instead is not used but is there just for the purpose of deriving the tss
as seen in the create_genes_dictionary.py:L111
As for the rest, I believe it's all in the Platform target dataset but we need to document which fields change name etc.
Thanks for the pickup @d0choa. I can regenerate tss
easily in genetics using the target index. Exons requires a new fields to come out of the ETL. When I look at the Ensembl input file that we use (taken from ftp://ftp.ensembl.org/pub/release-105/json/homo_sapiens/homo_sapiens.json) and look at the structure I see (among other fields):
root
|-- ArrayExpress: array (nullable = true)
| |-- element: string (containsNull = true)
|-- CCDS: array (nullable = true)
| |-- element: string (containsNull = true)
...
|-- taxon_id: long (nullable = true)
|-- transcripts: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- exons: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- end: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- seq_region_name: string (nullable = true)
| | | | |-- start: string (nullable = true)
| | | | |-- strand: string (nullable = true)
Would those be the exons
we're after?
Yes they are. To do like-for-like we just need a flattened version of them
A random example from the genetics portal gene index:
>>> gene.select("exons").printSchema()
root
|-- exons: string (nullable = true)
>>> gene.select("exons").show(1, False, True)
-RECORD 0----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
exons | [169893788,169893896,169888676,169888890,169878634,169878819,169875978,169876091,169873696,169873752,169870255,169870357,169868928,169869039,169866896,169866973,169864369,169864508,169862613,169862797,169859041,169859212,169854270,169854964,169849631,169853772]
We should probably add it properly to the Target index. Not as a string but using an array of objects with the same structure and all fields in Ensembl (e.g start
, end
, etc.).
We are going to need an intermediate step that filters the target index, renames fields, filters by biotype etc. so it's backwards compatible with the genetics portal. Eventually, we should aim to remove this intermediate step.
I'm having trouble identifying which is the correct transcript to extract the exons from. I download the json data from Ensembl and convert it to jsonl
:
wget ftp://ftp.ensembl.org/pub/release-105/json/homo_sapiens/homo_sapiens.json && jq -c '.genes[]' homo_sapiens.json >> homo_f.jsonl
When I read the file homo_f.jsonl
into Spark I get the following schema:
jarrod-jarrod@ df.select('id, 'transcripts).printSchema
root
|-- id: string (nullable = true)
|-- transcripts: array (nullable = true)
... other fields
| | |-- exons: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- end: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- seq_region_name: string (nullable = true)
| | | | |-- start: string (nullable = true)
| | | | |-- strand: string (nullable = true)
...
transcripts
is an array, and each transcript has an array of exons
. I need to find out which transcript is the canonical transcript. Using a random gene as example you can see that one transcript is flagged as 'Ensembl canonical'.
This was previously handled in this script where there is field on the gene entry telling us the canonical transcript ID.
@DSuveges , @d0choa , please help!
PS. I've copied the raw file to gs://ot-team/jarrod/exons/homo_f.jsonl
. It's ~2.8gb so best to use a VM.
I was not very successful finding the info in the JSON, so I opened a ticket in the Ensembl helpdesk
Ensembl's response:
Currently, there is no 'canonical transcript' flag in the JSON dumps, but we plan to add this to the files from Ensembl 108, which is due in September/October 2022.
I just found out about the MANE Project and I think it is relevant: http://tark.ensembl.org/web/mane_project/
Among all the transcripts given for say COL1A1, the one tagged with MANE Select (ENST00000225964.10) indicates the most representative transcript for COL1A1. The project is particularly interesting because it harmonises Ensembl and RefSeq transcripts while ensuring perfect alignment to the GRCh38. They report in their publication 97% coverage of protein coding genes, which they will extend.
Although shown on the web, I don't see any flag on the JSON file. I'll keep an eye on their next one.
Yes this is the project that feeds the definition of the canonical transcripts in Ensembl (documentation) since last year (blog).
This info won't make it to the Ensembl JSON until Sept/Oct 2022
If we can use the gene annotation file from GENCODE: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gff3.gz
This script provides the list of transcripts with a column flagging if a transcript is ENSEMBL canonical:
@f.udf(StructType([
StructField('gene_id', StringType(), False),
StructField('transcript_id', StringType(), False),
StructField('gene_name', StringType(), True),
StructField('gene_type', StringType(), True),
StructField('tag', StringType(), True)
]))
def parse_annotation(annotation: str) -> StructType:
"""Parsing gff3 sequence annotation to get a dictionary of values"""
columns = ['transcript_id', 'gene_id', 'gene_name', 'gene_type', 'tag']
return {
feature.split('=')[0]: feature.split('=')[1] for feature in annotation.split(';') if feature.split('=')[0] in columns
}
gff3_columns = ['strand', 'phase', 'annotation']
gff3_schema = StructType([
StructField("chr", StringType(), False),
StructField("source", StringType(), False),
StructField("featureType", StringType(), False),
StructField("start", IntegerType(), False),
StructField("end", IntegerType(), False),
StructField("score", StringType(), False),
StructField("strand", StringType(), False),
StructField("phase", StringType(), False),
StructField("annotation", StringType(), False),
])
(
spark.read.option("comment", "#")
.csv('/Users/dsuveges/Downloads/gencode.v40.annotation.gff3.gz', sep='\t', schema=gff3_schema)
# Filter for coding sequences only:
.filter(f.col('featureType') == 'transcript')
# Parsing GFF3 annotation:
.withColumn('parsed_annotation', parse_annotation(f.col('annotation')))
# Selecting columns + extract fields:
.select('chr', 'start', 'end', 'strand', 'parsed_annotation.*')
.withColumn('tag', f.split('tag', ','))
# Annotate transcripts with canonical flag. If no tag is found for a transcript, we assume it is not canonical:
.withColumn(
'isCanonical',
f.when(
f.array_contains(f.col('tag'), 'Ensembl_canonical') == True, True
).otherwise(False)
)
# Saving parquet:
.write.parquet(outFile)
)
It gives this for HER3:
+-----+--------+--------+------+------------------+-----------------+---------+--------------+-----------------------------------------------------------------+-----------+
|chr |start |end |strand|gene_id |transcript_id |gene_name|gene_type |tag |isCanonical|
+-----+--------+--------+------+------------------+-----------------+---------+--------------+-----------------------------------------------------------------+-----------+
|chr12|56076799|56083902|+ |ENSG00000065361.17|ENST00000643266.1|ERBB3 |protein_coding|[mRNA_end_NF, cds_end_NF, RNA_Seq_supported_only] |false |
|chr12|56076912|56078561|+ |ENSG00000065361.17|ENST00000643518.1|ERBB3 |protein_coding|null |false |
|chr12|56079857|56085139|+ |ENSG00000065361.17|ENST00000549282.5|ERBB3 |protein_coding|[alternative_5_UTR, mRNA_end_NF, cds_end_NF] |false |
|chr12|56079857|56086635|+ |ENSG00000065361.17|ENST00000549061.5|ERBB3 |protein_coding|[mRNA_end_NF, cds_end_NF] |false |
|chr12|56079874|56103486|+ |ENSG00000065361.17|ENST00000683059.1|ERBB3 |protein_coding|[RNA_Seq_supported_only, basic] |false |
|chr12|56080137|56085623|+ |ENSG00000065361.17|ENST00000411731.6|ERBB3 |protein_coding|[basic, CCDS] |false |
|chr12|56080139|56087154|+ |ENSG00000065361.17|ENST00000549672.6|ERBB3 |protein_coding|[RNA_Seq_supported_only] |false |
|chr12|56080142|56102804|+ |ENSG00000065361.17|ENST00000682431.1|ERBB3 |protein_coding|[RNA_Seq_supported_only] |false |
|chr12|56080165|56103505|+ |ENSG00000065361.17|ENST00000267101.8|ERBB3 |protein_coding|[basic, Ensembl_canonical, MANE_Select, appris_principal_1, CCDS]|true |
|chr12|56080167|56096769|+ |ENSG00000065361.17|ENST00000550869.5|ERBB3 |protein_coding|null |false |
|chr12|56080168|56102092|+ |ENSG00000065361.17|ENST00000551085.5|ERBB3 |protein_coding|null |false |
|chr12|56080172|56102804|+ |ENSG00000065361.17|ENST00000684500.1|ERBB3 |protein_coding|[RNA_Seq_supported_only] |false |
|chr12|56080204|56086672|+ |ENSG00000065361.17|ENST00000546884.1|ERBB3 |protein_coding|null |false |
|chr12|56080205|56102834|+ |ENSG00000065361.17|ENST00000551242.5|ERBB3 |protein_coding|null |false |
|chr12|56080347|56102804|+ |ENSG00000065361.17|ENST00000683653.1|ERBB3 |protein_coding|[RNA_Seq_supported_only] |false |
|chr12|56083313|56103486|+ |ENSG00000065361.17|ENST00000683164.1|ERBB3 |protein_coding|[RNA_Seq_supported_partial, basic] |false |
|chr12|56083319|56102336|+ |ENSG00000065361.17|ENST00000415288.6|ERBB3 |protein_coding|[basic] |false |
|chr12|56083323|56103486|+ |ENSG00000065361.17|ENST00000683018.1|ERBB3 |protein_coding|[upstream_ATG, RNA_Seq_supported_only, basic] |false |
|chr12|56086519|56088064|+ |ENSG00000065361.17|ENST00000549472.1|ERBB3 |protein_coding|null |false |
|chr12|56087824|56088868|+ |ENSG00000065361.17|ENST00000546748.1|ERBB3 |protein_coding|null |false |
|chr12|56088629|56088996|+ |ENSG00000065361.17|ENST00000551176.1|ERBB3 |protein_coding|null |false |
|chr12|56093483|56096063|+ |ENSG00000065361.17|ENST00000549205.1|ERBB3 |protein_coding|null |false |
|chr12|56093514|56093992|+ |ENSG00000065361.17|ENST00000550828.1|ERBB3 |protein_coding|[mRNA_start_NF, cds_start_NF] |false |
|chr12|56093832|56102834|+ |ENSG00000065361.17|ENST00000550070.6|ERBB3 |protein_coding|[mRNA_start_NF, cds_start_NF, dotter_confirmed] |false |
|chr12|56094522|56095527|+ |ENSG00000065361.17|ENST00000549644.1|ERBB3 |protein_coding|null |false |
|chr12|56095991|56102806|+ |ENSG00000065361.17|ENST00000553131.5|ERBB3 |protein_coding|null |false |
|chr12|56096205|56102834|+ |ENSG00000065361.17|ENST00000684766.1|ERBB3 |protein_coding|[RNA_Seq_supported_only] |false |
|chr12|56097091|56102804|+ |ENSG00000065361.17|ENST00000683142.1|ERBB3 |protein_coding|[RNA_Seq_supported_only] |false |
|chr12|56097134|56098980|+ |ENSG00000065361.17|ENST00000548709.1|ERBB3 |protein_coding|null |false |
|chr12|56097726|56102804|+ |ENSG00000065361.17|ENST00000682512.1|ERBB3 |protein_coding|[RNA_Seq_supported_only] |false |
|chr12|56098251|56103499|+ |ENSG00000065361.17|ENST00000549832.1|ERBB3 |protein_coding|[basic] |false |
|chr12|56099139|56102802|+ |ENSG00000065361.17|ENST00000682873.1|ERBB3 |protein_coding|[RNA_Seq_supported_only] |false |
|chr12|56099866|56100300|+ |ENSG00000065361.17|ENST00000552691.1|ERBB3 |protein_coding|null |false |
+-----+--------+--------+------+------------------+-----------------+---------+--------------+-----------------------------------------------------------------+-----------+
Important to note, not all transcripts has tag annotation. When such annotation is missing, I think it is safe to assume that transcript cannot be a canonical transcript.
For the record, I have checked the latest Ensembl version (Ensembl 107) and this annotation is still not available under transcripts
.
Thanks @ireneisdoomed: I'm glad I was sitting down for that. For the time being we have the logic in place to get the information for elsewhere. Once it's available from Ensembl we'll use their data.
Still within the timeline they provided. We'll see in their next release
In the latest release of Ensembl (108) we can find a LUT with the canonical transcripts: https://ftp.ensembl.org/pub/release-108/tsv/homo_sapiens/Homo_sapiens.GRCh38.108.canonical.tsv.gz
❯ head Homo_sapiens.GRCh38.108.canonical.tsv
ENSG00000210049.1 ENST00000387314.1 Ensembl Canonical
ENSG00000211459.2 ENST00000389680.2 Ensembl Canonical
ENSG00000210077.1 ENST00000387342.1 Ensembl Canonical
ENSG00000210082.2 ENST00000387347.2 Ensembl Canonical
ENSG00000209082.1 ENST00000386347.1 Ensembl Canonical
ENSG00000198888.2 ENST00000361390.2 Ensembl Canonical
ENSG00000210100.1 ENST00000387365.1 Ensembl Canonical
ENSG00000210107.1 ENST00000387372.1 Ensembl Canonical
ENSG00000210112.1 ENST00000387377.1 Ensembl Canonical
ENSG00000198763.3 ENST00000361453.3 Ensembl Canonical
❯ wc -l Homo_sapiens.GRCh38.108.canonical.tsv
88410 Homo_sapiens.GRCh38.108.canonical.tsv
We would like to add to the target object 2 new fields:
A canonicalTranscript
StringType field containing the ENST
id without the versioning appendix (.X
). E.g. ENST00000387314
. This field should not contain nulls
A canonicalExons
column that mimic the original exons schema from Ensembl but only contains the exons in the canonical transcript. This field can contain nulls.
|-- canonicalExons: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- end: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- seq_region_name: string (nullable = true)
| | |-- start: string (nullable = true)
| | |-- strand: string (nullable = true)
We don't need the flattened version of the exons in the target object.
Adding up to what @d0choa commented above, we want to extract an additional column tss
that will be based on the canonical transcript coordinates made available in Ensembl's target json file.
The previous proposal was to add 2 new fields: canonicalTranscript
and canonicalExons
.
Since we are interested in new information, we now want to encapsulate all transcript annotation under the same struct so that the schema is as follows:
target
|-- id: string (nullable = true)
... other fields
|-- canonicalTranscript: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- start: long (nullable = true)
| |-- end: long (nullable = true)
| |-- exons: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- end: string (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- seq_region_name: string (nullable = true)
| | | |-- start: string (nullable = true)
| | | |-- strand: string (nullable = true)
| |-- tss: long (nullable = true)
The way to get to this is by:
MANE Plus Clinical
is a small set of additional transcripts to also account for clinically relevant variants that fall outside the MANE transcript. If we keep this source, we will have 2 canonical transcripts for ~50 genes. Do we want to filter it out? @DSuveges @d0choa transcripts
array to only include annotation from the canonical transcript. In the end you always end up with an array of size 1 that you can explode.transcript
struct to only include the fields we are interested in. We are not doing any transformation here other than selecting and adding the tss to the struct.This is a prototype of how I would do it in Spark. I hope it helps.
schema = t.StructType(
[
t.StructField("id", t.StringType(), True),
t.StructField("canonicalTranscriptId", t.StringType(), True),
t.StructField("source", t.StringType(), True),
]
)
canonical_lut = (
spark.read.csv("gs://ot-team/irene/human_canonical.tsv", sep="\t", schema=schema)
.repartition(20)
.select(
f.split("id", "\.")[0].alias("id"),
f.split("canonicalTranscriptId", "\.")[0].alias("canonicalTranscriptId"),
"source",
)
.filter(f.col("source") != "MANE Plus Clinical")
)
targets = (
spark.read.json("gs://open-targets-data-releases/22.09/input/target-inputs/ensembl/homo_sapiens.jsonl")
.select("id", "strand", "transcripts")
.repartition(20)
)
targets_with_canonical = (
targets.join(canonical_lut, on="id", how="left")
.select(
"*",
# Filter the transcripts array to keep only the canonical transcript
f.explode(f.expr("filter(transcripts, array -> (array.id == canonicalTranscriptId))")).alias(
"transcript_transformed"
),
)
.withColumn(
"tss",
f.when(f.col("strand") == 1, f.col("transcript_transformed.start")).otherwise(
f.col("transcript_transformed.end")
),
)
.select(
"id",
"strand",
f.struct(
"transcript_transformed.id",
"transcript_transformed.start",
"transcript_transformed.end",
"transcript_transformed.exons",
"tss",
).alias("canonicalTranscript"),
)
.distinct()
)
targets_with_canonical.select("id", "canonicalTranscript.*").show(5)
+---------------+---------------+---------+---------+--------------------+---------+
| id| id| start| end| exons| tss|
+---------------+---------------+---------+---------+--------------------+---------+
|ENSG00000257231|ENST00000549314|113789542|113789807|[{113789807, ENSE...|113789807|
|ENSG00000211900|ENST00000390560|105863198|105863258|[{105863258, ENSE...|105863258|
|ENSG00000200681|ENST00000363811| 18306973| 18307079|[{18307079, ENSE0...| 18307079|
|ENSG00000233885|ENST00000425008|183806457|183809515|[{183808112, ENSE...|183809515|
|ENSG00000266279|ENST00000585275| 8337697| 8338758|[{8338758, ENSE00...| 8337697|
+---------------+---------------+---------+---------+--------------------+---------+
I would keep it simple and store only one canonical per gene.
The rationale is that I randomly browsed some of the genes that would have more than one canonical transcript and the clinical information that I could extract mostly points to rare diseases, which is not our main use case for the canonical transcripts at the moment.
I agree. In that case, the logic explained above is what we want to integrate.
Just a note. I've checked the overlap between Ensembl's LUT and GENCODE's. The overlap is practically exact, however, there are some discrepancies (that might be due to differences in the version of the data):
All genes are covered by the Ensembl's LUT. A version of the latest target dataset (Ensembl 108) with these manual changes is available at gs://genetics_etl_python_playground/input/v2g_input/targets_correct_tss
completed
As a developer I want to have a unified gene model for the Genetics Portal and Platform.
Background
The gene index used by the Genetics Portal and Platform is very similar, but not quite the same. It makes sense to unify these models to avoid duplication of effort to generate, duplication of code, accumulation of discrepancies.
Tasks
Acceptance tests