Closed hammer closed 2 years ago
The significance of this file is that it lists genes that occur on the non reference parts of the human genome assembly, so that we can exclude them. For instance there are up to 8 versions of the assembly of the human HLA/MHC region only one of which is the reference version. In each of those assemblies each copy of a gene gets a different ENSG ID. So for instance for the Complement Factor B gene there are the following versions of the gene
CFB ENSG00000243649 GRCh38 6 t CFB ENSG00000241534 GRCh38 CHR_HSCHR6_MHC_DBB_CTG1 f CFB ENSG00000242335 GRCh38 CHR_HSCHR6_MHC_SSTO_CTG1 f CFB ENSG00000243570 GRCh38 CHR_HSCHR6_MHC_MCF_CTG1 f CFB ENSG00000241253 GRCh38 CHR_HSCHR6_MHC_QBL_CTG1 f CFB ENSG00000204359 GRCh38 CHR_HSCHR6_MHC_MANN_CTG1 f CFB ENSG00000239754 GRCh38 CHR_HSCHR6_MHC_COX_CTG1 f
We want to associate evidence only with one of the ENSG IDs, the reference version ENSG00000243649 otherwise we would have evidence split across multiple target pages. This file provides the information to deal with this.
Not sure how it is generated but @mkarmona will know more.
If you are thinking hard about this you might say that these are actually different genes because they come from alternate haplotypes, and indeed for some of the genes they only occur on certain versions of the sequence, or vary between sequence. However for mos tof the cases the evidence we receive doesn't know about that complication, and we simplify to resolve the issue of split evidence.
Resolving the issue of variant versions of genes in the platform would probably be better solved by altering the model
Thanks @iandunham! Our lab has previously worked on HLA typing (https://github.com/hammerlab/prohlatype) so we are quite familiar with these highly polymorphic regions of the reference.
@mkarmona I'm looking forward to learning how y'all generate this file. It seems to me that to be consistent we should probably thread it through the pipeline by storing on Google Cloud and pointing to the location in mrtarget.data.yml
. It would be nice to have the script that generates the file up on GitHub similar to https://github.com/opentargets/platform-input-support and https://github.com/opentargets/evidence_datasource_parsers.
Although the motivation for having these extra sequences of HLA was to understand the genomic structure of the HLA, they only represent a fraction of the HLA types (as you know). However there are also some other regions of the genome that have these non reference assemblies, so really it's a partial attempt to deal with some of the genome variation.
For historical purposes see this paper :-) https://www.pnas.org/content/pnas/84/20/7237.full.pdf
For historical purposes see this paper :-) https://www.pnas.org/content/pnas/84/20/7237.full.pdf
Ha! We've both got some battle scars from HLA.
We spent a lot of time understanding approaches to working with a graph rather than linear reference when working on prohlatype
so I can certainly appreciate the complexities that downstream tooling like Ensembl, UniProt, and ultimately Open Targets will soon have to handle.
@apierleoni perhaps you can shed some light on the origin of this file? I'm sure it's a standard download from the GRC, I'm just curious to know precisely where it's from.
@d0choa why was this issue closed? The question has still not been answered, as far as I can tell.
Sorry, my fault. I think we all understand the biological background behind this file. Unfortunately, we don't have the code that generated genes_with_non_reference_ensembl_ids.tsv
in the first place. I would assume, it was retrieved from the Ensembl API / MySQL / Biomart directly. If you visit Ensembl in any of the problematic genes (e.g. ABCB11) you would get a "View alleles of this gene on alternative sequences" message. @AsierGonzalez will explore if we can get some code to do it automatically. It will be low priority, though
Thanks! Not high priority for us either, I just like fully reproducible data pipelines, as you know, so would like to know how to reproduce these file contents from a public data repository if possible.
A first quick exploration of the data points to the Ensembl database / API as the best place to get this information:
mysql -hensembldb.ensembl.org -uanonymous homo_sapiens_core_99_38 -e "SELECT stable_id, attrib, description FROM alt_allele INNER JOIN gene USING (gene_id) INNER JOIN alt_allele_attrib USING (alt_allele_id)"
Thanks everyone for this discussion, very helpful for me as I'm just starting to explore ensembl genes on alternative sequences.
It sounds like Open Targets uses the following approach to deal with non-reference genes:
ENSG00000196101
)Is that correct?
Now regarding genes_with_non_reference_ensembl_ids.tsv
(commit-versioned link), here is the head of this table:
gene_symbol | ensembl_gene_id | assembly | chromosome | is_reference |
---|---|---|---|---|
ABCB11 | ENSG00000073734 | GRCh38 | 2 | t |
ABCB11 | ENSG00000276582 | GRCh38 | CHR_HSCHR2_1_CTG7_2 | f |
AC130343.1 | ENSG00000274119 | GRCh38 | CHR_HSCHR17_1_CTG2 | f |
AC203639.1 | ENSG00000276973 | GRCh38 | CHR_HSCHR13_1_CTG3 | f |
Thanks @d0choa for the SQL query. Running it in Python (with minor edits) here's what I get:
query = '''
SELECT stable_id, alt_allele_group_id, attrib, description
FROM alt_allele
INNER JOIN gene
USING (gene_id)
INNER JOIN alt_allele_attrib
USING (alt_allele_id)
LIMIT 5
'''
import pandas as pd
pd.read_sql(
sql=query,
con="mysql://anonymous@ensembldb.ensembl.org/homo_sapiens_core_102_38",
).convert_dtypes()
stable_id | alt_allele_group_id | attrib | description |
---|---|---|---|
ENSG00000282572 | 44429 | AUTOMATICALLY_ASSIGNED | novel transcript similar to family member with... |
ENSG00000281951 | 44429 | AUTOMATICALLY_ASSIGNED | novel transcript similar to family member with... |
ENSG00000273644 | 44430 | AUTOMATICALLY_ASSIGNED | von Hippel-Lindau binding protein 1 (VBP1) pse... |
ENSG00000282333 | 44430 | AUTOMATICALLY_ASSIGNED | von Hippel-Lindau binding protein 1 (VBP1) pse... |
ENSG00000232325 | 44431 | AUTOMATICALLY_ASSIGNED | novel transcript |
I'm new to the ensembl database schema (diagram). So I've got a couple questions about where the following data comes from:
gene_symbol
assembly
and chromosome
is_reference
: is this stored in the ensembl database somewhere, is it computed from attrib
somehow, or by looking if chromosome
is in the values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT?I think I figured out how to get chromosome information
SELECT
stable_id,
biotype,
version,
alt_allele_group_id,
attrib,
seq_region.name AS chromosome,
seq_region.name REGEXP '^([0-9]{1,2}|X|Y|MT)$' AS is_reference
# seq_region.name IN ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','X','Y','MT') as is_reference
# SELECT *
FROM gene
JOIN seq_region ON seq_region.seq_region_id=gene.seq_region_id
JOIN alt_allele ON alt_allele.gene_id=gene.gene_id
JOIN alt_allele_attrib ON alt_allele_attrib.alt_allele_id=alt_allele.alt_allele_id
LIMIT 5;
stable_id | biotype | version | alt_allele_group_id | attrib | chromosome | is_reference |
---|---|---|---|---|---|---|
ENSG00000282572 | lncRNA | 2 | 44429 | AUTOMATICALLY_ASSIGNED | 7 | 1 |
ENSG00000281951 | lncRNA | 1 | 44429 | AUTOMATICALLY_ASSIGNED | CHR_HSCHR7_2_CTG1 | 0 |
ENSG00000273644 | processed_pseudogene | 1 | 44430 | AUTOMATICALLY_ASSIGNED | 7 | 1 |
ENSG00000282333 | processed_pseudogene | 1 | 44430 | AUTOMATICALLY_ASSIGNED | CHR_HSCHR7_2_CTG1 | 0 |
As far as I can tell, the file doesn't actually have any use in the pipeline.
The file is used in two places:
mrtarget/common/EvidenceString.py:520: for line in file(file_or_resource('genes_with_non_reference_ensembl_ids.tsv')):
mrtarget/common/LookupHelpers.py:60: for line in open(file_or_resource('genes_with_non_reference_ensembl_ids.tsv')):
In EvidenceString.py
the method which reads the file is never called.
LookupHelpers
is used by down-stream phases and is effectively an Elasticsearch
cache. The method in LookupHelpers
and creates a field lookup.non_reference_genes
.
The only place lookup.non_reference_genes
is used is in Evidence.py
, and the only
outcome of that is to log a warning if one of the Ensembl Ids is encountered. Having looked through the last logging outputs, that part of the code is never hit.
Further, the only two fields used from the file are ensembl_gene_id
and is_reference
.
The purpose of this the file in question is replaced by the logic in Ensembl.py
with the method _clean_non_reference_gene
. This is also largely redundant, as the input file coming from the Platform Input Support program has already removed all non-reference genes.
One anomaly is that at least one gene (ENSG00000273716) which should have been removed (according to the logic of the tsv file) is present in the final index. ENSG00000273716 is a pseudogene and probably should be excluded as there is no useful information on it.
A following ticket will discuss removing that gene based on a (configurable?) biotype filter.
Reopening again and assigning to me, until we make a decision on what genes should be included in the gene/target index based on this and other opened issues.
Hardcoded file is no longer used.
Documentation for target (gene) inclusion in the platform is here: https://platform-docs.opentargets.org/target
Where does https://github.com/opentargets/data_pipeline/blob/master/mrtarget/resources/genes_with_non_reference_ensembl_ids.tsv come from, and what is its significance?