opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0
12 stars 2 forks source link

Origin of genes_with_non_reference_ensembl_ids.tsv? #702

Closed hammer closed 2 years ago

hammer commented 5 years ago

Where does https://github.com/opentargets/data_pipeline/blob/master/mrtarget/resources/genes_with_non_reference_ensembl_ids.tsv come from, and what is its significance?

iandunham commented 5 years ago

The significance of this file is that it lists genes that occur on the non reference parts of the human genome assembly, so that we can exclude them. For instance there are up to 8 versions of the assembly of the human HLA/MHC region only one of which is the reference version. In each of those assemblies each copy of a gene gets a different ENSG ID. So for instance for the Complement Factor B gene there are the following versions of the gene

CFB ENSG00000243649 GRCh38 6 t CFB ENSG00000241534 GRCh38 CHR_HSCHR6_MHC_DBB_CTG1 f CFB ENSG00000242335 GRCh38 CHR_HSCHR6_MHC_SSTO_CTG1 f CFB ENSG00000243570 GRCh38 CHR_HSCHR6_MHC_MCF_CTG1 f CFB ENSG00000241253 GRCh38 CHR_HSCHR6_MHC_QBL_CTG1 f CFB ENSG00000204359 GRCh38 CHR_HSCHR6_MHC_MANN_CTG1 f CFB ENSG00000239754 GRCh38 CHR_HSCHR6_MHC_COX_CTG1 f

We want to associate evidence only with one of the ENSG IDs, the reference version ENSG00000243649 otherwise we would have evidence split across multiple target pages. This file provides the information to deal with this.

Not sure how it is generated but @mkarmona will know more.

If you are thinking hard about this you might say that these are actually different genes because they come from alternate haplotypes, and indeed for some of the genes they only occur on certain versions of the sequence, or vary between sequence. However for mos tof the cases the evidence we receive doesn't know about that complication, and we simplify to resolve the issue of split evidence.

Resolving the issue of variant versions of genes in the platform would probably be better solved by altering the model

hammer commented 5 years ago

Thanks @iandunham! Our lab has previously worked on HLA typing (https://github.com/hammerlab/prohlatype) so we are quite familiar with these highly polymorphic regions of the reference.

@mkarmona I'm looking forward to learning how y'all generate this file. It seems to me that to be consistent we should probably thread it through the pipeline by storing on Google Cloud and pointing to the location in mrtarget.data.yml. It would be nice to have the script that generates the file up on GitHub similar to https://github.com/opentargets/platform-input-support and https://github.com/opentargets/evidence_datasource_parsers.

iandunham commented 5 years ago

Although the motivation for having these extra sequences of HLA was to understand the genomic structure of the HLA, they only represent a fraction of the HLA types (as you know). However there are also some other regions of the genome that have these non reference assemblies, so really it's a partial attempt to deal with some of the genome variation.

For historical purposes see this paper :-) https://www.pnas.org/content/pnas/84/20/7237.full.pdf

hammer commented 5 years ago

For historical purposes see this paper :-) https://www.pnas.org/content/pnas/84/20/7237.full.pdf

Ha! We've both got some battle scars from HLA.

We spent a lot of time understanding approaches to working with a graph rather than linear reference when working on prohlatype so I can certainly appreciate the complexities that downstream tooling like Ensembl, UniProt, and ultimately Open Targets will soon have to handle.

hammer commented 5 years ago

@apierleoni perhaps you can shed some light on the origin of this file? I'm sure it's a standard download from the GRC, I'm just curious to know precisely where it's from.

hammer commented 4 years ago

@d0choa why was this issue closed? The question has still not been answered, as far as I can tell.

d0choa commented 4 years ago

Sorry, my fault. I think we all understand the biological background behind this file. Unfortunately, we don't have the code that generated genes_with_non_reference_ensembl_ids.tsv in the first place. I would assume, it was retrieved from the Ensembl API / MySQL / Biomart directly. If you visit Ensembl in any of the problematic genes (e.g. ABCB11) you would get a "View alleles of this gene on alternative sequences" message. @AsierGonzalez will explore if we can get some code to do it automatically. It will be low priority, though

hammer commented 4 years ago

Thanks! Not high priority for us either, I just like fully reproducible data pipelines, as you know, so would like to know how to reproduce these file contents from a public data repository if possible.

d0choa commented 4 years ago

A first quick exploration of the data points to the Ensembl database / API as the best place to get this information:

mysql -hensembldb.ensembl.org -uanonymous homo_sapiens_core_99_38 -e "SELECT stable_id, attrib, description FROM alt_allele INNER JOIN gene USING (gene_id) INNER JOIN alt_allele_attrib USING (alt_allele_id)"
dhimmel commented 3 years ago

Thanks everyone for this discussion, very helpful for me as I'm just starting to explore ensembl genes on alternative sequences.

It sounds like Open Targets uses the following approach to deal with non-reference genes:

Is that correct?

Now regarding genes_with_non_reference_ensembl_ids.tsv (commit-versioned link), here is the head of this table:

gene_symbol ensembl_gene_id assembly chromosome is_reference
ABCB11 ENSG00000073734 GRCh38 2 t
ABCB11 ENSG00000276582 GRCh38 CHR_HSCHR2_1_CTG7_2 f
AC130343.1 ENSG00000274119 GRCh38 CHR_HSCHR17_1_CTG2 f
AC203639.1 ENSG00000276973 GRCh38 CHR_HSCHR13_1_CTG3 f

Thanks @d0choa for the SQL query. Running it in Python (with minor edits) here's what I get:

query = '''
SELECT stable_id, alt_allele_group_id, attrib, description
FROM alt_allele
INNER JOIN gene
USING (gene_id)
INNER JOIN alt_allele_attrib
USING (alt_allele_id)
LIMIT 5
'''
import pandas as pd
pd.read_sql(
    sql=query,
    con="mysql://anonymous@ensembldb.ensembl.org/homo_sapiens_core_102_38",
).convert_dtypes()
stable_id alt_allele_group_id attrib description
ENSG00000282572 44429 AUTOMATICALLY_ASSIGNED novel transcript similar to family member with...
ENSG00000281951 44429 AUTOMATICALLY_ASSIGNED novel transcript similar to family member with...
ENSG00000273644 44430 AUTOMATICALLY_ASSIGNED von Hippel-Lindau binding protein 1 (VBP1) pse...
ENSG00000282333 44430 AUTOMATICALLY_ASSIGNED von Hippel-Lindau binding protein 1 (VBP1) pse...
ENSG00000232325 44431 AUTOMATICALLY_ASSIGNED novel transcript

I'm new to the ensembl database schema (diagram). So I've got a couple questions about where the following data comes from:

other resources

dhimmel commented 3 years ago

I think I figured out how to get chromosome information

SELECT
  stable_id,
  biotype,
  version,
  alt_allele_group_id,
  attrib,
  seq_region.name AS chromosome,
  seq_region.name REGEXP '^([0-9]{1,2}|X|Y|MT)$' AS is_reference
  # seq_region.name IN ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','X','Y','MT') as is_reference
# SELECT *
FROM gene
JOIN seq_region ON seq_region.seq_region_id=gene.seq_region_id
JOIN alt_allele ON alt_allele.gene_id=gene.gene_id
JOIN alt_allele_attrib ON alt_allele_attrib.alt_allele_id=alt_allele.alt_allele_id
LIMIT 5;
stable_id biotype version alt_allele_group_id attrib chromosome is_reference
ENSG00000282572 lncRNA 2 44429 AUTOMATICALLY_ASSIGNED 7 1
ENSG00000281951 lncRNA 1 44429 AUTOMATICALLY_ASSIGNED CHR_HSCHR7_2_CTG1 0
ENSG00000273644 processed_pseudogene 1 44430 AUTOMATICALLY_ASSIGNED 7 1
ENSG00000282333 processed_pseudogene 1 44430 AUTOMATICALLY_ASSIGNED CHR_HSCHR7_2_CTG1 0
JarrodBaker commented 3 years ago

As far as I can tell, the file doesn't actually have any use in the pipeline.

The file is used in two places:

mrtarget/common/EvidenceString.py:520:        for line in file(file_or_resource('genes_with_non_reference_ensembl_ids.tsv')):
mrtarget/common/LookupHelpers.py:60:        for line in open(file_or_resource('genes_with_non_reference_ensembl_ids.tsv')):

In EvidenceString.py the method which reads the file is never called.

LookupHelpers is used by down-stream phases and is effectively an Elasticsearch cache. The method in LookupHelpers and creates a field lookup.non_reference_genes.

The only place lookup.non_reference_genes is used is in Evidence.py, and the only outcome of that is to log a warning if one of the Ensembl Ids is encountered. Having looked through the last logging outputs, that part of the code is never hit.

Further, the only two fields used from the file are ensembl_gene_id and is_reference.

JarrodBaker commented 3 years ago

The purpose of this the file in question is replaced by the logic in Ensembl.py with the method _clean_non_reference_gene. This is also largely redundant, as the input file coming from the Platform Input Support program has already removed all non-reference genes.

One anomaly is that at least one gene (ENSG00000273716) which should have been removed (according to the logic of the tsv file) is present in the final index. ENSG00000273716 is a pseudogene and probably should be excluded as there is no useful information on it.

A following ticket will discuss removing that gene based on a (configurable?) biotype filter.

d0choa commented 3 years ago

Reopening again and assigning to me, until we make a decision on what genes should be included in the gene/target index based on this and other opened issues.

d0choa commented 2 years ago

Hardcoded file is no longer used.

Documentation for target (gene) inclusion in the platform is here: https://platform-docs.opentargets.org/target

ETL Code: https://github.com/opentargets/platform-etl-backend/blob/f8f1f62cfb590081ac14b55a2ef147e92fc9da76/src/main/scala/io/opentargets/etl/backend/target/Ensembl.scala#L31