Closed ielis closed 6 months ago
We should already have this functionality.
EDIT: No, we don't
But don't we already have information about the exons etc. somewhere? I thought we wanted to import all of this data when the user is setting up the analysis rather than later on?
No, we don't have this. We have data with respect to a variant, which includes the tx id and exon number, but nothing much more useful.
I thought we wanted to import all of this data when the user is setting up the analysis rather than later on?
We can import the data during the functional annotation, or later, but in any case we will need the functionality described by the TranscriptCoordinateService interface. So, this ticket is to keep track of the progress on that front.
In other words, I'd first like to have this working and then find the best place to put it within the workflow.
OK. We should probably ask the user to provide the name of the transcript they consider most important when they start the analysis.
Do you happen to know the best API command that does this? It's not entirely obvious; if not, I will look!
I think I found it: VariantValidator seems to provide what we need:
curl -X 'GET' \
'https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts/NM_003172' \
-H 'accept: application/json'
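For use from Python, the same request could be sketched roughly as below. This is only a minimal sketch using the standard library: the endpoint is the one from the curl command, while the helper names `build_gene2transcripts_url` and `fetch_gene2transcripts` are made up for illustration and are not part of genophenocorr or VariantValidator.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
VV_GENE2TRANSCRIPTS = (
    "https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts/%s"
)


def build_gene2transcripts_url(tx_id: str) -> str:
    """Return the gene2transcripts URL for a RefSeq transcript ID (hypothetical helper)."""
    tx_id = tx_id.strip()
    if not tx_id:
        raise ValueError("tx_id must not be empty")
    # Percent-encode the ID so that it is safe to place in the URL path.
    return VV_GENE2TRANSCRIPTS % urllib.parse.quote(tx_id, safe="")


def fetch_gene2transcripts(tx_id: str, timeout: float = 30.0):
    """Send the GET request and decode the JSON body (needs network access)."""
    request = urllib.request.Request(
        build_gene2transcripts_url(tx_id),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_gene2transcripts("NM_003172")` should return the decoded JSON, assuming the service is reachable; error handling and retries are left out on purpose.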
I went through these before landing on VariantValidator:
The Ensembl REST API could give us what we need, but it requires a stable Ensembl ID (e.g. ENSG00000012345), whereas we work in the RefSeq space.
The NCBI REST API accepts a RefSeq tx ID but it does not return exon coordinates:
TX_ID=NM_003172.4 # My favorite RefSeq tx id
curl -X GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene/accession/${TX_ID}?returned_content=COMPLETE&table_fields=gene-id&table_fields=gene-type&table_fields=description" \
-H "Accept: application/json"
genenames.org can give you details for a RefSeq ID but it does not provide exon coordinates:
curl -X GET "https://rest.genenames.org/fetch/refseq_accession/NM_003172" -H "Accept: application/json"
A template:
class VVTranscriptCoordinateService(TranscriptCoordinateService):

    def __init__(self, gb: GenomeBuild):
        """
        something like https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts/NM_000518.4
        for a ClinGen HBB transcript
        """
        self._gb = gb
        self._url = 'https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts/%s'

    def fetch(self, tx: TranscriptInfoAware) -> TranscriptCoordinates:
        # create query
        query = ''
        contig = 'NC_000009.11'
        contig = self._gb.contig_by_name(contig)
        GenomicRegion()
        return TranscriptCoordinates()
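One way the body of `fetch` might pull exon coordinates out of the decoded response is sketched below. The payload layout is an assumption that should be checked against a real gene2transcripts response: a `transcripts` list whose entries carry a `reference` accession and a `genomic_spans` mapping from contig accession to an `exon_structure` list. The `Exon` class, `parse_exons`, and the example payload are hypothetical, and the coordinates in the payload are made up. Conversion into the project's `GenomicRegion` and its coordinate system is not covered.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Exon:
    """Hypothetical holder for one exon; not the project's GenomicRegion."""
    number: int
    start: int
    end: int


def parse_exons(payload: Dict[str, Any], tx_id: str, contig: str) -> List[Exon]:
    """Extract the exons of `tx_id` on `contig`, sorted by exon number.

    ASSUMED layout (verify against a real response):
    payload["transcripts"][i]["reference"] is the transcript accession and
    payload["transcripts"][i]["genomic_spans"][contig]["exon_structure"] is a
    list of dicts with "exon_number", "genomic_start" and "genomic_end".
    """
    for tx in payload.get("transcripts", []):
        if tx.get("reference") != tx_id:
            continue
        span = tx.get("genomic_spans", {}).get(contig)
        if span is None:
            raise ValueError(f"{tx_id} has no span on {contig}")
        exons = [
            Exon(int(e["exon_number"]), int(e["genomic_start"]), int(e["genomic_end"]))
            for e in span["exon_structure"]
        ]
        return sorted(exons, key=lambda exon: exon.number)
    raise ValueError(f"{tx_id} not found in response")


# Made-up payload in the assumed layout; accessions and coordinates are NOT real.
EXAMPLE_PAYLOAD = {
    "transcripts": [
        {
            "reference": "NM_EXAMPLE.1",
            "genomic_spans": {
                "NC_EXAMPLE.1": {
                    "exon_structure": [
                        {"exon_number": 2, "genomic_start": 300, "genomic_end": 400},
                        {"exon_number": 1, "genomic_start": 100, "genomic_end": 200},
                    ]
                }
            },
        }
    ]
}

exons = parse_exons(EXAMPLE_PAYLOAD, "NM_EXAMPLE.1", "NC_EXAMPLE.1")
```

Matching on the exact versioned accession is a design choice of this sketch; whether unversioned IDs should match any version is left open.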
The example usage for now:
from genophenocorr.model.genome import GRCh38
vv = VVTranscriptCoordinateService(GRCh38)
Then, TranscriptCoordinateService takes an instance of TranscriptInfoAware, which is a thing that has gene_id (e.g. SURF) and tx_id (e.g. NM_003127.4 or NM_003127).
However, I'm thinking that it may be good to change the signature from:
def fetch(self, tx: TranscriptInfoAware) -> TranscriptCoordinates:
to
def fetch(self, tx: typing.Union[str, TranscriptInfoAware]) -> TranscriptCoordinates:
so that we can take a simple str with tx_id as well.
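A minimal sketch of how the widened signature could normalize its argument is shown below. The `TranscriptInfoAware` protocol here is only a stand-in for the project's class (assumed to expose a `tx_id` string), and `resolve_tx_id` and `_ExampleTx` are hypothetical names used for illustration.

```python
import typing
from dataclasses import dataclass


class TranscriptInfoAware(typing.Protocol):
    """Stand-in for the project's class; only `tx_id` is used here."""

    @property
    def tx_id(self) -> str: ...


def resolve_tx_id(tx: typing.Union[str, TranscriptInfoAware]) -> str:
    """Return the transcript ID whether `tx` is a plain str or has a `tx_id`."""
    if isinstance(tx, str):
        return tx
    tx_id = getattr(tx, "tx_id", None)
    if isinstance(tx_id, str):
        return tx_id
    raise ValueError(
        f"Expected a str or an object with a str tx_id, got {type(tx).__name__}"
    )


@dataclass(frozen=True)
class _ExampleTx:
    """Tiny example object with the two attributes mentioned in the thread."""
    gene_id: str
    tx_id: str


from_str = resolve_tx_id("NM_003127.4")
from_obj = resolve_tx_id(_ExampleTx(gene_id="SURF", tx_id="NM_003127.4"))
```

With a helper like this, `fetch` could call `resolve_tx_id(tx)` first and then build the query from the resulting string, so both call styles share one code path.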
Implement Ensembl TranscriptCoordinateService - a way to get transcript and exon coordinates from Ensembl API.