Implement VariantValidator TranscriptCoordinateService

monarch-initiative / genophenocorr

Genotype Phenotype Correlation

https://monarch-initiative.github.io/genophenocorr/stable

MIT License

Implement VariantValidator TranscriptCoordinateService #63

Closed ielis closed 6 months ago

ielis commented 9 months ago

Implement Ensembl TranscriptCoordinateService - a way to get transcript and exon coordinates from Ensembl API.

ielis commented 8 months ago

We should already have this functionality.

EDIT: No, we don't

pnrobinson commented 7 months ago

But don't we already have information about the exons etc. somewhere? I thought we wanted to import all of this data when the user is setting up the analysis rather than later on?

ielis commented 7 months ago

Right, we don't have this. We have data with respect to a variant, which includes the tx ID and exon number, but not much more that is useful.

I thought we wanted to import all of this data when the user is setting up the analysis rather than later on?

We can import the data during the functional annotation, or later, but in any case we will need the functionality described by the TranscriptCoordinateService interface. So, this ticket is to keep track of the progress on that front.

In other words, I'd first like to have this working and then find the best place to put it within the workflow.

pnrobinson commented 7 months ago

OK. We should probably ask the user to provide the name of the transcript they consider most important when they start the analysis.

Do you happen to know the best API command that does this? It is not entirely obvious; if not, I will look!

ielis commented 7 months ago

I think I found it. VariantValidator seems to provide what we need:

curl -X 'GET' \
  'https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts/NM_003172' \
  -H 'accept: application/json'
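
The same request can be issued from Python. This is a minimal sketch using only the standard library; the helper names `gene2transcripts_url` and `fetch_gene2transcripts` are made up for illustration, and the 30 second timeout is an arbitrary choice.

```python
import json
import urllib.parse
import urllib.request

# Same endpoint as in the curl command above.
VV_URL = 'https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts/%s'


def gene2transcripts_url(tx_id: str) -> str:
    """Build the gene2transcripts URL for a RefSeq transcript ID (hypothetical helper)."""
    # Percent-encode the ID so that unexpected characters cannot alter the path.
    return VV_URL % urllib.parse.quote(tx_id, safe='')


def fetch_gene2transcripts(tx_id: str, timeout: float = 30.0) -> dict:
    """Run the GET request and decode the JSON body (hypothetical helper)."""
    request = urllib.request.Request(
        gene2transcripts_url(tx_id),
        headers={'accept': 'application/json'},
        method='GET',
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_gene2transcripts('NM_003172')` should send the same GET request as the curl command; error handling and retries are left out on purpose.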

I went through these before landing on VariantValidator:

The Ensembl REST API could give us what we need, but it requires a stable Ensembl ID (e.g. ENSG00000012345), whereas we work in the RefSeq space.

The NCBI REST API can take a RefSeq tx ID, but it does not return exon coordinates:

TX_ID=NM_003172.4 # My favorite RefSeq tx id
curl -X GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene/accession/${TX_ID}?returned_content=COMPLETE&table_fields=gene-id&table_fields=gene-type&table_fields=description" \
-H "Accept: application/json"

genenames.org can give you details for a RefSeq ID but it does not provide exon coordinates:

curl -X GET "https://rest.genenames.org/fetch/refseq_accession/NM_003172" -H "Accept: application/json"

ielis commented 7 months ago

A template:


class VVTranscriptCoordinateService(TranscriptCoordinateService):

    def __init__(self, gb: GenomeBuild):
        """
        something like https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts/NM_000518.4
        for a ClinGen HBB transcript
        """
        self._gb = gb
        self._url = 'https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts/%s'

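    def fetch(self, tx: TranscriptInfoAware) -> TranscriptCoordinates:
        # TODO: create the query from the transcript ID and `self._url`
        query = ''

        # Placeholder contig accession; it should come from the response instead.
        contig = 'NC_000009.11'
        contig = self._gb.contig_by_name(contig)
        GenomicRegion()  # TODO: construct from the response
        return TranscriptCoordinates()  # TODO: construct from the response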
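
To fill in the template above, the decoded response has to be turned into exon regions. The sketch below shows one way that could look. It rests on assumptions: `Region` is a stand-in dataclass, not the genophenocorr `GenomicRegion`, `parse_exons` is a hypothetical helper, and the JSON field names reflect my reading of the gene2transcripts response and should be checked against the live API.

```python
import typing
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Stand-in for a genomic region; NOT the genophenocorr class."""
    contig: str
    start: int
    end: int
    strand: str


def parse_exons(payload: dict, tx_id: str, contig: str) -> typing.List[Region]:
    """Extract the exons of `tx_id` on `contig` from a decoded response.

    ASSUMPTION: the keys 'transcripts', 'reference', 'genomic_spans',
    'orientation', 'exon_structure', 'genomic_start' and 'genomic_end'
    are assumed field names; verify them against a real response.
    """
    for tx in payload.get('transcripts', []):
        if tx.get('reference') != tx_id:
            continue
        span = tx.get('genomic_spans', {}).get(contig)
        if span is None:
            raise ValueError(f'{tx_id} has no span on {contig}')
        # Assumed encoding: 1 for the forward strand, anything else for reverse.
        strand = '+' if span['orientation'] == 1 else '-'
        # Coordinates are copied as reported, with no coordinate system conversion.
        return [
            Region(contig, exon['genomic_start'], exon['genomic_end'], strand)
            for exon in span['exon_structure']
        ]
    raise ValueError(f'{tx_id} not found in the response')
```

The match on `reference` is exact, so a transcript ID without a version suffix would need extra handling before it can be looked up this way.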

ielis commented 7 months ago

The example usage for now:

from genophenocorr.model.genome import GRCh38

vv = VVTranscriptCoordinateService(GRCh38)

Then, TranscriptCoordinateService.fetch takes an instance of TranscriptInfoAware, which is a thing that has a gene_id (e.g. SURF1) and a tx_id (e.g. NM_003172.4 or NM_003172).

However, I'm thinking that it may be good to change the signature from:

def fetch(self, tx: TranscriptInfoAware) -> TranscriptCoordinates:

to:

def fetch(self, tx: typing.Union[str, TranscriptInfoAware]) -> TranscriptCoordinates:

so that we can take a simple str with tx_id as well.
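
A rough sketch of how the widened signature could normalize its argument before building the query. `HasTxId` is a stand-in protocol for TranscriptInfoAware as described in this thread (an object with a `tx_id`), and `resolve_tx_id` is a hypothetical helper, not part of the library.

```python
import typing


class HasTxId(typing.Protocol):
    """Stand-in for TranscriptInfoAware as described in this thread (hypothetical)."""
    tx_id: str


def resolve_tx_id(tx: typing.Union[str, HasTxId]) -> str:
    """Return the transcript ID from a plain string or from an object carrying one."""
    if isinstance(tx, str):
        return tx
    # Fall back to the attribute; anything without a string `tx_id` is rejected.
    tx_id = getattr(tx, 'tx_id', None)
    if isinstance(tx_id, str):
        return tx_id
    raise TypeError(f'Expected a str or an object with a `tx_id`, got {type(tx)}')
```

With a helper like this, `fetch` could call `resolve_tx_id(tx)` first and then proceed the same way for both kinds of input.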