Retrieving genes and transcript data by genome location

nevace commented 4 years ago

I’m looking at using the VariantValidator API to retrieve gene data to be consumed by an external tool or service (e.g. genome browser software). The tool I have in mind is JBrowse, but the data should be generic enough to fulfil the requirements of many use cases.

I’d like to create an endpoint that receives the genome build, chromosome ID and start/end coordinates of a genomic region, and responds with a list of the genes located there, including genes that only partially overlap that region.
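To make the "including genes that aren't fully inside" requirement concrete, here is a minimal sketch of the overlap test. It assumes 1-based inclusive coordinates on a single chromosome; the function names and the `start`/`end` keys are hypothetical and not part of VariantValidator.

```python
def overlaps(region_start, region_end, gene_start, gene_end):
    """Return True if the gene shares at least one base with the region.

    Assumes 1-based, inclusive coordinates on the same chromosome.
    """
    return gene_start <= region_end and gene_end >= region_start


def genes_in_region(genes, region_start, region_end):
    """Keep genes that overlap the region, whether fully or only partially."""
    return [
        gene for gene in genes
        if overlaps(region_start, region_end, gene["start"], gene["end"])
    ]
```

A gene spanning 100-200 would be returned for the region 150-300 even though it starts outside it, while a gene at 500-600 would not.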

I imagine the endpoint looking something like this as an example:

GET: /tools/genesbylocation/{genome_build}/{chromosome_id}/{genome_location}

Example:

GET: /tools/genesbylocation/hg19/chr17/48100000-48200000

or

GET: /tools/genesbylocation/{genome_build}/{chromosome_id}/{genome_location_start}/{genome_location_end}

Example:

GET: /tools/genesbylocation/hg19/chr17/48100000/48200000
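Either URL form could be normalised into the same set of parameters before any lookup happens. The following is only a sketch of that step; `parse_location` and the returned keys are hypothetical names, and no particular web framework is assumed.

```python
import re


def parse_location(genome_build, chromosome_id, *location):
    """Normalise the path segments of either proposed URL form.

    Accepts a single "start-end" segment or separate start and end segments.
    """
    if len(location) == 1:
        match = re.fullmatch(r"(\d+)-(\d+)", location[0])
        if match is None:
            raise ValueError("expected a location like 48100000-48200000")
        start, end = int(match.group(1)), int(match.group(2))
    elif len(location) == 2:
        # int() raises ValueError for non-numeric segments
        start, end = int(location[0]), int(location[1])
    else:
        raise ValueError("expected one or two location segments")
    if start > end:
        raise ValueError("start must not be greater than end")
    return {
        "genome_build": genome_build,
        "chromosome_id": chromosome_id,
        "start": start,
        "end": end,
    }
```

Both `parse_location("hg19", "chr17", "48100000-48200000")` and `parse_location("hg19", "chr17", "48100000", "48200000")` produce the same dictionary, so the rest of the handler would not need to care which route was used.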

I’m not an expert in this area so I’m not sure what data would be the most helpful to provide for each gene. As a starting point I think the properties below could be useful, but would welcome input from anyone that thinks they could add to these.

[{
  "name": "protein phosphatase 1 regulatory subunit 9B",
  "symbol": "PPP1R9B",
  "genomic_position": {
    "chr": "17",
    "end": 50150677,
    "start": 50133737,
    "strand": -1
  },
  "exons": [{
    "cdsend": 50150513,
    "cdsstart": 50135330,
    "chr": "17",
    "position": [
      [50133736, 50135384],
      [50135552, 50135649],
      [50135967, 50136197],
      [50139262, 50139316],
      [50139428, 50139581],
      [50140092, 50140228],
      [50141268, 50141373],
      [50143597, 50143718],
      [50145112, 50145245],
      [50149142, 50150677]
    ],
    "strand": -1,
    "transcript": "NM_032595",
    "txend": 50150677,
    "txstart": 50133736
  }]
}]

Thanks

Peter-J-Freeman commented 4 years ago

hi @nevace

I think it makes sense to initially build this functionality into the VariantValidator Python library so that the API can access it, but it will also be accessible to command line tools.

The UTA database and hgvs python libraries can perform these tasks.

A couple of corrections to your structure

GET: /tools/genesbylocation/{genome_build}/{chromosome_id}/{genome_location}

Example:

GET: /tools/genesbylocation/hg19/chr17/48100000-48200000

or

GET: /tools/genesbylocation/{genome_build}/{chromosome_id}/{genome_location_start}/{genome_location_end}

Example:

GET: /tools/genesbylocation/hg19/chr17/48100000/48200000

This example is fine, but I would also like to see the tool accept a chromosome accession as well as chr17.

The seq_data.py module contains the necessary transformation dictionaries.
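As a rough illustration of accepting either chr17 or an accession, the lookup could work along these lines. This is a sketch only: the dictionaries below are a hand-written subset covering chromosome 17, and `resolve_chromosome` is a hypothetical helper, not the actual contents or API of seq_data.py.

```python
# Illustrative subset only; a real table would cover every chromosome.
CHR_TO_ACCESSION = {
    "GRCh37": {"17": "NC_000017.10"},
    "GRCh38": {"17": "NC_000017.11"},
}

# UCSC-style build names mapped to the matching GRC assembly names.
BUILD_ALIASES = {"hg19": "GRCh37", "hg38": "GRCh38"}


def resolve_chromosome(build, chromosome):
    """Return a RefSeq chromosome accession for 'chr17', '17' or 'NC_...'."""
    build = BUILD_ALIASES.get(build.lower(), build)
    if chromosome.startswith("NC_"):
        return chromosome  # already an accession, pass it through
    name = chromosome[3:] if chromosome.lower().startswith("chr") else chromosome
    try:
        return CHR_TO_ACCESSION[build][name]
    except KeyError:
        raise ValueError(f"unknown chromosome {chromosome!r} for build {build!r}")
```

Passing accessions straight through means the same endpoint can serve callers that think in terms of chr17 and callers that already hold an NC_ identifier.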

[{
  "name": "protein phosphatase 1 regulatory subunit 9B",
  "symbol": "PPP1R9B",
  "hgnc": "HGNC:.....",

look in get_stable_gene_id_info

  "genomic_position": {
    "chr": "17",
    "build": "GRCh37",
    "accession": "NC_......",
    "end": 50150677,
    "start": 50133737
  },
  "exons": [{
    "cdsend": 50150513,
    "cdsstart": 50135330,
    "chr": "NC_....",
    "position": [
      [50133736, 50135384, Exon_number],
      [50135552, 50135649, Exon_number],
      [50135967, 50136197, Exon_number],
      [50139262, 50139316, Exon_number],
      [50139428, 50139581, Exon_number],
      [50140092, 50140228, Exon_number],
      [50141268, 50141373, Exon_number],
      [50143597, 50143718, Exon_number],
      [50145112, 50145245, Exon_number],
      [50149142, 50150677, Exon_number]
    ],
    "strand": -1,
    "transcript": "NM_032595.<VERSION>",
    "txend": 50150677,
    "txstart": 50133736,
    "sequence": "......",
    "complement": "....." 
  }]
}]
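The Exon_number placeholder in the position list could be filled in along the following lines. This sketch assumes exons are numbered 5' to 3' along the transcript, so that on the minus strand exon 1 is the exon with the highest genomic coordinates; `number_exons` is a hypothetical helper.

```python
def number_exons(positions, strand):
    """Append an exon number to each [start, end] pair.

    The output stays in ascending genomic order. On the plus strand the
    numbers ascend with the coordinates; on the minus strand they descend,
    because transcription runs from high to low coordinates.
    """
    ordered = sorted(positions)
    total = len(ordered)
    numbered = []
    for index, (start, end) in enumerate(ordered):
        exon_number = index + 1 if strand == 1 else total - index
        numbered.append([start, end, exon_number])
    return numbered
```

For a minus-strand transcript such as the NM_032595 example, the first entry in the list (lowest coordinates) would therefore carry the highest exon number.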

I will have a think about the workflow and get back to you. For now, clone the repo and check out the develop_v3 branch, then create your own working branch from develop_v3.

Peter-J-Freeman commented 4 years ago

A few comments on workflow as I get time.

Step 1 is to convert chr17 into a valid Chromosome ID. See above.

Step 2 is to get all the transcripts that map to the region. VV connects to UTA.

Look at the VV user manual. You create the VV object

validator = VariantValidator.Validator()

which has a connection to UTA

The calls to UTA include a get_tx_for_region function.

There is also a get_tx_for_gene function.

At this stage I think we need to discuss the outputs and API search options here. For example, I think we need a search by gene API call that produces this data set as well as a search by region. I also think that the get_tx_for_region function will only return transcripts that are completely within the genomic region, so perhaps we add several kb to the coordinates before the UTA search.
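The padding idea could be sketched as follows. Several things here are assumptions rather than confirmed behaviour: that the data provider exposes get_tx_for_region(alt_ac, alt_aln_method, start_i, end_i), that each returned row can be indexed by "start_i" and "end_i", that "splign" is the alignment method wanted, and that all coordinates are 0-based and half-open. The helper names and the 5000-base default are hypothetical.

```python
def pad_region(start, end, padding=5000):
    """Widen a region by `padding` bases on each side, never below zero."""
    return max(0, start - padding), end + padding


def transcripts_overlapping(hdp, accession, start, end,
                            padding=5000, aln_method="splign"):
    """Query a padded region, then keep transcripts touching the original one.

    `hdp` is any object with a get_tx_for_region method (assumed signature).
    """
    padded_start, padded_end = pad_region(start, end, padding)
    rows = hdp.get_tx_for_region(accession, aln_method, padded_start, padded_end)
    # Half-open interval overlap against the unpadded region.
    return [row for row in rows if row["start_i"] < end and row["end_i"] > start]
```

Filtering back against the unpadded coordinates keeps the padding from pulling in transcripts that lie entirely outside the region the caller asked for. A transcript that extends further than the padding on either side could still be missed, so the padding size would need discussing.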

nevace commented 4 years ago

Thanks for the info, all makes sense.

So the "search by gene" API would provide the response above as a single object (given the gene as a parameter) and the "search by region" API would respond with a list of the gene objects but based on region parameters?

I'm assuming the "search by gene" API would accept the same gene_symbol parameter as used here:

/tools/gene2transcripts/{gene_symbol}

Peter-J-Freeman commented 4 years ago

That's right.

Basically the returned object will be the same, but the entry point will differ. Remember, a genomic span could have more than one gene in it.
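Because one span can contain several genes, the transcripts found for a region would need grouping by gene before the response is built. A minimal sketch, assuming each transcript record carries a "symbol" key (a hypothetical field name):

```python
from collections import defaultdict


def group_by_gene(transcripts):
    """Group transcript records into one entry per gene symbol.

    Genes appear in the order their first transcript was encountered.
    """
    grouped = defaultdict(list)
    for transcript in transcripts:
        grouped[transcript["symbol"]].append(transcript)
    return [
        {"symbol": symbol, "transcripts": records}
        for symbol, records in grouped.items()
    ]
```

The search by gene entry point would then return a single element of this list, while the search by region entry point would return the whole list.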

Get VV set up and have a play. Let me know if you have any issues. We can then Skype to plan the next stages and comment here.

openvar / variantValidator

Retrieving genes and transcript data by genome location #108