monarch-initiative / gpsea

A Python library for discovery of genotype-phenotype associations
https://monarch-initiative.github.io/gpsea/stable
MIT License

POLR1A: Unable to recognise gene symbol LOC90784 #263

Open pnrobinson opened 2 months ago

pnrobinson commented 2 months ago

This call

protein_meta = pms.annotate(POLR1A_protein_id)

leads to this error

{
    "name": "ValueError",
    "message": "A required `transcripts` field is missing in the response from Variant Validator API: 
{
  \"error\": \"Unable to recognise gene symbol LOC90784\",
  \"requested_symbol\": \"NM_015425.6\"
}",
    "stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 5
      3 txc_service = VVMultiCoordinateService(genome_build=GRCh38)
      4 pms = configure_protein_metadata_service()
----> 5 tx_coordinates = txc_service.fetch(POLR1A_MANE_transcript)
      6 protein_meta = pms.annotate(POLR1A_protein_id)

File ~/GIT/gpsea/src/gpsea/preprocessing/_vv.py:164, in VVMultiCoordinateService.fetch(self, tx)
    162 tx_id = self._parse_tx(tx)
    163 response_json = self.get_response(tx_id)
--> 164 return self.parse_response(tx_id, response_json)

File ~/GIT/gpsea/src/gpsea/preprocessing/_vv.py:195, in VVMultiCoordinateService.parse_response(self, tx_id, response)
    193     raise ValueError(error_string)
    194 if 'transcripts' not in transcript_response:
--> 195     VVMultiCoordinateService._handle_missing_field(
    196         response=response, 
    197         field='transcripts',
    198     )
    199 tx_data = self._find_tx_data(tx_id, transcript_response['transcripts'])
    200 if 'genomic_spans' not in tx_data:

File ~/GIT/gpsea/src/gpsea/preprocessing/_vv.py:259, in VVMultiCoordinateService._handle_missing_field(response, field)
    257 json_formatted_str = json.dumps(response, indent=2)
    258 error_string = f\"A required `{field}` field is missing in the response from Variant Validator API: \
{json_formatted_str}\"
--> 259 raise ValueError(error_string)

ValueError: A required `transcripts` field is missing in the response from Variant Validator API: 
{
  \"error\": \"Unable to recognise gene symbol LOC90784\",
  \"requested_symbol\": \"NM_015425.6\"
}"
}

@ielis Doing this sort of thing via API does seem to be a bit unstable. Rather than raising a ValueError, it might be better simply to print a message for the user, such as:

We are unable to obtain information about the transcript and protein structure via API.
It will not be possible to create a graphic showing the positions of variants on the protein.
Other functionality of GPSEA is not affected.
Consider using the following code to create a table of all variants:
(then show the code for creating the table of all variants)
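
A minimal sketch of the idea above, assuming the remote lookup is wrapped in a small helper. The names `fetch_or_warn`, `FALLBACK_MESSAGE`, and `failing_fetch` are hypothetical and not part of GPSEA; the stand-in function only imitates the ValueError from the traceback so the example runs without network access.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical message text, adapted from the wording proposed in this issue.
FALLBACK_MESSAGE = (
    "We are unable to obtain information about the transcript and protein "
    "structure via API.\n"
    "It will not be possible to create a graphic showing the positions of "
    "variants on the protein.\n"
    "Other functionality of GPSEA is not affected."
)


def fetch_or_warn(fetch: Callable[[str], T], identifier: str) -> Optional[T]:
    """Call `fetch(identifier)`; on ValueError, print a message and return None."""
    try:
        return fetch(identifier)
    except ValueError as error:
        print(FALLBACK_MESSAGE)
        print(f"Details: {error}")
        return None


def failing_fetch(tx_id: str) -> dict:
    """Hypothetical stand-in for a remote call that fails as in the traceback."""
    raise ValueError(f"A required `transcripts` field is missing for {tx_id}")


# The caller gets None instead of an exception and can skip the protein graphic.
result = fetch_or_warn(failing_fetch, "NM_015425.6")
```

With a helper like this, the calling notebook would check for `None` and fall back to the variant table instead of stopping at the traceback.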

pnrobinson commented 2 months ago

We could in principle also allow the user to enter by hand the information needed to create the protein graphic. This would be a nice feature, especially since UniProt does not always have all of the relevant protein domains. I think we basically just need something like this:

Protein ID: ...
Length: ...
Feature 1: 32-56
Feature 2: 77-81
Feature 3: ...

This could probably be created in Excel and then ingested as a pandas DataFrame.
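
As a rough sketch of how such hand-entered input might be read, the snippet below parses the `key: value` layout shown above using only the standard library. The names `ProteinFeature`, `ManualProteinSpec`, and `parse_protein_spec` are hypothetical and not part of GPSEA, and the protein ID and length in the example are made-up placeholders; only the two feature ranges come from this comment.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProteinFeature:
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


@dataclass
class ManualProteinSpec:
    protein_id: str
    length: int
    features: List[ProteinFeature] = field(default_factory=list)


_RANGE = re.compile(r"^(\d+)\s*-\s*(\d+)$")


def parse_protein_spec(text: str) -> ManualProteinSpec:
    """Parse 'Protein ID', 'Length', and 'name: start-end' lines."""
    protein_id = None
    length = None
    features: List[ProteinFeature] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"Expected 'key: value', got {line!r}")
        key, value = key.strip(), value.strip()
        if key.lower() == "protein id":
            protein_id = value
        elif key.lower() == "length":
            length = int(value)
        else:
            match = _RANGE.match(value)
            if match is None:
                raise ValueError(f"Expected 'start-end' for {key!r}, got {value!r}")
            start, end = int(match.group(1)), int(match.group(2))
            if start < 1 or end < start:
                raise ValueError(f"Invalid range for {key!r}: {value!r}")
            features.append(ProteinFeature(key, start, end))
    if protein_id is None or length is None:
        raise ValueError("Both 'Protein ID' and 'Length' are required")
    for feature in features:
        if feature.end > length:
            raise ValueError(f"{feature.name!r} extends past the protein length")
    return ManualProteinSpec(protein_id, length, features)


# Placeholder ID and length; the two ranges are the ones from the example above.
EXAMPLE = """
Protein ID: EXAMPLE_PROTEIN_1
Length: 100
Feature 1: 32-56
Feature 2: 77-81
"""

spec = parse_protein_spec(EXAMPLE)
# A list of dicts like this can be passed to pandas.DataFrame(rows) if desired.
rows = [{"name": f.name, "start": f.start, "end": f.end} for f in spec.features]
```

Validating the ranges against the stated length at parse time means a typo in the spreadsheet surfaces as a clear error before any drawing code runs.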