ielis closed 8 months ago
There are phenopackets with structural variants for which we only have the label, and not the contig. This is going to be the case for everything that was identified before GS. I do not think that the CNVs should be required to have the contig.
Can you please include an example of these cases?
We may run into issues with such variants. I am not sure how to perform functional annotation without contig info. I think VEP won't talk to us.
Related to #120
We need to fix bug #83, where ingest fails because a variant is on the hg19 genome build while the app uses hg38.
Background
The `PhenopacketVariantCoordinateFinder` is responsible for turning `GenomicInterpretation` from Phenopacket Schema into `VariantCoordinates`. The `vcf_record` field of the `GenomicInterpretation` has a `genome_assembly` subfield that should contain the build of the variant in a usable format.

In case of CNVs that use VRS elements, we can use the `sequence_id` to test if we're on the right build. The example in phenopacket docs lists an allele with a `sequence_id==NC_000010.11`. The RefSeq identifier corresponds to `chr10` in the GRCh38.p13 build. We know this based on the assembly report tables that are in our code base. Upon inspection of both tables, we can only find the corresponding contig in GRCh38.p13 (`chr10` in GRCh37.p13 corresponds to `NC_000010.10`, note the difference in version).

`PhenopacketVariantCoordinateFinder`, the parsing code, knows about `GenomeBuild` (field `self._build`), which has an `identifier` property. The property has the following values: `{'GRCh37.p13', 'GRCh38.p13'}`. Therefore, we can match the `identifier` with the variant's build to check that the variant uses the right build.

Definition of done
`PhenopacketVariantCoordinateFinder` matches the `genome_assembly` field to the genome build's `identifier` and raises an exception if the assemblies don't match. We can be permissive in value matching:

| `GenomeBuild.identifier` | Accepted `genome_assembly` values |
|---|---|
| `GRCh37.p13` | `grch37`, `GRCh37`, `GRCh37.p13`, `hg19`, `HG19`, ... |
| `GRCh38.p13` | `grch38`, `GRCh38`, `GRCh38.p13`, `hg38`, `HG38`, ... |

`PhenopacketVariantCoordinateFinder` raises an exception if the `sequence_id` is not among the contigs of the build. Here, the match must be exact.

The exception is a `ValueError` with a helpful error message. Note that in the `PhenopacketVariantCoordinateFinder` code we may not have enough context to, e.g., report the ID of the offending phenopacket. Therefore, we must catch any errors in an upstream class and re-raise with a better message.
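The three checks above could be sketched roughly as follows. This is only an illustration, not the actual implementation: the alias table `ASSEMBLY_ALIASES` and the helpers `check_assembly`, `check_sequence_id`, and `with_phenopacket_context` are hypothetical names. The sketch also assumes that the permissive matching amounts to a case-insensitive lookup and that the contigs of a build are available as a collection of RefSeq identifiers.

```python
from typing import Callable, Iterable

# Hypothetical alias table: lower-cased `genome_assembly` value -> GenomeBuild.identifier.
ASSEMBLY_ALIASES = {
    "grch37": "GRCh37.p13",
    "grch37.p13": "GRCh37.p13",
    "hg19": "GRCh37.p13",
    "grch38": "GRCh38.p13",
    "grch38.p13": "GRCh38.p13",
    "hg38": "GRCh38.p13",
}


def check_assembly(genome_assembly: str, build_identifier: str) -> None:
    """Permissive check: raise ValueError unless both values name the same build."""
    resolved = ASSEMBLY_ALIASES.get(genome_assembly.strip().lower())
    if resolved is None:
        raise ValueError(f"Unrecognized genome assembly {genome_assembly!r}")
    if resolved != build_identifier:
        raise ValueError(
            f"Variant uses {genome_assembly!r} ({resolved}) "
            f"but the genome build is {build_identifier}"
        )


def check_sequence_id(
    sequence_id: str, contig_ids: Iterable[str], build_identifier: str
) -> None:
    """Exact check: the RefSeq identifier, including its version, must be a contig of the build."""
    if sequence_id not in set(contig_ids):
        raise ValueError(
            f"Sequence {sequence_id!r} is not among the contigs of {build_identifier}"
        )


def with_phenopacket_context(
    phenopacket_id: str, check: Callable[..., None], *args
) -> None:
    """Upstream wrapper: re-raise with the id of the offending phenopacket."""
    try:
        check(*args)
    except ValueError as error:
        raise ValueError(f"Phenopacket {phenopacket_id}: {error}") from error
```

With this sketch, `check_assembly("HG38", "GRCh38.p13")` passes silently, while `check_sequence_id("NC_000010.10", ["NC_000010.11"], "GRCh38.p13")` raises a `ValueError` because the versions differ. Keeping the alias table explicit, rather than stripping suffixes with a regular expression, makes the set of accepted values easy to review.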