Closed Pharmacogenetecist closed 7 months ago
@Pharmacogenetecist,
Thanks for the feature request!
As you may already know, the SampleTable[Results]
file (i.e. results.zip
) already provides allele fraction for each observed variant so that users can manually inspect the quality of CNV calls (See the Results interpretation page for more details). Below is an example output from the GeT-RM WGS tutorial:
$ pypgx print-data grch37-CYP2D6-pipeline/results.zip | head
Genotype Phenotype Haplotype1 Haplotype2 AlternativePhase VariantData CNV
HG00589_PyPGx *1/*21 Intermediate Metabolizer *21;*2; *1; ; *21:22-42524213-C-CG:0.378;*1:22-42522613-G-C,22-42523943-A-G:0.645,0.625;*2:default; Normal
NA07019_PyPGx *1/*4 Intermediate Metabolizer *1; *4;*10;*74;*2; ; *4:22-42524947-C-T:0.452;*10:22-42523943-A-G,22-42526694-G-A:1.0,0.448;*74:22-42525821-G-T:0.424;*1:22-42522613-G-C,22-42523943-A-G:0.361,1.0;*2:default; Normal
NA10851_PyPGx *1/*4 Intermediate Metabolizer *1; *4;*10;*74;*2; ; *4:22-42524947-C-T:0.467;*10:22-42523943-A-G,22-42526694-G-A:0.95,0.421;*74:22-42525821-G-T:0.447;*1:22-42522613-G-C,22-42523943-A-G:0.486,0.95;*2:default; Normal
NA18484_PyPGx *1/*17 Normal Metabolizer *1; *17;*2; ; *17:22-42525772-G-A:0.6;*1:22-42522613-G-C,22-42523943-A-G:0.625,0.391;*2:default; Normal
NA12006_PyPGx *4/*41 Intermediate Metabolizer *41;*2; *4;*10;*2; *69; *69:22-42526694-G-A,22-42523805-C-T:0.473,0.528;*4:22-42524947-C-T:0.448;*10:22-42523943-A-G,22-42526694-G-A:0.545,0.473;*41:22-42523805-C-T:0.528;*2:default; Normal
HG00436_PyPGx *2x2/*71 Indeterminate *71;*1; *2; ; *71:22-42526669-C-T:0.433;*1:22-42522613-G-C,22-42523943-A-G:0.462,0.353;*2:default; WholeDup1
NA19213_PyPGx *1/*1 Normal Metabolizer *1; *1; ; *1:22-42522613-G-C,22-42523943-A-G:1.0,1.0; Normal
NA19207_PyPGx *2x2/*10 Normal Metabolizer *10;*2; *2; ; *10:22-42523943-A-G,22-42526694-G-A:0.366,0.25;*2:default; WholeDup1
NA07029_PyPGx *1/*35 Normal Metabolizer *35;*2; *1; ; *1:22-42522613-G-C,22-42523943-A-G:0.596,0.476;*35:22-42526763-C-T:0.405;*2:default; Normal
If I understood your request correctly, in addition to allele fraction, you want to have allele depth in the SampleTable[Results]
file, right?
I actually have considered this before, but decided not to include allele depth because it was making the file too crowded and this information can be easily found from other sources such as VcfFrame[Consolidated]
(i.e. consolidated-variants.zip
):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00589_PyPGx
22 42512613 . A G . . Phased GT:AD:DP:AF 1|0:11,7:.:0.611,0.389
22 42512657 . A T . . . GT:AD:DP:AF:PE 0|0:22,0:.:1.000,0.000:0,0,0,0
Could you share with me your reasons to add allele depth? I'd be more than happy to think this through again.
Also, did you just want allele depth for the ALT allele or both the REF and ALT alleles? For allele fraction, we just need one value for the ALT because you can easily calculate the REF's fraction by subtracting the value from 1 (assuming the locus is biallelic).
Hello Steven, thank you for getting back to me. My reasoning is that on occasion for some samples and for some regions (from both WES/WGS) data I was getting odd calls. For example in WES - for a cohort of samples I saw multiple 1/1 or 1/3 and when I examined the VCF and then BAM file, I found these particular samples had less than 5 reads. Knowing the exact AD would be beneficial in then flagging the call.
On an unrelated note, but a feature request - Would it be possible to request that the result files are named with the Gene. So if doing CYP2D6 The multiple directories generated would have the CYP2D6_allelles.zip, CYP2D6_allele-fraction-profile, CYP2D6_results.zip. Similarly the contents of the Zip files and the .tsv and metadata file. I ask this, as when exporting specific outputs such as the data.tsv, it would be useful to know which gene they were for. Currently I am running a script to rename the contents of these files, which does not take too long, but just thought I would ask!
Thank you.
@Pharmacogenetecist,
Knowing the exact AD would be beneficial in then flagging the call.
I see what you mean. I agree that AD can be valuable information for users to have. I'd like to sit on it a bit longer to make sure I'm 100% certain what I'm doing :) But please don't forget, this information can always be found from consolidated-variants.zip
.
Would it be possible to request that the result files are named with the Gene.
This is exactly what the metadata is for! Please read the page Archive file, semantic type, and metadata. You can always look at the metadata of an archive file to see which gene it belongs to, etc.
$ pypgx print-metadata grch37-CYP2D6-results.zip
Gene=CYP2D6
Assembly=GRCh37
SemanticType=SampleTable[Results]
@Pharmacogenetecist,
Knowing the exact AD would be beneficial in then flagging the call.
I see what you mean. I agree that AD can be valuable information for users to have. I'd like to sit on it a bit longer to make sure I'm 100% certain what I'm doing :) But please don't forget, this information can always be found from
consolidated-variants.zip
.Would it be possible to request that the result files are named with the Gene.
This is exactly what the metadata is for! Please read the page Archive file, semantic type, and metadata. You can always look at the metadata of an archive file to see which gene it belongs to, etc.
$ pypgx print-metadata grch37-CYP2D6-results.zip Gene=CYP2D6 Assembly=GRCh37 SemanticType=SampleTable[Results]
Thank Steven,
WRT to the second part. As you may recall, I am currently evaluating the use of PYPGX for the analysis of 2000+exomes. I guess the issue I presented relates to "me" more, as I often want to collate the end result of all genes assessed by PYPGX.
Thank you, I will have a look at the consolidated-variants.zip file.
Hello Steven, I am wondering if it would be possible to add the allelic depths to the data.tsv output in the results.zip.
This would be useful in knowing what the read-depth at the specific position was.