sbslee / pypgx

A Python package for pharmacogenomics (PGx) research
https://pypgx.readthedocs.io
MIT License
66 stars 13 forks source link

Request - if possible - Allelic Depth #104

Closed Pharmacogenetecist closed 7 months ago

Pharmacogenetecist commented 1 year ago

Hello Steven, I am wondering if it would be possible to add the allelic depths to the data.tsv output in the results.zip.

This would be useful in knowing what the read-depth at the specific position was.

sbslee commented 1 year ago

@Pharmacogenetecist,

Thanks for the feature request!

As you may already know, the SampleTable[Results] file (i.e. results.zip) already provides allele fraction for each observed variant so that users can manually inspect the quality of CNV calls (See the Results interpretation page for more details). Below is an example output from the GeT-RM WGS tutorial:

$ pypgx print-data grch37-CYP2D6-pipeline/results.zip | head
    Genotype        Phenotype       Haplotype1      Haplotype2      AlternativePhase        VariantData     CNV
HG00589_PyPGx       *1/*21  Intermediate Metabolizer        *21;*2; *1;     ;       *21:22-42524213-C-CG:0.378;*1:22-42522613-G-C,22-42523943-A-G:0.645,0.625;*2:default;   Normal
NA07019_PyPGx       *1/*4   Intermediate Metabolizer        *1;     *4;*10;*74;*2;  ;       *4:22-42524947-C-T:0.452;*10:22-42523943-A-G,22-42526694-G-A:1.0,0.448;*74:22-42525821-G-T:0.424;*1:22-42522613-G-C,22-42523943-A-G:0.361,1.0;*2:default;       Normal
NA10851_PyPGx       *1/*4   Intermediate Metabolizer        *1;     *4;*10;*74;*2;  ;       *4:22-42524947-C-T:0.467;*10:22-42523943-A-G,22-42526694-G-A:0.95,0.421;*74:22-42525821-G-T:0.447;*1:22-42522613-G-C,22-42523943-A-G:0.486,0.95;*2:default;     Normal
NA18484_PyPGx       *1/*17  Normal Metabolizer      *1;     *17;*2; ;       *17:22-42525772-G-A:0.6;*1:22-42522613-G-C,22-42523943-A-G:0.625,0.391;*2:default;      Normal
NA12006_PyPGx       *4/*41  Intermediate Metabolizer        *41;*2; *4;*10;*2;      *69;    *69:22-42526694-G-A,22-42523805-C-T:0.473,0.528;*4:22-42524947-C-T:0.448;*10:22-42523943-A-G,22-42526694-G-A:0.545,0.473;*41:22-42523805-C-T:0.528;*2:default;  Normal
HG00436_PyPGx       *2x2/*71        Indeterminate   *71;*1; *2;     ;       *71:22-42526669-C-T:0.433;*1:22-42522613-G-C,22-42523943-A-G:0.462,0.353;*2:default;    WholeDup1
NA19213_PyPGx       *1/*1   Normal Metabolizer      *1;     *1;     ;       *1:22-42522613-G-C,22-42523943-A-G:1.0,1.0;     Normal
NA19207_PyPGx       *2x2/*10        Normal Metabolizer      *10;*2; *2;     ;       *10:22-42523943-A-G,22-42526694-G-A:0.366,0.25;*2:default;      WholeDup1
NA07029_PyPGx       *1/*35  Normal Metabolizer      *35;*2; *1;     ;       *1:22-42522613-G-C,22-42523943-A-G:0.596,0.476;*35:22-42526763-C-T:0.405;*2:default;    Normal

If I understood your request correctly, in addition to allele fraction, you want to have allele depth in the SampleTable[Results] file, right?

I actually have considered this before, but decided not to include allele depth because it was making the file too crowded and this information can be easily found from other sources such as VcfFrame[Consolidated] (i.e. consolidated-variants.zip):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG00589_PyPGx
22  42512613    .   A   G   .   .   Phased  GT:AD:DP:AF 1|0:11,7:.:0.611,0.389
22  42512657    .   A   T   .   .   .   GT:AD:DP:AF:PE  0|0:22,0:.:1.000,0.000:0,0,0,0

Could you share with me your reasons to add allele depth? I'd be more than happy to think this through again.

Also, did you just want allele depth for the ALT allele or both the REF and ALT alleles? For allele fraction, we just need one value for the ALT because you can easily calculate the REF's fraction by subtracting the value from 1 (assuming the locus is biallelic).

Pharmacogenetecist commented 1 year ago

Hello Steven, thank you for getting back to me. My reasoning is that on occasion for some samples and for some regions (from both WES/WGS) data I was getting odd calls. For example in WES - for a cohort of samples I saw multiple 1/1 or 1/3 and when I examined the VCF and then BAM file, I found these particular samples had less than 5 reads. Knowing the exact AD would be beneficial in then flagging the call.

On an unrelated note, but a feature request - Would it be possible to request that the result files are named with the Gene. So if doing CYP2D6 The multiple directories generated would have the CYP2D6_allelles.zip, CYP2D6_allele-fraction-profile, CYP2D6_results.zip. Similarly the contents of the Zip files and the .tsv and metadata file. I ask this, as when exporting specific outputs such as the data.tsv, it would be useful to know which gene they were for. Currently I am running a script to rename the contents of these files, which does not take too long, but just thought I would ask!

Thank you.

sbslee commented 1 year ago

@Pharmacogenetecist,

Knowing the exact AD would be beneficial in then flagging the call.

I see what you mean. I agree that AD can be valuable information for users to have. I'd like to sit on it a bit longer to make sure I'm 100% certain what I'm doing :) But please don't forget, this information can always be found from consolidated-variants.zip.

Would it be possible to request that the result files are named with the Gene.

This is exactly what the metadata is for! Please read the page Archive file, semantic type, and metadata. You can always look at the metadata of an archive file to see which gene it belongs to, etc.

$ pypgx print-metadata grch37-CYP2D6-results.zip
Gene=CYP2D6
Assembly=GRCh37
SemanticType=SampleTable[Results]
Pharmacogenetecist commented 1 year ago

@Pharmacogenetecist,

Knowing the exact AD would be beneficial in then flagging the call.

I see what you mean. I agree that AD can be valuable information for users to have. I'd like to sit on it a bit longer to make sure I'm 100% certain what I'm doing :) But please don't forget, this information can always be found from consolidated-variants.zip.

Would it be possible to request that the result files are named with the Gene.

This is exactly what the metadata is for! Please read the page Archive file, semantic type, and metadata. You can always look at the metadata of an archive file to see which gene it belongs to, etc.

$ pypgx print-metadata grch37-CYP2D6-results.zip
Gene=CYP2D6
Assembly=GRCh37
SemanticType=SampleTable[Results]

Thank Steven,

WRT to the second part. As you may recall, I am currently evaluating the use of PYPGX for the analysis of 2000+exomes. I guess the issue I presented relates to "me" more, as I often want to collate the end result of all genes assessed by PYPGX.

Thank you, I will have a look at the consolidated-variants.zip file.