sbslee / pypgx

A Python package for pharmacogenomics (PGx) research
https://pypgx.readthedocs.io
MIT License
66 stars 13 forks source link

Differences in called star alleles and phenotypes from the same samples: VCF vs BAM as input #123

Closed NourMarzouka closed 8 months ago

NourMarzouka commented 8 months ago

image

I have more than 100 samples. When I use the VCFs (generated by Dragen/Illumina) as input it gives simple output. (right plot) While if I use the bam files as input and let PyPGx do the calling I get much more variants and variability in the outputs regarding star alleles and phenotype (left plot).

Do you have any explanation for this, please? Which caller should I trust?

I used CYP2C19 to avoid any SV interference in the analysis. I have the same observation I found for CYP2D6 where SVs are involved.

Another question, is the ALt-contig aware calling required if the person wants to start with a VCF file?

Thanks!

sbslee commented 8 months ago

@NourMarzouka,

Please refer to the FAQ on Variant caller choice for my general answer.

That being said, I agree that the difference is quite alarming. What's the source of your data? Is this WGS or targeted sequencing? Are these tumor samples or germline samples?

Clearly, the two variant calling approaches are producing vastly different genomic variation landscapes that cannot be explained solely by a choice of variant caller. I think something more fundamental is going on here.

The ALT contigs are only relevant for read alignment for GRCh38 data. For more information, please refer to the GRCh37 vs. GRCh38 page.

NourMarzouka commented 8 months ago

They are germline WGS with good coverage.

On Tue, 12 Mar 2024, 9:44 am sbslee, @.***> wrote:

@NourMarzouka https://github.com/NourMarzouka,

Please refer to the FAQ on Variant caller choice https://pypgx.readthedocs.io/en/latest/faq.html#variant-caller-choice for my general answer.

That being said, I agree that the difference is quite alarming. What's the source of your data? Is this WGS or targeted sequencing? Are these tumor samples or germline samples?

Clearly, the two variant calling approaches are producing vastly different genomic variation landscapes that cannot be explained solely by a choice of variant caller. I think something more fundamental is going on here.

The ALT contigs are only relevant for read alignment for GRCh38 data. For more information, please refer to the GRCh37 vs. GRCh38 https://pypgx.readthedocs.io/en/latest/readme.html#grch37-vs-grch38 page.

— Reply to this email directly, view it on GitHub https://github.com/sbslee/pypgx/issues/123#issuecomment-1990640130, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACE57Z5JNTINLBS4HPYCXMDYX2I4RAVCNFSM6AAAAABEPXRUCWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOJQGY2DAMJTGA . You are receiving this because you were mentioned.Message ID: @.***>

sbslee commented 8 months ago

@NourMarzouka,

Is your data GRCh37 or GRCh38? Also, have you looked at the actual variant(s) that differ between the two approaches? My guess is that the left plot has more variants observed. If you take a look at those under IGV, do they look real? This kind of investigation can reveal what could be wrong.

NourMarzouka commented 8 months ago

They are hg38. I will investigate further the detected snps. And yes PyPGx detected more than dragen.

On Tue, 12 Mar 2024, 10:04 am sbslee, @.***> wrote:

@NourMarzouka https://github.com/NourMarzouka,

Is your data GRCh37 or GRCh38? Also, have you looked at the actual variant(s) that differ between the two approaches? My guess is that the left plot has more variants observed. If you take a look at those under IGV, do they look real? This kind of investigation can reveal what could be wrong.

— Reply to this email directly, view it on GitHub https://github.com/sbslee/pypgx/issues/123#issuecomment-1990731964, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACE57Z5UWFAGY67JTONJMOTYX2LFNAVCNFSM6AAAAABEPXRUCWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOJQG4ZTCOJWGQ . You are receiving this because you were mentioned.Message ID: @.***>

sbslee commented 8 months ago

Yes, do take a look at the observed variants. In the meantime, could you also send me the exact PyPGx commands you used?

NourMarzouka commented 8 months ago

Thank you so much for your helpful advice! I took a closer look at the BAMs and VCFs from Dragen and PGx, and everything looks great with them. It turns out that by mistake some variants were removed from the Dragen VCF which led to fewer variations in the Dragen VCF file and so less variants in the phenotype output.

After fixing the VCF file, the results are very similar, with just a tiny 1-2% difference in the phenotype output because of a phase switch. This is minor, however, it is a bit puzzling as Beagle was used to phase the files before running PyPGx in both, any hints?

You can close the current issue. I am so grateful for your time and assistance.

sbslee commented 8 months ago

@NourMarzouka,

I'm glad to hear that the VCF issue has been resolved. I will close this issue.

As for the question on phasing, note that statistical phasing has some built-in randomness so the results are not 100% deterministic. You can control this, to a certain degree, by setting a seed, etc., so generally your diplotype results are stable for a given input VCF file. But once you start comparing two VCF files generated by using two different callers from the same BAM, then because the input files are 100% identical (e.g., even if 1 variant has a different genotype between the two VCF files) and you can have slightly different diplotype results. This has already been well documented in my previous publications regarding Stargazer and PyPGx. You can find these publications here.

NourMarzouka commented 8 months ago

Thanks for the feedback!

On Thu, 14 Mar 2024, 3:35 am sbslee, @.***> wrote:

@NourMarzouka https://github.com/NourMarzouka,

I'm glad to hear that the VCF issue has been resolved. I will close this issue.

As for the question on phasing, note that statistical phasing has some built-in randomness so the results are not 100% deterministic. You can control this, to a certain degree, by setting a seed, etc., so generally your diplotype results are stable for a given input VCF file. But once you start comparing two VCF files generated by using two different callers from the same BAM, then because the input files are 100% identical (e.g., even if 1 variant has a different genotype between the two VCF files) and you can have slightly different diplotype results. This has already been well documented in my previous publications regarding Stargazer and PyPGx. You can find these publications here https://pypgx.readthedocs.io/en/latest/readme.html#citation.

— Reply to this email directly, view it on GitHub https://github.com/sbslee/pypgx/issues/123#issuecomment-1996085118, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACE57Z2TU3B7BPELCYWZQF3YYDPE3AVCNFSM6AAAAABEPXRUCWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOJWGA4DKMJRHA . You are receiving this because you were mentioned.Message ID: @.***>