lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
478 stars 132 forks source link

Depth of fixed genotypes seems incorrect #156

Closed RNieuwenhuis closed 4 years ago

RNieuwenhuis commented 4 years ago

Subject of the issue

Hi,

I have 100 samples in my vcf named 01..100 and want to fix the genotypes using FixVcfMissingGenotypes . To do this I split up my vcf in parts of 50K lines and then run

java -jar fixvcfmissinggenotypes.jar --bams Bams.list -o "Filled/${NUMBER}.vcf" --filter "record.getMappingQuality()<20 || record.getDuplicateReadFlag() || record.getReadFailsVendorQualityCheckFlag() || record.isSecondaryOrSupplementary()" "Part_${NUMBER}.vcf"

The contents of Bams.list are absolute paths to symlinks. This is because the original bam filename does not directly match the sample name. So Bams.list contains:

/path/to/01.bam
/path/to/02.bam
etc.

The output shows consistently much higher values for DP than expected. Alignments show depth between 20 - 40 consistently whereas for all fixed genotypes it is ranging from 100 - 2000.

Example_output.txt

I suspect I misunderstood how the program works with regards to linking the correct bam to the samples. I have for most samples multiple read groups in each bam because they come from multiple sequencing runs. My understanding was that if a sample is called <sample> if will look for the genotype in <sample>.bam. The help of --partition is unclear to me but seems to be of use for solving this problem.

In 01.bam I have for instance multiple read groups that:

@RG     ID:AHHT7NDSXX.L000.01   CN:WageningenUR SM:01   LB:01   PU:AHHT7NDSXX.L000.01   PL:illumina
@RG     ID:BCCTH5ANXX.L003.01   CN:WageningenUR SM:01   LB:01   PU:BCCTH5ANXX.L003.01   PL:illumina
@RG     ID:BCCTLJANXX.L001.01   CN:WageningenUR SM:01   LB:01   PU:BCCTLJANXX.L001.01   PL:illumina
@RG     ID:BHHLM5DSXX.L001.01   CN:WageningenUR SM:01   LB:01   PU:BHHLM5DSXX.L001.01   PL:illumina
@RG     ID:BHHLM5DSXX.L002.01   CN:WageningenUR SM:01   LB:01   PU:BHHLM5DSXX.L002.01   PL:illumina
@RG     ID:BHHLM5DSXX.L003.01   CN:WageningenUR SM:01   LB:01   PU:BHHLM5DSXX.L003.01   PL:illumina
@RG     ID:BHHLM5DSXX.L004.01   CN:WageningenUR SM:01   LB:01   PU:BHHLM5DSXX.L004.01   PL:illumina

Your environment

REDHAT_BUGZILLA_PRODUCT="Scientific Linux 7" REDHAT_BUGZILLA_PRODUCT_VERSION=7.7 REDHAT_SUPPORT_PRODUCT="Scientific Linux" REDHAT_SUPPORT_PRODUCT_VERSION="7.7"



### Steps to reproduce
I can provide an example if necessary but I am guess this is a [PICNIC](https://en.wiktionary.org/wiki/PICNIC#English)

### Expected behaviour
Correct DP value for fixed genotypes

### Actual behaviour
DP value incorrect

Can you please explain what I am doing wrong and please give advise on how to continue instead?
lindenb commented 4 years ago

Hi , I cannot check this problem without having a hand one subset of bam and one variant .

RNieuwenhuis commented 4 years ago

Thanks for your quick reply.

Minimum_example_pierre.tar.gz

Regards,

lindenb commented 4 years ago

ah got it ! My bad: I re-counted the DP for each read group in the same BAM. That's why the DP was so high. Thank you for pointing this ! Tell me if that works now please.

RNieuwenhuis commented 4 years ago

Successful bug fix.

Thanks, and also thank you in general for your efforts in the bioinformatics community as a whole!