luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
299 stars 37 forks source link

Variant calling in population mode (Questions) #217

Closed Danisov closed 2 years ago

Danisov commented 2 years ago

Hi Dan,

I am a new user of octopus, I wanted to use it to detect small variant germlines in population mode. In total, I have 25 unrelated WGS (homo sapiens). Before running octopus I am understanding the appropriate options for my cohort.

In the documentation, it is mentioned that octopus will select the 'best' max-genotype-combinations. Increasing this value may improve accuracy. (i) what is the best value of "max-genotype-combinations" on which octopus is based for this number of samples (25 wgs)? (ii) What is the maximum number of WGS samples to be used for joint variant calling in population mode (In the commands below I will use 25 WGS) (iii) Still in population mode, does octopus take each sample into consideration when calling haploid variants on the chrX of male samples (please see the command I will use below).

Thanks in advance, Danisov

octopus version 0.7.4
Target: x86_64 Linux 4.9.0-6-amd64
SIMD extension: SSE2
Compiler: GNU 9.4.0
Boost: 1_69
/octopus-0.7.4/bin/octopus \
     -R GRCh38.fa \
     -I sample1.bam sample2.bam sample3.bam...sample25.bam \
     -p sample5:X=1 \
     -p sample10:X=1 \
     -p sample20:X=1 \
     -p sample24:X=1 \
     --forest-model ${forest} \
     -o ${output}/octopus.cohort.vcf.gz \
     --sequence-error-model PCR-FREE.NOVASEQ \
     --threads 16 \
     --debug
dancooke commented 2 years ago

Hi @Danisov, sorry for the slow reply.

(i) what is the best value of "max-genotype-combinations" on which octopus is based for this number of samples (25 wgs)?

Depends on your definition of "best" - a higher value will result in better quality score (e.g. GQ) resolution and should improve accuracy, but runtime is linearly dependent on this parameter. For 25 samples the default should strike a reasonable balance.

(ii) What is the maximum number of WGS samples to be used for joint variant calling in population mode (In the commands below I will use 25 WGS)

There's no enforced limit, but performance (accuracy or runtime) on large cohorts hasn't been evaluated. A strategy for calling large cohorts is outlined in the docs, but this is untested.

(iii) Still in population mode, does octopus take each sample into consideration when calling haploid variants on the chrX of male samples (please see the command I will use below).

Yes, but you should also set the Y ploidy to 0 for female samples (e.g. sample2:Y=0).

Danisov commented 2 years ago

Hello Dan,

Thanks for your feedback, I changed the method I mentioned above and applied the large cohort calling strategy that is described in the docs. I have called individually each genome up to 20 (as recommended in the document), please see the first commands. Then I added the other five genomes, but not sure whether to also add the first 20 bam files in input with the five new bam. From the second step, please see the second commands, I added the first 20 files bam with five new bam in input), and the same in option "-c" is this method correct?

Thanks Danisov


$ octopus  -R GRCh38.fa -I sample1.bam -p sample1:X=1 --forest-model ${forest} --sequence-error-model PCR-FREE.NOVASEQ --threads 16 -o sample1.bcf
$ octopus  -R GRCh38.fa -I sample2.bam -p sample2:Y=0 --forest-model ${forest} --sequence-error-model PCR-FREE.NOVASEQ --threads 16 -o sample2.bcf
$ octopus  -R GRCh38.fa -I sampleN20.bam  ${forest} --sequence-error-model PCR-FREE.NOVASEQ --threads 16 -o sampleN20.bcf  
$ octopus -R GRCh38.fa-I sample1.bam sample2.bam ...sampleN20.bam \
--forest-model ${forest} \
--sequence-error-model PCR-FREE.NOVASEQ \
 --disable-denovo-variant-discovery \
 -c sample1.bcf sample2.bcf sam
```pleN20.bcf \
 -o octopus.joint_20.samples.bcf

$ octopus  -R GRCh38.fa -I sample21_new.bam -p sample21_new:X=1 --forest-model ${forest} --sequence-error-model PCR-FREE.NOVASEQ --threads 16 -o sample21.bcf
$ octopus  -R GRCh38.fa -I sample22_new.bam -p sample22_new:Y=0 --forest-model ${forest} --sequence-error-model PCR-FREE.NOVASEQ --threads 16 -o sample22.bcf
$ octopus  -R GRCh38.fa -I sampleN25.bam  ${forest} sampleN25:X=1 --sequence-error-model PCR-FREE.NOVASEQ --threads 16 -o sampleN25.bcf

$ octopus -R GRCh38.fa -I sample1.bam...sample25.bam \
--forest-model ${forest} \
--sequence-error-model PCR-FREE.NOVASEQ \
 --disable-denovo-variant-discovery \
 -c  octopus.joint_20.samples.bcf sample21.bcf  sample21.bcf sample22.bcf sample23.bcf sample24.bcf sample25.bcf \
 -o octopus.joint_25.samples.bcf
Talo07 commented 2 years ago

Hi,

To perform the variant call reliably. I had almost the same questions regarding the second step of this strategy, is it recommended that the second step grouping for WGS (human 40X) is 20 samples maximum? and is that still the case for the exome? and do we include the bam files from the first step or just their bcf files with the "-c" option?

Thank you in advance for your answer, Talo

Danisov commented 2 years ago

Unless I am mistaken, I did not receive an answer to my question which was sent 24 days ago.

For the news on these steps, after 20 genomes individual variant calling, I performed the second step, this step took 10 days to run (with the --fast option) and gave an error on the forest file v0.7.4. This same file I used for individual variant calling and it worked fine.

Do you have any suggestions on the previous questions and this bug ?

Thank you, Danisov

dancooke commented 2 years ago

@Danisov

I have called individually each genome up to 20 (as recommended in the document), please see the first commands. Then I added the other five genomes, but not sure whether to also add the first 20 bam files in input with the five new bam. From the second step, please see the second commands, I added the first 20 files bam with five new bam in input), and the same in option "-c" is this method correct?

I think for your case you can forgo the 'intermediate' batching step and just call all 25 samples jointly - the 20 samples mentioned in docs is only meant as a guide and has not been tested. Note that the primary purpose of the outlined calling strategy is to reduce runtime - if you find runtimes are reasonable joint calling all samples without the intermediate steps then just do that.

For the news on these steps, after 20 genomes individual variant calling, I performed the second step, this step took 10 days to run (with the --fast option) and gave an error on the forest file v0.7.4. This same file I used for individual variant calling and it worked fine.

It looks like you're running the last step without multithreading? You can also use the --threads option here to speed things up. Please post a separate issue regarding the error.

@Talo07

is it recommended that the second step grouping for WGS (human 40X) is 20 samples maximum?

Essentially no - see answer to @Danisov above.

is that still the case for the exome?

For exome you can probably call larger cohorts without intermediae 'batching' as runtimes are usually a lot less of concern (unless depth is very high).

do we include the bam files from the first step or just their bcf files with the "-c" option?

You always need to include the BAM files - Octopus doesn't have functionality like GATKs GenotypeGVCFs.

IdoBar commented 1 year ago

Hi @dancooke ,

I'm trying to use octopus on a large cohort of >250 WGS samples of a fungal pathogen (42Mb haploid genome). I followed the guidelines for variant calling in population mode and called them first individually, then combined the individual files in 13 batches of 20 samples, followed by a final joint calling of those 13 joint-called files (at least that's how I interpreted the approach, correct me if I'm wrong). It all worked well and quickly (~20-40 minutes for each batch) until the final joint calling step that's been running for 21 hours and is predicted to finish in a week!

Am I missing something there? Is there another way to improve performance? I am using multithreads.

Thanks, Ido

IdoBar commented 1 year ago

Just following back on this, is it possible to call variants in each chromosome in parallel and then combine them together (like recommended in Freebayes)?

Thanks, Ido

jelber2 commented 1 year ago

You can call variants in parallel, if you customize some things. I did this https://github.com/jelber2/variant-calling/blob/c19267f0285007d9cb38c7e8d1756505a2eeb11f/da0028.md when I was at ISTA for C elegans variants. Not a very elegant way of doing it, but it worked on our cluster pretty well. Not sure about it working with population mode in octopus though.

On Sat, Jun 3, 2023, 10:24 Ido Bar @.***> wrote:

Just following back on this, is it possible to call variants in each chromosome in parallel and then combine them together (like recommended in Freebayes https://github.com/freebayes/freebayes#parallelisation)?

Thanks, Ido

— Reply to this email directly, view it on GitHub https://github.com/luntergroup/octopus/issues/217#issuecomment-1574778273, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNND2WGX64LQZ3BRARUUBLXJLYEFANCNFSM5HCP3U4A . You are receiving this because you are subscribed to this thread.Message ID: @.***>

IdoBar commented 1 year ago

Thanks for your suggestion @jelber2 , I can easily follow this workflow, but to my understanding joint-calling multiple samples is more accurate than calling them individually and then merging them (particularly with haplotype callers such as Octopus and Freebayes, see here and here). It also provides far more accurate loci information when considering all samples.

I also didn't understand the last sentence in the population tutorial mentioning -g and -a flags (which I wasn't able to find in the documentation). image

jelber2 commented 1 year ago

I apologize @IdoBar , I guess what my new suggestion would be to do is to actually split your 13 joint-called variant files and your 13 'joint-called' BAM files if what you are following the tutorial does not finish in a week or fails. Does that make sense to have maybe per chromosome/contig joint-called variant and BAM files? If they are per contig/chromosome, then that cannot affect the phasing results.