Assess nf-core/sarek - Githubissues

muffato commented 5 months ago

We need to review what https://nf-co.re/sarek can do to determine:

if it could be used as is
if it could be used with modifications
if we'd rather extract and replicate some functionality here

In particular, we are interested in:

variant-calling for a single sample on long reads (preferably with DeepVariant)
variant-calling for a single sample on short reads
joint calling for multiple samples on short reads

Task list

[x] Trial nf-core/sarek on some test data (cf #81)
[x] Talk to the Sanger HGI team about nf-core/sarek vs their fork wtsi-hgi/sarek
[x] Check how to get per-sample gVCF (or VCF) from sarek
[x] Check how we get the multi-sample gVCF from sarek
[x] Check if/how bcftools (or vcftools) can convert a gVCF to VCF
[x] Check if/how bcftools (or vcftools) can extract 1 sample from a multi-sample gVCF
[ ] Check how we control the sharding / "intervals" in sarek
[ ] Check if/how the parallelisation works
[ ] Understand if/why nf-core/sarek can output 1 gVCF per individual

Test run 1

Two otter cram data files as input files and GCA_902655055.2_mLutLut1.2_genomic.fna as reference.
Modified line 132 and 133 dbsnp.map{ it -> [[:], []] }, dbsnp_tbi.map{ it -> [[:], []] } in sarek/subworkflows/local/bam_variant_calling_germline_all/main.nf to avoid errors for organisms without dbsnp
Started with --step markduplicates and selected --tools haplotypecaller and turned on joint_germline
See output slides https://docs.google.com/presentation/d/1lo6wNoBlJaQJ8PfIUb7mtqX1CEn3Sj-gj4vgpVN04OY/edit#slide=id.g28127127451_0_19

Summary

Sarek can take multiple fastq files per individual or a single cram file per individual as the input files.
The default interval size is nucleotides_per_second = 200000, which splits the 2.5GB otter reference file into 15 intervals.
In the haplotypecaller step, a g.vcf.gz file is generated per individual per interval.
In the genomicsDBimport step, g.vcf.gz files per individual per interval are imported into a genomicsDB per interval. (not sure about this interpretation)
In the joint calling step, a vcf.gz file is generated per interval.
In the merging step, g.vcf.gz per individual per interval is merged into g.vcf.gz per individual. The vcf.gz per interval is merged into a vcf.gz file.
The final output contains g.vcf.gz files per individual and a joint vcf.gz.
For the n+1 problem, wtsi-hgi/sarek gives an entry point between GATK Haplotypecaller and GATK genomicsDBimport. It seems it will require path(gvcf), path(intervals), path(gvcf_index), path(intervals_index) as the input for starting at GATK genomicsDBimport.
In Sarek, mpileup can generate individual gvcf files but no joint calling. Currently, only GATK can do joint calling.
Many parameters are set to work on human. For non-human data, we had to add --genome null --igenomes_ignore --skip_tools baserecalibrator.
wtsi-hgi/sarek uses a different tool nomenclature: haplotypecaller_vc for per-individual variant-calling, and haplotypecaller_jc for joint calling (as opposite to haplotypecaller in the nf-core sarek, to run all at once).

Next developments

Based on the tests above, to use sarek, we would want to add:

support for joint-calling after bcftools mpileup
a preset for non-human, non-model organisms, to simplify running the pipeline.

hangxue-wustl commented 4 months ago

Tried the following commands that either start from beginning or start from variant calling

nextflow run nf-core/sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/cram_test --input /global/scratch/users/hangxue/otter/otter_cram.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/GCA_902655055.2_mLutLut1.2_genomic.fna --step variant_calling --skip_tools baserecalibrator --joint_germline --tools haplotypecaller

[hangxue@n0133 hangxue]$ nextflow run nf-core/sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/fastq_test --input /global/scratch/users/hangxue/otter/otter_fastq.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/Lutralutra_chr1.fna --skip_tools baserecalibrator --joint_germline --tools haplotypecaller

Both have the same error. Troubleshooting in progress. "ERROR ~ Cannot get property 'baseName' on null object

-- Check script '/global/home/users/hangxue/.nextflow/assets/nf-core/sarek/./workflows/sarek/../../subworkflows/local/bam_ variant_calling_germline_all/main.nf' at line: 133 or see '.nextflow.log' file for more detailsERROR ~ Cannot get property 'baseName' on null object"

hangxue-wustl commented 4 months ago

I think the above error is related to haplotypecaller because the following command is able to start fine. Note that The GATK's Haplotypecaller, Sentieon's Dnascope or Sentieon's Haplotyper should be specified as one of the tools when doing joint germline variant calling. Troubleshooting in progress.

nextflow run nf-core/sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/fastq_test --input /global/scratch/users/hangxue/otter/otter_fastq.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/Lutralutra_chr1.fna --skip_tools baserecalibrator

Update: Issue https://github.com/nf-core/sarek/issues/1550 points out that using haplotypecaller without passing a --dbsnp will cause such error. However, this error is supposed to be fixed four years ago https://github.com/nf-core/sarek/pull/182

hangxue-wustl commented 4 months ago

Related discussions in sarek that talks about joint calling and why n+1 requires rerunning the haplotypecaller for all samples. https://github.com/nf-core/sarek/issues/755 and https://github.com/nf-core/sarek/issues/868 and https://github.com/nf-core/sarek/pull/1172

tldr: when no interval is true, joint germline variant calling will not be performed. When all intervals are produced in one DB, sarek should be able to produce per sample gvcf file and then joint calling. However, if one DB is used per interval, it is quite some work to merge/organize the DBs for n+1.

muffato commented 4 months ago

tldr: it seems that when no interval is used, sarek should be able to produce per sample gvcf file and then joint calling. However, when there are intervals, one DB is used per interval and it is quite some work to merge/organize the DBs for n+1.

Right, I see. In other words, if the genome is small enough / the runtime is reasonable, then variant-calling could be done on the entire genome at once, i.e. without intervals, and then we'd naturally get gVCFs per sample. I don't have any runtime data, but I guess the option to do intervals was introduced because calling variants on a 3 Gbp genome may take a while ? My gut feeling is that we'll face at some point a genome that's large enough and causing a bottleneck

Re. the dbsnp issue, does the suggested workaround work for you ?

hangxue-wustl commented 4 months ago

Re. the dbsnp issue, does the suggested workaround work for you ?

I have not modified the code yet. Not sure if we want to fork the repo and make changes. Using a different caller, sentieon_haplotyper, is able to start the pipeline. sentieon_haplotyper should also be able to produce gvcf file. See https://github.com/nf-core/sarek/pull/1007

[hangxue@n0028 hangxue]$ nextflow run nf-core/sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/fastq_test --input /global/scratch/users/hangxue/otter/otter_fastq.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/Lutralutra_chr1.fna --skip_tools baserecalibrator --joint_germline --tools sentieon_haplotyper --sentieon_haplotyper_emit_mode gvcf

Update: The suggested workaround by modifying line 132 and 133 in sarek/subworkflows/local/bam_variant_calling_germline_all/main.nf is able to start the haplotypecaller without passing a --dbsnp

nextflow run sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/fastq_test --input /global/scratch/users/hangxue/otter/otter_fastq.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/Lutralutra_chr1.fna --skip_tools baserecalibrator --joint_germline --tools haplotypecaller

hangxue-wustl commented 4 months ago

Update for n+1 problem: wtsi-hgi/sarek gives an entry point between GATK Haplotypecaller and GATK genomicsDBimport. It seems it will require path(gvcf), path(intervals), path(gvcf_index), path(intervals_index) as the input for starting at GATK genomicsDBimport.

hangxue-wustl commented 4 months ago

Converting a single Gvcf Files Into Vcf using gvcftools gzip -dc sample.genome.vcf.gz | extract_variants | bgzip -c > sample.vcf.gz

Extract a subset of samples from a multi-sample VCF bcftools view -s samplelist file.gz

hangxue-wustl commented 3 months ago

Update for sarek test run on otter cram data. See output at https://docs.google.com/presentation/d/1lo6wNoBlJaQJ8PfIUb7mtqX1CEn3Sj-gj4vgpVN04OY/edit#slide=id.g28127127451_0_29

In summary, sarek outputs g.vcf.gz files per individual and a joint folder containing vcf.gz in the final output folder. During intermediate steps, it has g.vcf.gz files per individual per interval and a joint calling vcf.gz file per interval

muffato commented 2 months ago

Update: Issue nf-core/sarek#1550 points out that using haplotypecaller without passing a --dbsnp will cause such error. However, this error is supposed to be fixed four years ago nf-core/sarek#182

FYI this is now fixed in release 3.4.4 of sarek

sanger-tol / variantcalling

Assess nf-core/sarek #79

Task list

Test run 1

Summary

Next developments