Open muffato opened 5 months ago
Tried the following commands that either start from beginning or start from variant calling
nextflow run nf-core/sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/cram_test --input /global/scratch/users/hangxue/otter/otter_cram.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/GCA_902655055.2_mLutLut1.2_genomic.fna --step variant_calling --skip_tools baserecalibrator --joint_germline --tools haplotypecaller
[hangxue@n0133 hangxue]$ nextflow run nf-core/sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/fastq_test --input /global/scratch/users/hangxue/otter/otter_fastq.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/Lutralutra_chr1.fna --skip_tools baserecalibrator --joint_germline --tools haplotypecaller
Both have the same error. Troubleshooting in progress. "ERROR ~ Cannot get property 'baseName' on null object
-- Check script '/global/home/users/hangxue/.nextflow/assets/nf-core/sarek/./workflows/sarek/../../subworkflows/local/bam_ variant_calling_germline_all/main.nf' at line: 133 or see '.nextflow.log' file for more detailsERROR ~ Cannot get property 'baseName' on null object"
I think the above error is related to haplotypecaller because the following command is able to start fine. Note that The GATK's Haplotypecaller, Sentieon's Dnascope or Sentieon's Haplotyper should be specified as one of the tools when doing joint germline variant calling. Troubleshooting in progress.
nextflow run nf-core/sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/fastq_test --input /global/scratch/users/hangxue/otter/otter_fastq.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/Lutralutra_chr1.fna --skip_tools baserecalibrator
Update: Issue https://github.com/nf-core/sarek/issues/1550 points out that using haplotypecaller without passing a --dbsnp will cause such error. However, this error is supposed to be fixed four years ago https://github.com/nf-core/sarek/pull/182
Related discussions in sarek that talks about joint calling and why n+1 requires rerunning the haplotypecaller for all samples. https://github.com/nf-core/sarek/issues/755 and https://github.com/nf-core/sarek/issues/868 and https://github.com/nf-core/sarek/pull/1172
tldr: when no interval is true, joint germline variant calling will not be performed. When all intervals are produced in one DB, sarek should be able to produce per sample gvcf file and then joint calling. However, if one DB is used per interval, it is quite some work to merge/organize the DBs for n+1.
tldr: it seems that when no interval is used, sarek should be able to produce per sample gvcf file and then joint calling. However, when there are intervals, one DB is used per interval and it is quite some work to merge/organize the DBs for n+1.
Right, I see. In other words, if the genome is small enough / the runtime is reasonable, then variant-calling could be done on the entire genome at once, i.e. without intervals, and then we'd naturally get gVCFs per sample. I don't have any runtime data, but I guess the option to do intervals was introduced because calling variants on a 3 Gbp genome may take a while ? My gut feeling is that we'll face at some point a genome that's large enough and causing a bottleneck
Re. the dbsnp issue, does the suggested workaround work for you ?
Re. the dbsnp issue, does the suggested workaround work for you ?
I have not modified the code yet. Not sure if we want to fork the repo and make changes. Using a different caller, sentieon_haplotyper, is able to start the pipeline. sentieon_haplotyper should also be able to produce gvcf file. See https://github.com/nf-core/sarek/pull/1007
[hangxue@n0028 hangxue]$ nextflow run nf-core/sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/fastq_test --input /global/scratch/users/hangxue/otter/otter_fastq.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/Lutralutra_chr1.fna --skip_tools baserecalibrator --joint_germline --tools sentieon_haplotyper --sentieon_haplotyper_emit_mode gvcf
Update: The suggested workaround by modifying line 132 and 133 in sarek/subworkflows/local/bam_variant_calling_germline_all/main.nf is able to start the haplotypecaller without passing a --dbsnp
nextflow run sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/fastq_test --input /global/scratch/users/hangxue/otter/otter_fastq.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/Lutralutra_chr1.fna --skip_tools baserecalibrator --joint_germline --tools haplotypecaller
Update for n+1 problem: wtsi-hgi/sarek gives an entry point between GATK Haplotypecaller and GATK genomicsDBimport. It seems it will require path(gvcf), path(intervals), path(gvcf_index), path(intervals_index) as the input for starting at GATK genomicsDBimport.
Converting a single Gvcf Files Into Vcf using gvcftools gzip -dc sample.genome.vcf.gz | extract_variants | bgzip -c > sample.vcf.gz
Extract a subset of samples from a multi-sample VCF bcftools view -s samplelist file.gz
Update for sarek test run on otter cram data. See output at https://docs.google.com/presentation/d/1lo6wNoBlJaQJ8PfIUb7mtqX1CEn3Sj-gj4vgpVN04OY/edit#slide=id.g28127127451_0_29
In summary, sarek outputs g.vcf.gz files per individual and a joint folder containing vcf.gz in the final output folder. During intermediate steps, it has g.vcf.gz files per individual per interval and a joint calling vcf.gz file per interval
Update: Issue nf-core/sarek#1550 points out that using haplotypecaller without passing a --dbsnp will cause such error. However, this error is supposed to be fixed four years ago nf-core/sarek#182
FYI this is now fixed in release 3.4.4 of sarek
We need to review what https://nf-co.re/sarek can do to determine:
In particular, we are interested in:
Task list
Test run 1
GCA_902655055.2_mLutLut1.2_genomic.fna
as reference.dbsnp.map{ it -> [[:], []] }, dbsnp_tbi.map{ it -> [[:], []] }
in sarek/subworkflows/local/bam_variant_calling_germline_all/main.nf to avoid errors for organisms without dbsnpSummary
path(gvcf), path(intervals), path(gvcf_index), path(intervals_index)
as the input for starting at GATK genomicsDBimport.--genome null --igenomes_ignore --skip_tools baserecalibrator
.haplotypecaller_vc
for per-individual variant-calling, andhaplotypecaller_jc
for joint calling (as opposite tohaplotypecaller
in the nf-core sarek, to run all at once).Next developments
Based on the tests above, to use sarek, we would want to add:
bcftools mpileup