Closed sci-kai closed 11 months ago
I see you have "umi_read_structure": "NA"
in your params-file. I wonder if that is valid.
I was able to run
nextflow run nf-core/sarek -r 3.3.2 -profile singularity,test_cache,umi -plugins nf-validation@1.0.0 --outdir results --skip_tools markduplicates,baserecalibrator --tools freebayes,strelka
but when I added --umi_read_structure NA
to that cmd, FastqToBam
crashed with
Argument 'read-structures' could not be constructed from string
...Could not build a 'ReadStructure' from a string: NA: null
Can you please provide the .nextflow.log?
Yes, I did not run the software version dump in this run since this produced another error probably unrelated to this.
@asp8200 I already discussed this setting in issue #802 and it worked in previous versions (but with other data and slightly different configuration). I added the flag --extract-umis-from-read-names
to fgbio in my run.config
file which extracts the UMIs from the read names. I need to give the 'NA' value to --umi_read_structure
to trigger the run of the UMI workflow, otherwise it just skips this process.
do you also have the pipeline_info/execution_trace.html?
execution_trace_2023-11-06_09-55-29.txt Here you go
Hey! I am unable to reproduce. I don't have any UMI "real" data at the moment. Any chance you can extract a few reads and send them as a minimal reproducible example?
I cannot share from the current data, but I try to reproduce the error with another minimal datasets which I can share. I tried myself finding the problem and run Sarek without the UMI workflow, which still produces the same behaviour. Changing the following parameters:
"umi_read_structure": null,
"group_by_umi_strategy": null,
and removing the FASTQTOBAM modification in run.config
(and running for only one sample).
It still stops at the exact same timepoint:
nextflow.log
execution_trace_2023-11-06_11-50-07.txt
very odd. I have non-umi data. I'll check if I see the same when I run it with that. I have a hunch actually, can you check what happens if you run with --split_fastq 0
?
I tested the --split_fastq 0
option, still has the same result.
For the non-UMI workflow I also set --skip_tools null
.
I prepared some reads from the newest GIAB HG008 tumor-normal Illumina WGS data which also reproduced the error, but did not run the UMI workflow.
For clarity I will provide all config files again for this minimal example:
sampledata (HG008, Illumina WGS, first 1000 lines (250 reads) extracted from here): HG008-Illumina-250reads.tar.gz .nextflow.log: nextflow.log execution_trace: execution_trace_2023-11-06_18-03-49.txt
nextflow run nf-core-sarek_3.3.2/3_3_2 \
-profile singularity \
-params-file config/nf-params-GIAB.json \
-plugins nf-validation@1.0.0 \
-resume
samplesheet.csv:
patient,status,lane,sample,sex,fastq_1,fastq_2
HG008,0,HG008-normal,1,XX,data/GIAB_HG008/HG008-N_Illumina_R1.fastq.gz,data/GIAB_HG008/HG008-N_Illumina_R2.fastq.gz
HG008,1,HG008-tumor,1,XX,data/GIAB_HG008/HG008-T_Illumina_R1.fastq.gz,data/GIAB_HG008/HG008-T_Illumina_R2.fastq.gz
nf-params-GIAB.json:
{
"input": "config/samplesheet-GIAB.csv",
"step": "mapping",
"outdir": "results/",
"fasta": "db/references/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta",
"split_fastq": 50000000,
"wes": true,
"intervals": "data/Target_Region_Hg38_V8_chr.bed",
"nucleotides_per_second": 200000.0,
"no_intervals": false,
"tools": "freebayes,mutect2,strelka",
"skip_tools": null,
"trim_fastq": false,
"clip_r1": 0,
"clip_r2": 0,
"three_prime_clip_r1": 0,
"three_prime_clip_r2": 0,
"trim_nextseq": 0,
"save_trimmed": false,
"umi_read_structure": null,
"group_by_umi_strategy": null,
"save_split_fastqs": false,
"aligner": "bwa-mem2",
"save_mapped": true,
"save_output_as_bam": true,
"concatenate_vcfs": false,
"only_paired_variant_calling": true,
"joint_germline": false,
"joint_mutect2": false,
"ascat_min_base_qual": 20.0,
"ascat_min_counts": 10.0,
"ascat_min_map_qual": 35.0,
"cf_coeff": 0.05,
"cf_contamination_adjustment": false,
"cf_contamination": 0.0,
"cf_minqual": 0.0,
"cf_mincov": 0.0,
"cf_ploidy": "2",
"ignore_soft_clipped_bases": false,
"sentieon_haplotyper_emit_mode": "variant",
"vep_cache": "db/vep/",
"vep_include_fasta": false,
"vep_dbnsfp": false,
"dbnsfp_fields": "rs_dbSNP,HGVSc_VEP,HGVSp_VEP,1000Gp3_EAS_AF,1000Gp3_AMR_AF,LRT_score,GERP++_RS,gnomAD_exomes_AF",
"vep_loftee": false,
"vep_spliceai": false,
"vep_spliceregion": false,
"vep_custom_args": "--everything --filter_common --per_gene --total_length --offline --format vcf",
"use_annotation_cache_keys": false,
"vep_out_format": "vcf",
"genome": "GATK.GRCh38",
"save_reference": true,
"build_only_index": false,
"download_cache": false,
"igenomes_base": "db/references",
"igenomes_ignore": false,
"custom_config_version": "master",
"custom_config_base": "nf-core-sarek_3.3.2/configs",
"test_data_base": null,
"seq_platform": "ILLUMINA",
"max_cpus": 16,
"max_memory": "128.GB",
"max_time": "240.h",
"help": false,
"version": false,
"publish_dir_mode": "copy",
"plaintext_email": false,
"max_multiqc_email_size": "25.MB",
"monochrome_logs": false,
"validate_params": true,
"validationShowHiddenParams": false,
"validationFailUnrecognisedParams": false,
"validationLenientMode": false,
"snpeff_db": null,
}
Same thing happened here.
Can you test with the test_full profile to see if that works you?
I now also tried running it locally and a colleague at another institute tried the minimal example, both giving the same error as described.
The test
profile is working as expected also running the variant calling, while the test_full
is getting stucked and does not finish at all. I am not sure if it just needs a lot of time and this is expected?
Here is the nextflow.log and execution trace from the manually interrupted test_full
run:
nextflow.log
execution_trace_2023-11-10_13-12-01.txt
I identified the problem:
Within the samplesheet, I put the wrong header for lane
and sample
. Hence, I specified two different lanes ("HG008-normal" and "HG008-tumor") for one sample ("1") which then does not performs somatic variant calling.
Fixing this with switching the header labels solved this issue.
Is the issue solved for you then?
Fantastic solve 🚀 . I am not sure we can lint for this, but we should keep it in mind
Description of the bug
Hi, I recently run sarek with the UMI pipeline for WES data, it successfully finishs but only performs alignment and does not perform variant calling. For the UMI pipeline I skipped markduplicates and baserecalibration and extract read names from FASTQ file according to issue #746 . Also, this workflow is running in offline mode. I do not know what configuration to change so the workflow also perform variant calling.
Command used and terminal output
The workflow performs software dump and multiQC after finishing the following process for each sample:
NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM2_MEM
{ "input": "config/samplesheet.csv", "step": "mapping", "outdir": "results/", "fasta": "db/references/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta", "split_fastq": 50000000, "wes": true, "intervals": "data/Target_Region_Hg38_V8_chr.bed", "nucleotides_per_second": 200000.0, "no_intervals": false, "tools": "freebayes,mutect2,strelka", "skip_tools": "markduplicates,baserecalibrator", "trim_fastq": false, "clip_r1": 0, "clip_r2": 0, "three_prime_clip_r1": 0, "three_prime_clip_r2": 0, "trim_nextseq": 0, "save_trimmed": false, "umi_read_structure": "NA", "group_by_umi_strategy": "Adjacency", "save_split_fastqs": false, "aligner": "bwa-mem2", "save_mapped": true, "save_output_as_bam": true, "concatenate_vcfs": false, "only_paired_variant_calling": true, "joint_germline": false, "joint_mutect2": false, "ascat_min_base_qual": 20.0, "ascat_min_counts": 10.0, "ascat_min_map_qual": 35.0, "cf_coeff": 0.05, "cf_contamination_adjustment": false, "cf_contamination": 0.0, "cf_minqual": 0.0, "cf_mincov": 0.0, "cf_ploidy": "2", "ignore_soft_clipped_bases": false, "sentieon_haplotyper_emit_mode": "variant", "vep_cache": "db/vep/", "vep_include_fasta": false, "vep_dbnsfp": false, "dbnsfp_fields": "rs_dbSNP,HGVSc_VEP,HGVSp_VEP,1000Gp3_EAS_AF,1000Gp3_AMR_AF,LRT_score,GERP++_RS,gnomAD_exomes_AF", "vep_loftee": false, "vep_spliceai": false, "vep_spliceregion": false, "vep_custom_args": "--everything --filter_common --per_gene --total_length --offline --format vcf", "use_annotation_cache_keys": false, "vep_out_format": "vcf", "genome": "GATK.GRCh38", "save_reference": true, "build_only_index": false, "download_cache": false, "igenomes_base": "db/references", "igenomes_ignore": false, "custom_config_version": "master", "custom_config_base": "nf-core-sarek_3.3.2/configs", "test_data_base": null, "seq_platform": "ILLUMINA", "max_cpus": 16, "max_memory": "128.GB", "max_time": "240.h", "help": false, "version": false, "publish_dir_mode": "copy", "plaintext_email": false, "max_multiqc_email_size": "25.MB", "monochrome_logs": false, "validate_params": true, "validationShowHiddenParams": false, "validationFailUnrecognisedParams": false, "validationLenientMode": false, "snpeff_db": null, }