maxplanck-ie / snakepipes

Customizable workflows based on snakemake and python for the analysis of NGS data
http://snakepipes.readthedocs.io
378 stars 85 forks source link

Error in WGBS workflow (invalid 'times' argument) #898

Closed kaznt closed 1 year ago

kaznt commented 1 year ago

Hi team,

I ran WGBS workflow with publically available WGBS dataset and met an error. I checked the issuses but couldn't find same error I met. Below, I attached the logs.

Thank you for your help.

/home/user/miniconda3/envs/snakePipes/bin/WGBS -i input -o output GRCh38_gencode_release29 --local -j 30

TMPDIR=/tmp PYTHONNOUSERSITE=True /home/user/miniconda3/envs/snakePipes/bin/snakemake --use-conda --conda-prefix /home/user/miniconda3/envs --latency-wait 300 --snakefile /home/user/miniconda3/envs/snakePipes/lib/python3.11/site-packages/snakePipes/workflows/WGBS/Snakefile --jobs 30 --directory /home/user/230523bisulfite_test/output --configfile /home/user/230523bisulfite_test/output/WGBS.config.yaml --keep-going

/home/user/miniconda3/envs/snakePipes/lib/python3.11/site-packages/google/protobuf/internal/api_implementation.py:110: UserWarning: Selected implementation cpp is not available. warnings.warn( /home/user/miniconda3/envs/snakePipes/lib/python3.11/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

---- This analysis has been done using snakePipes version 2.7.3 ---- Building DAG of jobs... Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'. Using shell: /bin/bash Provided cores: 30 Rules claiming more threads will be scaled down. Job stats: job count min threads max threads


DepthOfCov 1 20 20 DepthOfCovGenome 1 20 20 FASTQ1 1 1 1 FASTQ2 1 1 1 all 1 1 1 bedGraphToBigWig 1 1 1 bwameth 1 20 20 calcCHHbias 1 10 10 calc_Mbias 1 10 10 conversionRate 1 1 1 getRandomCpGs 1 1 1 get_flagstat 1 1 1 indexMarkDupes 1 1 1 index_bam 1 1 1 link_deduped_bam 1 1 1 markDupes 1 10 10 methyl_extract 1 10 10 multiQC 1 1 1 origFASTQ1 1 1 1 origFASTQ2 1 1 1 produceReport 1 1 1 total 21 1 20

Select jobs to execute...

[Wed May 24 19:17:31 2023] rule getRandomCpGs: output: QC_metrics/randomCpG.bed jobid: 12 reason: Missing output files: QC_metrics/randomCpG.bed resources: tmpdir=/tmp

[Wed May 24 19:17:31 2023] rule origFASTQ1: input: /home/user/230523bisulfite_test/input/SRR17642783_R1.fastq.gz output: originalFASTQ/SRR17642783_R1.fastq.gz jobid: 5 reason: Missing output files: originalFASTQ/SRR17642783_R1.fastq.gz wildcards: sample=SRR17642783 resources: tmpdir=/tmp

[Wed May 24 19:17:31 2023] rule origFASTQ2: input: /home/user/230523bisulfite_test/input/SRR17642783_R2.fastq.gz output: originalFASTQ/SRR17642783_R2.fastq.gz jobid: 7 reason: Missing output files: originalFASTQ/SRR17642783_R2.fastq.gz wildcards: sample=SRR17642783 resources: tmpdir=/tmp

[Wed May 24 19:17:31 2023] Finished job 5. 1 of 21 steps (5%) done Select jobs to execute...

[Wed May 24 19:17:31 2023] rule FASTQ1: input: originalFASTQ/SRR17642783_R1.fastq.gz output: FASTQ/SRR17642783_R1.fastq.gz jobid: 4 reason: Missing output files: FASTQ/SRR17642783_R1.fastq.gz; Input files updated by another job: originalFASTQ/SRR17642783_R1.fastq.gz wildcards: sample=SRR17642783 resources: tmpdir=/tmp

[Wed May 24 19:17:31 2023] Finished job 7. 2 of 21 steps (10%) done Select jobs to execute...

[Wed May 24 19:17:31 2023] rule FASTQ2: input: originalFASTQ/SRR17642783_R2.fastq.gz output: FASTQ/SRR17642783_R2.fastq.gz jobid: 6 reason: Missing output files: FASTQ/SRR17642783_R2.fastq.gz; Input files updated by another job: originalFASTQ/SRR17642783_R2.fastq.gz wildcards: sample=SRR17642783 resources: tmpdir=/tmp

[Wed May 24 19:17:31 2023] Finished job 4. 3 of 21 steps (14%) done [Wed May 24 19:17:31 2023] Finished job 6. 4 of 21 steps (19%) done Select jobs to execute...

[Wed May 24 19:17:31 2023] rule bwameth: input: FASTQ/SRR17642783_R1.fastq.gz, FASTQ/SRR17642783_R2.fastq.gz output: bwameth/SRR17642783.bam log: bwameth/logs/SRR17642783.map_reads.err, bwameth/logs/SRR17642783.map_reads.out jobid: 3 reason: Missing output files: bwameth/SRR17642783.bam; Input files updated by another job: FASTQ/SRR17642783_R1.fastq.gz, FASTQ/SRR17642783_R2.fastq.gz wildcards: sample=SRR17642783 threads: 20 resources: tmpdir=/tmp

Activating conda environment: ../../miniconda3/envs/e6ce94c5f4ccaba458228e8445f6d20d /home/user/miniconda3/envs/snakePipes/lib/python3.11/site-packages/google/protobuf/internal/api_implementation.py:110: UserWarning: Selected implementation cpp is not available. warnings.warn( /home/user/miniconda3/envs/snakePipes/lib/python3.11/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

---- This analysis has been done using snakePipes version 2.7.3 ---- Building DAG of jobs... Using shell: /bin/bash Provided cores: 30 Rules claiming more threads will be scaled down. Select jobs to execute... [Wed May 24 19:21:47 2023] Finished job 12. 5 of 21 steps (24%) done [Fri May 26 05:46:31 2023] Finished job 3. 6 of 21 steps (29%) done Select jobs to execute...

[Fri May 26 05:46:31 2023] rule index_bam: input: bwameth/SRR17642783.bam output: bwameth/SRR17642783.bam.bai log: bwameth/logs/SRR17642783.index_bam.err, bwameth/logs/SRR17642783.index_bam.out jobid: 8 reason: Missing output files: bwameth/SRR17642783.bam.bai; Input files updated by another job: bwameth/SRR17642783.bam wildcards: sample=SRR17642783 resources: tmpdir=/tmp

Activating conda environment: ../../miniconda3/envs/081ae7e483778bba002ffd0e803c7dac [Fri May 26 06:00:18 2023] Finished job 8. 7 of 21 steps (33%) done Select jobs to execute...

[Fri May 26 06:00:18 2023] rule markDupes: input: bwameth/SRR17642783.bam, bwameth/SRR17642783.bam.bai output: Sambamba/SRR17642783.markdup.bam log: Sambamba/logs/SRR17642783.rm_dupes.err, Sambamba/logs/SRR17642783.rm_dupes.out jobid: 2 reason: Missing output files: Sambamba/SRR17642783.markdup.bam; Input files updated by another job: bwameth/SRR17642783.bam, bwameth/SRR17642783.bam.bai wildcards: sample=SRR17642783 threads: 10 resources: tmpdir=/tmp

Activating conda environment: ../../miniconda3/envs/a2a5d9f83fe05e456d4c23bc00993556 [Fri May 26 10:03:51 2023] Finished job 2. 8 of 21 steps (38%) done Removing temporary output bwameth/SRR17642783.bam. Removing temporary output bwameth/SRR17642783.bam.bai. Select jobs to execute...

[Fri May 26 10:04:05 2023] rule indexMarkDupes: input: Sambamba/SRR17642783.markdup.bam output: Sambamba/SRR17642783.markdup.bam.bai log: Sambamba/logs/SRR17642783.indexMarkDupes.err, Sambamba/logs/SRR17642783.indexMarkDupes.out jobid: 9 reason: Missing output files: Sambamba/SRR17642783.markdup.bam.bai; Input files updated by another job: Sambamba/SRR17642783.markdup.bam wildcards: sample=SRR17642783 resources: tmpdir=/tmp

Warning: the following output files of rule indexMarkDupes were not present when the DAG was created: {'Sambamba/SRR17642783.markdup.bam.bai'} Activating conda environment: ../../miniconda3/envs/081ae7e483778bba002ffd0e803c7dac [Fri May 26 10:22:17 2023] Finished job 9. 9 of 21 steps (43%) done Select jobs to execute...

[Fri May 26 10:22:17 2023] rule link_deduped_bam: input: Sambamba/SRR17642783.markdup.bam, Sambamba/SRR17642783.markdup.bam.bai output: filtered_bam/SRR17642783.filtered.bam, filtered_bam/SRR17642783.filtered.bam.bai jobid: 1 reason: Missing output files: filtered_bam/SRR17642783.filtered.bam.bai, filtered_bam/SRR17642783.filtered.bam; Input files updated by another job: Sambamba/SRR17642783.markdup.bam, Sambamba/SRR17642783.markdup.bam.bai wildcards: sample=SRR17642783 resources: tmpdir=/tmp

[Fri May 26 10:22:17 2023] Finished job 1. 10 of 21 steps (48%) done Select jobs to execute...

[Fri May 26 10:22:18 2023] rule calc_Mbias: input: filtered_bam/SRR17642783.filtered.bam, filtered_bam/SRR17642783.filtered.bam.bai output: QC_metrics/SRR17642783.Mbias.txt log: QC_metrics/logs/SRR17642783.calc_Mbias.out jobid: 15 reason: Missing output files: QC_metrics/SRR17642783.Mbias.txt; Input files updated by another job: filtered_bam/SRR17642783.filtered.bam.bai, filtered_bam/SRR17642783.filtered.bam wildcards: sample=SRR17642783 threads: 10 resources: tmpdir=/tmp

Activating conda environment: ../../miniconda3/envs/e6ce94c5f4ccaba458228e8445f6d20d

[Fri May 26 10:22:18 2023] rule DepthOfCov: input: filtered_bam/SRR17642783.filtered.bam, filtered_bam/SRR17642783.filtered.bam.bai, QC_metrics/randomCpG.bed output: QC_metrics/CpGCoverage.txt, QC_metrics/CpGCoverage.png, QC_metrics/CpGCoverage.coverageMetrics.txt log: QC_metrics/logs/DepthOfCov.err jobid: 11 reason: Missing output files: QC_metrics/CpGCoverage.coverageMetrics.txt, QC_metrics/CpGCoverage.txt, QC_metrics/CpGCoverage.png; Input files updated by another job: filtered_bam/SRR17642783.filtered.bam.bai, QC_metrics/randomCpG.bed, filtered_bam/SRR17642783.filtered.bam threads: 20 resources: tmpdir=/tmp

Activating conda environment: ../../miniconda3/envs/081ae7e483778bba002ffd0e803c7dac [Fri May 26 10:27:33 2023] Finished job 15. 11 of 21 steps (52%) done Select jobs to execute...

[Fri May 26 10:27:33 2023] rule calcCHHbias: input: filtered_bam/SRR17642783.filtered.bam, filtered_bam/SRR17642783.filtered.bam.bai output: QC_metrics/SRR17642783.CHH.Mbias.txt log: QC_metrics/logs/SRR17642783.calcCHHbias.err jobid: 17 reason: Missing output files: QC_metrics/SRR17642783.CHH.Mbias.txt; Input files updated by another job: filtered_bam/SRR17642783.filtered.bam.bai, filtered_bam/SRR17642783.filtered.bam wildcards: sample=SRR17642783 threads: 10 resources: tmpdir=/tmp

Activating conda environment: ../../miniconda3/envs/e6ce94c5f4ccaba458228e8445f6d20d [Fri May 26 10:29:01 2023] Finished job 11. 12 of 21 steps (57%) done Removing temporary output QC_metrics/randomCpG.bed. Select jobs to execute...

[Fri May 26 10:29:01 2023] rule DepthOfCovGenome: input: filtered_bam/SRR17642783.filtered.bam, filtered_bam/SRR17642783.filtered.bam.bai output: QC_metrics/genomeCoverage.txt, QC_metrics/genomeCoverage.png, QC_metrics/genomeCoverage.coverageMetrics.txt log: QC_metrics/logs/DepthOfCovGenome.err jobid: 10 reason: Missing output files: QC_metrics/genomeCoverage.coverageMetrics.txt, QC_metrics/genomeCoverage.png, QC_metrics/genomeCoverage.txt; Input files updated by another job: filtered_bam/SRR17642783.filtered.bam.bai, filtered_bam/SRR17642783.filtered.bam threads: 20 resources: tmpdir=/tmp

Activating conda environment: ../../miniconda3/envs/081ae7e483778bba002ffd0e803c7dac [Fri May 26 10:32:40 2023] Finished job 10. 13 of 21 steps (62%) done Select jobs to execute...

[Fri May 26 10:32:40 2023] rule get_flagstat: input: filtered_bam/SRR17642783.filtered.bam output: QC_metrics/SRR17642783.flagstat log: QC_metrics/logs/SRR17642783.get_flagstat.err jobid: 18 reason: Missing output files: QC_metrics/SRR17642783.flagstat; Input files updated by another job: filtered_bam/SRR17642783.filtered.bam wildcards: sample=SRR17642783 resources: tmpdir=/tmp

Activating conda environment: ../../miniconda3/envs/081ae7e483778bba002ffd0e803c7dac [Fri May 26 10:32:41 2023] rule methyl_extract: input: filtered_bam/SRR17642783.filtered.bam, filtered_bam/SRR17642783.filtered.bam.bai, QC_metrics/SRR17642783.Mbias.txt output: MethylDackel/SRR17642783_CpG.bedGraph log: MethylDackel/logs/SRR17642783.methyl_extract.err, MethylDackel/logs/SRR17642783.methyl_extract.out jobid: 14 reason: Missing output files: MethylDackel/SRR17642783_CpG.bedGraph; Input files updated by another job: filtered_bam/SRR17642783.filtered.bam.bai, QC_metrics/SRR17642783.Mbias.txt, filtered_bam/SRR17642783.filtered.bam wildcards: sample=SRR17642783 threads: 10 resources: tmpdir=/tmp

Activating conda environment: ../../miniconda3/envs/e6ce94c5f4ccaba458228e8445f6d20d [Fri May 26 10:39:04 2023] Finished job 14. 14 of 21 steps (67%) done Select jobs to execute...

[Fri May 26 10:39:04 2023] rule bedGraphToBigWig: input: MethylDackel/SRR17642783_CpG.bedGraph, /home/user/genome/GRCh38_gencode_release29/genome_fasta/genome.fa.fai output: MethylDackel/SRR17642783_CpG.methylation.bw, MethylDackel/SRR17642783_CpG.coverage.bw log: MethylDackel/logs/SRR17642783_bedGraphToBigWig.stderr jobid: 20 reason: Missing output files: MethylDackel/SRR17642783_CpG.methylation.bw, MethylDackel/SRR17642783_CpG.coverage.bw; Input files updated by another job: MethylDackel/SRR17642783_CpG.bedGraph wildcards: sample=SRR17642783 resources: tmpdir=/tmp

Activating conda environment: ../../miniconda3/envs/081ae7e483778bba002ffd0e803c7dac [Fri May 26 10:39:43 2023] Finished job 17. 15 of 21 steps (71%) done Select jobs to execute...

[Fri May 26 10:39:43 2023] rule conversionRate: input: QC_metrics/SRR17642783.CHH.Mbias.txt output: QC_metrics/SRR17642783.conv.rate.txt log: QC_metrics/logs/SRR17642783.conversionRate.log jobid: 16 reason: Missing output files: QC_metrics/SRR17642783.conv.rate.txt; Input files updated by another job: QC_metrics/SRR17642783.CHH.Mbias.txt wildcards: sample=SRR17642783 resources: tmpdir=/tmp

[Fri May 26 10:39:43 2023] Finished job 16. 16 of 21 steps (76%) done Removing temporary output QC_metrics/SRR17642783.CHH.Mbias.txt. [Fri May 26 10:43:20 2023] Finished job 20. 17 of 21 steps (81%) done [Fri May 26 10:45:34 2023] Finished job 18. 18 of 21 steps (86%) done Select jobs to execute...

[Fri May 26 10:45:34 2023] rule multiQC: input: QC_metrics/SRR17642783.flagstat output: multiQC/multiqc_report.html log: multiQC/multiQC.out, multiQC/multiQC.err jobid: 19 reason: Missing output files: multiQC/multiqc_report.html; Input files updated by another job: QC_metrics/SRR17642783.flagstat resources: tmpdir=/tmp

Activating conda environment: ../../miniconda3/envs/081ae7e483778bba002ffd0e803c7dac

[Fri May 26 10:45:34 2023] rule produceReport: input: MethylDackel/SRR17642783_CpG.bedGraph, QC_metrics/genomeCoverage.txt, QC_metrics/genomeCoverage.png, QC_metrics/genomeCoverage.coverageMetrics.txt, QC_metrics/CpGCoverage.txt, QC_metrics/CpGCoverage.png, QC_metrics/CpGCoverage.coverageMetrics.txt, QC_metrics/SRR17642783.conv.rate.txt, QC_metrics/SRR17642783.Mbias.txt, QC_metrics/SRR17642783.flagstat output: QC_metrics/QC_report.html jobid: 13 reason: Missing output files: QC_metrics/QC_report.html; Input files updated by another job: QC_metrics/genomeCoverage.png, QC_metrics/CpGCoverage.coverageMetrics.txt, QC_metrics/genomeCoverage.coverageMetrics.txt, QC_metrics/CpGCoverage.txt, QC_metrics/SRR17642783.conv.rate.txt, QC_metrics/SRR17642783.Mbias.txt, QC_metrics/CpGCoverage.png, MethylDackel/SRR17642783_CpG.bedGraph, QC_metrics/SRR17642783.flagstat, QC_metrics/genomeCoverage.txt resources: tmpdir=/tmp

Activating conda environment: ../../miniconda3/envs/e6ce94c5f4ccaba458228e8445f6d20d [Fri May 26 10:45:42 2023] Finished job 19. 19 of 21 steps (90%) done Quitting from lines 197-212 (tmpwvy1gkwq.WGBS_QC_report_template.Rmd) Error in rep(1, nrow(V)) : invalid 'times' argument Calls: ... eval -> PCA -> svd.triplet -> as.vector -> crossprod Execution halted [Fri May 26 10:47:49 2023] Error in rule produceReport: jobid: 13 input: MethylDackel/SRR17642783_CpG.bedGraph, QC_metrics/genomeCoverage.txt, QC_metrics/genomeCoverage.png, QC_metrics/genomeCoverage.coverageMetrics.txt, QC_metrics/CpGCoverage.txt, QC_metrics/CpGCoverage.png, QC_metrics/CpGCoverage.coverageMetrics.txt, QC_metrics/SRR17642783.conv.rate.txt, QC_metrics/SRR17642783.Mbias.txt, QC_metrics/SRR17642783.flagstat output: QC_metrics/QC_report.html conda-env: /home/user/miniconda3/envs/e6ce94c5f4ccaba458228e8445f6d20d

RuleException: CalledProcessError in line 302 of /home/user/miniconda3/envs/snakePipes/lib/python3.11/site-packages/snakePipes/shared/rules/WGBS.snakefile: Command 'source /home/user/miniconda3/bin/activate '/home/user/miniconda3/envs/e6ce94c5f4ccaba458228e8445f6d20d'; set -euo pipefail; Rscript --vanilla -e 'rmarkdown::render("/home/user/230523bisulfite_test/output/.snakemake/scripts/tmpwvy1gkwq.WGBS_QC_report_template.Rmd", output_file="/home/user/230523bisulfite_test/output/QC_metrics/QC_report.html", quiet=TRUE, knit_root_dir = "/home/user/230523bisulfite_test/output", params = list(rmd="/home/user/230523bisulfite_test/output/.snakemake/scripts/tmpwvy1gkwq.WGBS_QC_report_template.Rmd"))'' returned non-zero exit status 1. File "/home/user/miniconda3/envs/snakePipes/lib/python3.11/site-packages/snakePipes/shared/rules/WGBS.snakefile", line 302, in __rule_produceReport File "/home/user/miniconda3/envs/snakePipes/lib/python3.11/concurrent/futures/thread.py", line 58, in run Exiting because a job execution failed. Look above for error message Complete log: .snakemake/log/2023-05-24T191728.932387.snakemake.log

!!! ERROR in WGBS workflow! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

katsikora commented 1 year ago

Hi @kaznt,

thanks for submitting your error log. I cannot reproduce your issue with snakePipes 2.7.3 on our WGBS test data from zenodo.

Could you try rerunning with that test dataset?

Best wishes,

Katarzyna

kaznt commented 1 year ago

Hi @katsikora,

Thank you for your response. I've tested the samples you mentioned in zenodo, and yes, it passed all analysis without any problem, indicating that the issue is derived from the samples I throw-in. I used publically available bisulfite sequence dataset from SRA, https://www.ncbi.nlm.nih.gov/sra/?term=SRR17642783 Do you have any idea why it fails to in PCA analysis?

Also, I was wondering it is normal or not, but the mapping rate in bisulfite sequence analysis is incredibly poor, such like 0.8% to 1%. I'm not familier with bisulfite analysis, so if it is usual, then it is OK, but if it is not expected result, how I can improve the mapping rate?

Thank you. Best regards,

katsikora commented 1 year ago

Hi Kaznt,

thanks for checking on our sample data. If the mapping rates are so low, are you sure you're mapping to the expected organism? Also, did you dump the files with --split-files ?

Best wishes,

Katarzyna

kaznt commented 1 year ago

Hi @katsikora,

I confirmed the species and re-run with GRCm38 dataset I downloaded from the site here (https://zenodo.org/record/4468065). Not sure what does --split-files mean, but I listed the md5 hash below.

ff817f24b945f9bd7cd5558b213b8679 SRX202087_R1.fastq.gz 05998326f8083072b1ca3db14b10e976 SRX202087_R2.fastq.gz 3337cf4890af2944430cacd6bf90304a SRX202088_R1.fastq.gz c9c63d241ba20431edbde8716b2eb209 SRX202088_R2.fastq.gz d4b1eb33c1a10ab245bce2ba504c3fd4 SRX271141_R1.fastq.gz dfd414db40067ef4e3f2d33045c79dea SRX271141_R2.fastq.gz d2b31698889055fd49ee131c4161095c SRX271142_R1.fastq.gz 5fdbb6cc94b3d3e4cb5f17364c239d76 SRX271142_R2.fastq.gz

4e004c5032ade9aa37cc795890051f7c genome.fa (GRCm38)

I attached the map rate.

スクリーンショット 2023-06-08 10 54 04

Command I issue is below. $ WGBS -i input -o output GRCm38_gencode_release19 --local -j 30

Best regards,

kaznt commented 1 year ago

Additionally, when I mapped manually by using bwameth with SRX202087 fastq files above, map rate is >99% indicating that the dataset is fine. I wonder if the pipeline refers genome.fa instead of genome.fa.bwameth.c2t

Best,

kaznt commented 1 year ago

I've checked bwameth log. Seems reffering c2t fa though...


Found BWA MEM index running: /home/user/miniconda3/envs/e6ce94c5f4ccaba458228e8445f6d20d/bin/python /home/user/miniconda3/envs/e6ce94c5f4ccaba458228e8445f6d20d/bin/bwameth.py c2t FASTQ/SRX202087_R1.fastq.gz FASTQ/SRX202087_R2.fastq.gz |bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -R '@RG\tID:SRX202087_R\tSM:SRX202087_R' -t 20 /home/user/genome/GRCm38_gencode_release19/BWAmethIndex/genome.fa.bwameth.c2t /dev/stdin

converting reads in FASTQ/SRX202087_R1.fastq.gz,FASTQ/SRX202087_R2.fastq.gz [M::bwa_idx_load_from_disk] read 0 ALT contigs WARNING: 20833 reads with length < 80 : this program is designed for long reads [M::process] read 549664 sequences (53699076 bp)... [M::process] 0 single-end sequences; 549664 paired-end sequences [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 217688, 2, 0) [M::mem_pestat] skip orientation FF as there are not enough pairs [M::mem_pestat] analyzing insert size distribution for orientation FR... [M::mem_pestat] (25, 50, 75) percentile: (162, 186, 214) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (58, 318) [M::mem_pestat] mean and std.dev: (189.55, 39.70) [M::mem_pestat] low and high boundaries for proper pairs: (6, 370) [M::mem_pestat] skip orientation RF as there are not enough pairs [M::mem_pestat] skip orientation RR as there are not enough pairs [M::mem_process_seqs] Processed 549664 reads in 308.283 CPU sec, 15.830 real sec [main] Version: 0.7.17-r1188 [main] CMD: bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -R @RG\tID:SRX202087_R\tSM:SRX202087_R -t 20 /home/user/genome/GRCm38_gencode_release19/BWAmethIndex/genome.fa.bwameth.c2t /dev/stdin [main] Real time: 33.293 sec; CPU: 314.966 sec se1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [W::sam_parse1] unrecognized reference name ""; treated as unmapped [bam_sort_core] merging from 0 files and 4 in-memory blocks...

Best,

katsikora commented 1 year ago

Hi Kaznt,

thanks for looking into the details. Indeed it looks like a potential issue with bwameth. Either it's the parsing of chromosome names by bwameth (as samtools is complaining about empty strings in chromosome names), or indeed the reads were unmapped to start with. Can you paste the command that you used to map manually and that returned high mapping rates? Did you use the same reference index and the same bwameth version as for snakePipes? I'll try to reproduce the error on your dataset then.

Best wishes,

Katarzyna

kaznt commented 1 year ago

Hi katsikora,

Sorry for a bit delay for my response. I paste the comand and output below.

<As I shown below, I use independently prepared bwameth via conda> conda create -n bwameth -c bioconda bwameth conda activate bwameth bwameth.py --version bwa-meth.py 0.2.7

ln -s ../input/ . ### It is preparing input fastq files as to use same files of input for snakepipes ln -s ~/data/genome/GRCm38_gencode_release19/BWAmethIndex/ . ### It is preparing same reference and index I download from zenodo server bwameth.py --reference genome.fa SRX202087_R1.fastq.gz SRX202087_R2.fastq.gz > output.sam samtools view -Sb output.sam > output.bam samtools sort output.bam > output.sort.bam samtools flagstat output.sort.bam

533342 + 16516 in total (QC-passed reads + QC-failed reads) 533192 + 16472 primary 150 + 44 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 primary duplicates 533059 + 16516 mapped (99.95% : 100.00%) 532909 + 16472 primary mapped (99.95% : 100.00%) 533192 + 16472 paired in sequencing 266596 + 8236 read1 266596 + 8236 read2 531192 + 0 properly paired (99.62% : 0.00%) 532666 + 16472 with itself and mate mapped 243 + 0 singletons (0.05% : 0.00%) 22 + 2 with mate mapped to a different chr 9 + 1 with mate mapped to a different chr (mapQ>=5)

Best regards,

katsikora commented 1 year ago

Hi Kaznt,

thanks for providing your command, it all looks good.

I rerun WGBS with the dataset you pointed to (https://www.ncbi.nlm.nih.gov/sra/?term=SRR17642783 ). I aligned to the human genome and obtained a mapping rate of 99.54%. In the case where you observed the 1% mapping rates, there is definitely something wrong. My vote would go either to the reads (library prep not WGBS?), or then perhaps the genome doesn't match, or there's something about the bwameth index. I'll double check the genome index we're providing on zenodo.

Independently of that, I could reproduce your original error with the produce report rule:

Quitting from lines 197-212 [unnamed-chunk-10] (tmpqedjdto9.WGBS_QC_report_template.Rmd)
Error in `rep()`:
! invalid 'times' argument
Backtrace:
 1. FactoMineR::PCA(t(methyl[IDX, ]), graph = FALSE)
 2. FactoMineR::svd.triplet(X, row.w = row.w, col.w = col.w, ncp = ncp)
 4. base::crossprod(rep(1, nrow(V)), as.matrix(V))

I suspect it's because I only submitted one sample and then the PCA plot (understandably) failed. I'll add some code that will skip PCA when there is only one sample, but the other QC metrics will be returned.

I hope this helps,

Best wishes,

Katarzyna

kaznt commented 1 year ago

Thank you for your help.

I'll keep this topic open until when you check the zenodo reference.

Best,

katsikora commented 1 year ago

Hi Kaznt,

I get

142236831 + 6645913 mapped (94.89% : 99.96%)

when mapping FASTQ/SRR610039_R1.fastq.gz, FASTQ/SRR610039_R2.fastq.gz to our GRCm38_gencode_release19 from zenodo.

In that sense, I cannot reproduce your observation related to mapping rates <1%.

Best wishes,

Katarzyna

katsikora commented 1 year ago

I'll close the issue for now, please feel free to reopen if needed.