Closed hanrong498 closed 5 months ago
I think the problem is that the reference.fa is missing. I am trying to manually re-write the organism.yaml file
$ cat /work/upzenk/miniconda3/envs/snakePipes/lib/python3.11/site-packages/snakePipes/shared/organisms/GRCh38_v39_lambda_spikein.yaml
blacklist_bed: ''
bowtie2_index: /scratch/hhu/index_snakepipes/human_GRCh38_v39_lambda_spikein/BowtieIndex/genome
bwa_index: /scratch/hhu/index_snakepipes/human_GRCh38_v39_lambda_spikein/BWAIndex/genome.fa
bwa_mem2_index: /scratch/hhu/index_snakepipes/human_GRCh38_v39_lambda_spikein/BWA-MEM2Index/genome.fa
bwameth2_index: /scratch/hhu/index_snakepipes/human_GRCh38_v39_lambda_spikein/BWAmeth2Index/genome.fa
bwameth_index: /scratch/hhu/index_snakepipes/human_GRCh38_v39_lambda_spikein/BWAmethIndex/genome.fa
All the indexes were generated using the createIndices
pipeline. I have a naive question: among all the files in the *Index/
folder, do we always put the /genome.fa
here? And if I understand it correctly, /genome.fa here is actually a symlink to genome_fasta/genome.fa
?
Hi,
Here I fixed the reference genome issue but still have problems running the WGBS pipeline. In the .err
, the bwameth was run multiple times and the mapped files seem to not have a reference name:
[M::process] 0 single-end sequences; 1818182 paired-end sequences
[M::process] read 1818182 sequences (120000012 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (36, 178823, 11466, 10)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (117, 362, 1187)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3327)
[M::mem_pestat] mean and std.dev: (635.83, 645.35)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 4397)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (83, 119, 169)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 341)
[M::mem_pestat] mean and std.dev: (126.55, 60.59)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 427)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (113, 314, 722)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1940)
[M::mem_pestat] mean and std.dev: (437.26, 432.76)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 2549)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (119, 633, 788)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 2126)
[M::mem_pestat] mean and std.dev: (557.10, 407.29)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 2795)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RR
[M::mem_process_seqs] Processed 1818182 reads in 5503.742 CPU sec, 773.964 real sec
[M::process] 0 single-end sequences; 1818182 paired-end sequences
[M::process] read 1818182 sequences (120000012 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (33, 178223, 11445, 7)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (109, 216, 777)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 2113)
[M::mem_pestat] mean and std.dev: (373.16, 372.35)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 2781)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (83, 119, 169)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 341)
[M::mem_pestat] mean and std.dev: (126.27, 60.50)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 427)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (109, 309, 722)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1948)
[M::mem_pestat] mean and std.dev: (437.98, 439.44)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 2561)
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_pestat] skip orientation FF
rse1] unrecognized reference name ""; treated as unmapped
[W::sam_parse1] unrecognized reference name ""; treated as unmapped
[W::sam_parse1] unrecognized reference name ""; treated as unmapped
[W::sam_parse1] unrecognized reference name ""; treated as unmapped
[W::sam_parse1] unrecognized reference name ""; treated as unmapped
[W::sam_parse1] unrecognized reference name ""; treated as unmapped
This is not the complete log. The bwamem was executed > 10 times and there are more [W::sam_parse1] unrecognized reference name ""; treated as unmapped
message below.
Can you help me with it? Thanks!
This issue of [W::sam_parse1] unrecognized reference name ""; treated as unmapped
from bwameth.py
was mentioned here
As I have just checked, the reference I generated from bwa was also rchr1 1
, which probably makes it unrecognisable.
$ head genome.fa.bwameth.c2t
>rchr1 1
Same issue was mentioned previously #898
Maybe there is a way to work around it?
Hi @hanrong498,
thanks for looking into the details of the issue. Indeed, I've seen #898, and I wasn't able to reproduce it here. Unless something got messed up with bwa-meth or another tool on conda since we've built and tested the release. Did you build your conda environments for snakePipes recently? I can try if I can reproduce it with a freshly made conda env.
Best wishes,
Katarzyna
Hi Katarzyna,
Thank you for the reply!
Yes I recently build the conda environments for snakePipes (~2 weeks ago).
I first ran the createIndices pipeline for a human_lambda hybrid genome. At first I thought this rchr1 1
was caused by running bwameth
so recently. However, when I checked the pre-made indices provided in zenodo, the genome.fa.bwameth.c2t
also started with rchr1 1
. I have not yet tried running WGBS with the pre-made indices yet.
I also saw this issue running bwameth.py
mentioned in bwa-meth, but I think it has been solved.
Lastly, I thought the unrecognised reference name
is what made WGBS pipeline 'stucked'. However, when I subset my original data to 1000 reads, this process can actually proceed, even though I still got unrecognised reference name
error in the log file.
PS: when running the subsetted data, I encountered an error in produceReport and I reported it yesterday #938 .
Thanks a lot! Hanrong
Hi Hanrong,
if your pipeline was getting stuck with the full read set but is running with a subset of reads, you might need to assign more memory to the jobs. Are you running on a cluster or with --local
?
Best wishes,
Katarzyna
Hi Katarzyna,
I am running it on one compute node on the cluster. Indeed in the first run, when I used the full read set, I did not assign enough memory. I will try again with this configuration:
#SBATCH --nodes 1
#SBATCH --ntasks 60
#SBATCH --mem 400G
In this node, there are ~7GB ram per core. Then in the pipeline, I will set --local -j 60
. I hope this could finish the run :)
But as I am new to WGBS analysis, I wonder if this unrecognised reference name
error, even though will not kill the process, can cause any downstream analysis problem?
Thanks & best regards! Hanrong
Hi Katarzyna,
With this new mem allocation I can finish the bwa mem.
Additionally, I will put here a small issue I had in sambamba-markdup:
sambamba-markdup: Cannot open file `/tmp//snakepipes.QZfyXB38ey/211029_1_FZ_WGBS_HOIK_p35_rep1_S1_L001/sambamba-pid1654661-markdup-jlul/PairedEndsInfoftfn19' in mode `w+' (Too many open files)
This can be solved by changing ulimit -n <large number>
as suggested here.
However I encountered the same problem ! invalid 'times' argument
as #898 . Specifically, I have the same PCA error:
Quitting from lines 197-212 [unnamed-chunk-10] (tmpxprjy2ka.WGBS_QC_report_template.Rmd)
Error in `rep()`:
! invalid 'times' argument
Backtrace:
1. FactoMineR::PCA(t(methyl[IDX, ]), graph = FALSE)
2. FactoMineR::svd.triplet(X, row.w = row.w, col.w = col.w, ncp = ncp)
4. base::crossprod(rep(1, nrow(V)), as.matrix(V))
I am running on one sample, understandably I have the same issue. In the previous issue, you mentioned you add some code to skip this PCA plot. Is this implemented as an option in the snakePipes now or that I would need to change it myself?
And a related question: the next step I would indeed analyse multiple samples. My current folder structure is:
/path/to/sample1/sample1_R1.fastq.gz, sample1_R2.fastq.gz,
/path/to/sample2/sample2_R1.fastq.gz, sample2_R2.fastq.gz
To dump all samples to the pipeline, I would need to put all fastq.gz files into one folder? If so, can symlink be enough?
Thank you very much for your help! Hanrong
unrecognised reference name
Hi Hanrong,
what does the header of the bam file coming out of the mapping look like regarding the reference sequences ?
samtools view -H $bam | grep @SQ
Best wishes,
Katarzyna
Hi Katarzyna,
With this new mem allocation I can finish the bwa mem.
Additionally, I will put here a small issue I had in sambamba-markdup:
sambamba-markdup: Cannot open file `/tmp//snakepipes.QZfyXB38ey/211029_1_FZ_WGBS_HOIK_p35_rep1_S1_L001/sambamba-pid1654661-markdup-jlul/PairedEndsInfoftfn19' in mode `w+' (Too many open files)
This can be solved by changing
ulimit -n <large number>
as suggested here.However I encountered the same problem
! invalid 'times' argument
as #898 . Specifically, I have the same PCA error:Quitting from lines 197-212 [unnamed-chunk-10] (tmpxprjy2ka.WGBS_QC_report_template.Rmd) Error in `rep()`: ! invalid 'times' argument Backtrace: 1. FactoMineR::PCA(t(methyl[IDX, ]), graph = FALSE) 2. FactoMineR::svd.triplet(X, row.w = row.w, col.w = col.w, ncp = ncp) 4. base::crossprod(rep(1, nrow(V)), as.matrix(V))
I am running on one sample, understandably I have the same issue. In the previous issue, you mentioned you add some code to skip this PCA plot. Is this implemented as an option in the snakePipes now or that I would need to change it myself?
And a related question: the next step I would indeed analyse multiple samples. My current folder structure is:
/path/to/sample1/sample1_R1.fastq.gz, sample1_R2.fastq.gz, /path/to/sample2/sample2_R1.fastq.gz, sample2_R2.fastq.gz
To dump all samples to the pipeline, I would need to put all fastq.gz files into one folder? If so, can symlink be enough?
Thank you very much for your help! Hanrong
Hi Hanrong,
it looks like you're mapping large amounts of data, and sambamba chunked your large file into many small ones for marking duplicates. Then you hit the number of open files allowed by your system admin.
You can alleviate this issue by increasing the memory assigned to rule markDupes
.
The other issue you mention - invalid 'times' argument
is because you're running only one sample, and PCA cannot be computed. The fix will be available in the next release.
Can you edit this temporary file tmpxprjy2ka.WGBS_QC_report_template.Rmd
and add eval=FALSE
to the PCA Rmd section?
It now reads:
{r warning=FALSE, message=FALSE}
If you change it to:
{r warning=FALSE, message=FALSE,eval=FALSE}
the code chunk won't be executed and the report creation should run through.
Indeed, if you would like to process multiple files at the same time, it's best to symlink them into one input folder.
e.g. ln -t $my_folder_with_symlinks -s $file1 $file2 $file3 ...
Best wishes,
Katarzyna
unrecognised reference name
Hi Hanrong,
what does the header of the bam file coming out of the mapping look like regarding the reference sequences ?
samtools view -H $bam | grep @SQ
Best wishes,
Katarzyna
Hi Katarzyna,
Here are the results:
[hhu@jed Sambamba]$ samtools view 211029_1_FZ_WGBS_HOIK_p35_rep1_S1_L001.markdup.bam -H | grep @SQ
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chr10 LN:133797422
@SQ SN:chr11 LN:135086622
@SQ SN:chr12 LN:133275309
@SQ SN:chr13 LN:114364328
@SQ SN:chr14 LN:107043718
@SQ SN:chr15 LN:101991189
@SQ SN:chr16 LN:90338345
@SQ SN:chr17 LN:83257441
@SQ SN:chr18 LN:80373285
@SQ SN:chr19 LN:58617616
@SQ SN:chr20 LN:64444167
@SQ SN:chr21 LN:46709983
@SQ SN:chr22 LN:50818468
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@SQ SN:chrM LN:16569
@SQ SN:GL000008.2 LN:209709
@SQ SN:GL000009.2 LN:201709
@SQ SN:GL000194.1 LN:191469
@SQ SN:GL000195.1 LN:182896
@SQ SN:GL000205.2 LN:185591
@SQ SN:GL000208.1 LN:92689
@SQ SN:GL000213.1 LN:164239
@SQ SN:GL000214.1 LN:137718
@SQ SN:GL000216.2 LN:176608
@SQ SN:GL000218.1 LN:161147
@SQ SN:GL000219.1 LN:179198
@SQ SN:GL000220.1 LN:161802
@SQ SN:GL000221.1 LN:155397
@SQ SN:GL000224.1 LN:179693
@SQ SN:GL000225.1 LN:211173
@SQ SN:GL000226.1 LN:15008
@SQ SN:KI270302.1 LN:2274
@SQ SN:KI270303.1 LN:1942
@SQ SN:KI270304.1 LN:2165
@SQ SN:KI270305.1 LN:1472
@SQ SN:KI270310.1 LN:1201
@SQ SN:KI270311.1 LN:12399
@SQ SN:KI270312.1 LN:998
@SQ SN:KI270315.1 LN:2276
@SQ SN:KI270316.1 LN:1444
@SQ SN:KI270317.1 LN:37690
@SQ SN:KI270320.1 LN:4416
@SQ SN:KI270322.1 LN:21476
@SQ SN:KI270329.1 LN:1040
@SQ SN:KI270330.1 LN:1652
@SQ SN:KI270333.1 LN:2699
@SQ SN:KI270334.1 LN:1368
@SQ SN:KI270335.1 LN:1048
@SQ SN:KI270336.1 LN:1026
@SQ SN:KI270337.1 LN:1121
@SQ SN:KI270338.1 LN:1428
@SQ SN:KI270340.1 LN:1428
@SQ SN:KI270362.1 LN:3530
@SQ SN:KI270363.1 LN:1803
@SQ SN:KI270364.1 LN:2855
@SQ SN:KI270366.1 LN:8320
@SQ SN:KI270371.1 LN:2805
@SQ SN:KI270372.1 LN:1650
@SQ SN:KI270373.1 LN:1451
@SQ SN:KI270374.1 LN:2656
@SQ SN:KI270375.1 LN:2378
@SQ SN:KI270376.1 LN:1136
@SQ SN:KI270378.1 LN:1048
@SQ SN:KI270379.1 LN:1045
@SQ SN:KI270381.1 LN:1930
@SQ SN:KI270382.1 LN:4215
@SQ SN:KI270383.1 LN:1750
@SQ SN:KI270384.1 LN:1658
@SQ SN:KI270385.1 LN:990
@SQ SN:KI270386.1 LN:1788
@SQ SN:KI270387.1 LN:1537
@SQ SN:KI270388.1 LN:1216
@SQ SN:KI270389.1 LN:1298
@SQ SN:KI270390.1 LN:2387
@SQ SN:KI270391.1 LN:1484
@SQ SN:KI270392.1 LN:971
@SQ SN:KI270393.1 LN:1308
@SQ SN:KI270394.1 LN:970
@SQ SN:KI270395.1 LN:1143
@SQ SN:KI270396.1 LN:1880
@SQ SN:KI270411.1 LN:2646
@SQ SN:KI270412.1 LN:1179
@SQ SN:KI270414.1 LN:2489
@SQ SN:KI270417.1 LN:2043
@SQ SN:KI270418.1 LN:2145
@SQ SN:KI270419.1 LN:1029
@SQ SN:KI270420.1 LN:2321
@SQ SN:KI270422.1 LN:1445
@SQ SN:KI270423.1 LN:981
@SQ SN:KI270424.1 LN:2140
@SQ SN:KI270425.1 LN:1884
@SQ SN:KI270429.1 LN:1361
@SQ SN:KI270435.1 LN:92983
@SQ SN:KI270438.1 LN:112505
@SQ SN:KI270442.1 LN:392061
@SQ SN:KI270448.1 LN:7992
@SQ SN:KI270465.1 LN:1774
@SQ SN:KI270466.1 LN:1233
@SQ SN:KI270467.1 LN:3920
@SQ SN:KI270468.1 LN:4055
@SQ SN:KI270507.1 LN:5353
@SQ SN:KI270508.1 LN:1951
@SQ SN:KI270509.1 LN:2318
@SQ SN:KI270510.1 LN:2415
@SQ SN:KI270511.1 LN:8127
@SQ SN:KI270512.1 LN:22689
@SQ SN:KI270515.1 LN:6361
@SQ SN:KI270516.1 LN:1300
@SQ SN:KI270517.1 LN:3253
@SQ SN:KI270518.1 LN:2186
@SQ SN:KI270519.1 LN:138126
@SQ SN:KI270521.1 LN:7642
@SQ SN:KI270522.1 LN:5674
@SQ SN:KI270528.1 LN:2983
@SQ SN:KI270529.1 LN:1899
@SQ SN:KI270530.1 LN:2168
@SQ SN:KI270538.1 LN:91309
@SQ SN:KI270539.1 LN:993
@SQ SN:KI270544.1 LN:1202
@SQ SN:KI270548.1 LN:1599
@SQ SN:KI270579.1 LN:31033
@SQ SN:KI270580.1 LN:1553
@SQ SN:KI270581.1 LN:7046
@SQ SN:KI270582.1 LN:6504
@SQ SN:KI270583.1 LN:1400
@SQ SN:KI270584.1 LN:4513
@SQ SN:KI270587.1 LN:2969
@SQ SN:KI270588.1 LN:6158
@SQ SN:KI270589.1 LN:44474
@SQ SN:KI270590.1 LN:4685
@SQ SN:KI270591.1 LN:5796
@SQ SN:KI270593.1 LN:3041
@SQ SN:KI270706.1 LN:175055
@SQ SN:KI270707.1 LN:32032
@SQ SN:KI270708.1 LN:127682
@SQ SN:KI270709.1 LN:66860
@SQ SN:KI270710.1 LN:40176
@SQ SN:KI270711.1 LN:42210
@SQ SN:KI270712.1 LN:176043
@SQ SN:KI270713.1 LN:40745
@SQ SN:KI270714.1 LN:41717
@SQ SN:KI270715.1 LN:161471
@SQ SN:KI270716.1 LN:153799
@SQ SN:KI270717.1 LN:40062
@SQ SN:KI270718.1 LN:38054
@SQ SN:KI270719.1 LN:176845
@SQ SN:KI270720.1 LN:39050
@SQ SN:KI270721.1 LN:100316
@SQ SN:KI270722.1 LN:194050
@SQ SN:KI270723.1 LN:38115
@SQ SN:KI270724.1 LN:39555
@SQ SN:KI270725.1 LN:172810
@SQ SN:KI270726.1 LN:43739
@SQ SN:KI270727.1 LN:448248
@SQ SN:KI270728.1 LN:1872759
@SQ SN:KI270729.1 LN:280839
@SQ SN:KI270730.1 LN:112551
@SQ SN:KI270731.1 LN:150754
@SQ SN:KI270732.1 LN:41543
@SQ SN:KI270733.1 LN:179772
@SQ SN:KI270734.1 LN:165050
@SQ SN:KI270735.1 LN:42811
@SQ SN:KI270736.1 LN:181920
@SQ SN:KI270737.1 LN:103838
@SQ SN:KI270738.1 LN:99375
@SQ SN:KI270739.1 LN:73985
@SQ SN:KI270740.1 LN:37240
@SQ SN:KI270741.1 LN:157432
@SQ SN:KI270742.1 LN:186739
@SQ SN:KI270743.1 LN:210658
@SQ SN:KI270744.1 LN:168472
@SQ SN:KI270745.1 LN:41891
@SQ SN:KI270746.1 LN:66486
@SQ SN:KI270747.1 LN:198735
@SQ SN:KI270748.1 LN:93321
@SQ SN:KI270749.1 LN:158759
@SQ SN:KI270750.1 LN:148850
@SQ SN:KI270751.1 LN:150742
@SQ SN:KI270752.1 LN:27745
@SQ SN:KI270753.1 LN:62944
@SQ SN:KI270754.1 LN:40191
@SQ SN:KI270755.1 LN:36723
@SQ SN:KI270756.1 LN:79590
@SQ SN:KI270757.1 LN:71251
@SQ SN:J02459.1_spikein LN:48502
Best, Hanrong
The other issue you mention -
invalid 'times' argument
is because you're running only one sample, and PCA cannot be computed. The fix will be available in the next release. Can you edit this temporary filetmpxprjy2ka.WGBS_QC_report_template.Rmd
and addeval=FALSE
to the PCA Rmd section? It now reads:{r warning=FALSE, message=FALSE}
If you change it to:{r warning=FALSE, message=FALSE,eval=FALSE}
the code chunk won't be executed and the report creation should run through.
Hi Katarzyna,
Thank you for the tip! And sorry for my late reply regarding this issue. I cannot find this tmpxprjy2ka.WGBS_QC_report_template.Rmd
file.
I tried here /scratch/hhu/20220330_P1886_FIDES_DNA_METHYLATION/merged_processed/.snakemake/scripts/tmpxprjy2ka.WGBS_QC_report_template.Rmd
but this /scripts/
folder is empty. Maybe I can change from the source code?
Thanks! Hanrong
Hi Hanrong,
sure, you can also modify the source file. The updated Rmd is now available on the develop
branch of snakePipes .
You should be able to find the "source" Rmd under:
$github_repo_dir/snakePipes/shared/rscripts/WGBS_QC_report_template.Rmd
and replace it e.g. with the one from the develop branch.
Hope this helps, Best,
Katarzyna
This should now be fixed in snakePipes 2.8.0 and 2.8.1.
Hi!
Sorry for the spamming :)
I am trying to run the WGBS pipeline but got the error in
bwameth/logs/211029_1_FZ_WGBS_HOIK_p35_rep1_S1_L001.map_reads.err
:The shell command from
WGBS_run
iswhen I activate this env I have the samtools:
Can you help me with that? Thanks!