Closed: zhanwen-cheng closed this issue 4 years ago
Thanks, @zhanwen-cheng, for including the logs. That's always helpful.
Before I answer your specific questions, I should point out that the instructions in the README are out of date. I modified the original workflow to be runnable in one step, but I forgot to update the instructions. You should be able to run the whole workflow in one command with:
snakemake --use-conda -p -j <nproc> -r
However, running in pieces SHOULD still work, so let's get into your questions:
First of all, I could execute the step 1 command, but it seemed to fail to create the UMAP kmer freq map in "5 of 11 steps", with the other 10 steps running successfully.
This is a warning that can be ignored. The message starts with:
/home/chengzw/software/DTR-phage-pipeline/.snakemake/conda/5443775e/lib/python3.6/site-packages/umap/umap_.py:349: NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function "fuzzy_simplicial_set" failed type inference due to: Untyped global name '
The failed method probably would have been faster, but it seems to have worked OK.
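If the warning noise bothers you, Python's warnings module can filter out a whole category in your own scripts. A minimal stdlib sketch (a UserWarning stands in here for Numba's NumbaWarning class, so the example runs without Numba installed):

```python
import warnings

def run_quietly():
    # Suppress a specific warning category inside a scoped context so it
    # does not clutter logs; outside the block, filters are restored.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("ignore", UserWarning)
        warnings.warn("compilation fell back to object mode", UserWarning)
        return len(caught)  # ignored warnings are not recorded

print(run_quietly())  # 0
```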
Secondly, I couldn't execute step 3 as it reported 'NameError: name 'expand_teamplte_from_bins' is not defined'. I am not sure whether this has something to do with the error in step 1.
This is the heart of your problem. There is a typo in the code path used when running in separate steps. The missing name (expand_teamplte_from_bins) is a misspelling of expand_template_from_bins. It appears on line 270 of the main Snakefile:
bin_fasta=lambda w: expand_teamplte_from_bins(w, BIN_FASTA),
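With the spelling corrected, that line would read:

```python
bin_fasta=lambda w: expand_template_from_bins(w, BIN_FASTA),
```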
Fixing the spelling there would solve this problem. You can also avoid this bug by running the whole workflow in one command. Instead of:
snakemake --use-conda -p -j <nproc> -r <rule>
... simply run it without any rule to get the whole workflow:
snakemake --use-conda -p -j <nproc> -r
Thirdly, in each reason line of the step 1 log file, there is always "reason: Missing output files:"; is this normal?
Yes, the "Missing output files" message is normal. Snakemake's "-r" flag causes it to report the reason each rule was run. The first time you run a workflow, all the output files will be missing. You can simply remove the '-r' from the snakemake command to suppress these messages.
Anyway, I'm sorry the workflow is buggy. Let us know if the fix above doesn't work for you.
OK, thank you so much for your reply; it really helped me with this issue. However, I met another issue in step 5 with "No values given for wildcard 'bin_clust_id'.". I checked the Snakefile and kmer_bins.smk file, but still couldn't figure it out. step5.log
Try running the workflow with the all rule instead of using the steps. The step-by-step approach to the workflow is buggy in its current form:
snakemake --use-conda -p -j <nproc> -r all
or just this (since the all rule is the default):
snakemake --use-conda -p -j <nproc> -r
The specific problem this time is that the step 5 rule is calling expand_template_from_bins:
input:
    ALN_CLUST_READS_COMBO,
    lambda w: expand_template_from_bins(w, BIN_CLUSTER_REF_READ_FASTA),
    lambda w: expand_template_from_bins(w, BIN_CLUSTER_POL_READS_FASTA),
    lambda w: expand_template_from_bins(w, DTR_ALIGN_COORD_PLOT),
    lambda w: expand_template_from_bins(w, BIN_CLUSTER_POLISHED_REF_PRODIGAL_TXT),
    lambda w: expand_template_from_bins(w, BIN_CLUSTER_POLISHED_REF_PRODIGAL_STATS),
    lambda w: expand_template_from_bins(w, BIN_CLUSTER_POLISHED_POL_VS_REF_STRANDS),
    lambda w: expand_template_from_bins(w, BIN_CLUSTER_POLISHED_POL_VS_REF_STRAND_ANNOTS),
while it should be calling expand_template_from_bin_clusters, as is done in the all rule:
rule all:
    input:
        . . .
        lambda w: expand_template_from_bin_clusters(w, BIN_CLUSTER_REF_READ_FASTA),
        lambda w: expand_template_from_bin_clusters(w, BIN_CLUSTER_POL_READS_FASTA),
        lambda w: expand_template_from_bin_clusters(w, DTR_ALIGN_COORD_PLOT),
        lambda w: expand_template_from_bin_clusters(w, BIN_CLUSTER_POLISHED_REF_PRODIGAL_TXT),
        lambda w: expand_template_from_bin_clusters(w, BIN_CLUSTER_POLISHED_REF_PRODIGAL_STATS),
        lambda w: expand_template_from_bin_clusters(w, BIN_CLUSTER_POLISHED_POL_VS_REF_STRANDS),
        lambda w: expand_template_from_bin_clusters(w, BIN_CLUSTER_POLISHED_POL_VS_REF_STRAND_ANNOTS),
        . . .
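For intuition, here is a hypothetical Python sketch of what the two helpers plausibly do; the names come from the pipeline, but these bodies are my assumptions, not the actual code. The per-bin helper cannot fill a template that also contains a {bin_clust_id} wildcard, which is the same kind of mismatch behind Snakemake's "No values given for wildcard 'bin_clust_id'" error:

```python
# Hypothetical sketch -- helper names from the pipeline, bodies assumed.
def expand_template_from_bins(template, bin_ids):
    # Fill the template once per bin. Raises KeyError if the template
    # also expects a {bin_clust_id} value, mirroring Snakemake's
    # "No values given for wildcard 'bin_clust_id'" failure.
    return [template.format(bin_id=b) for b in bin_ids]

def expand_template_from_bin_clusters(template, bin_clusters):
    # Fill the template once per (bin, cluster) pair.
    return [template.format(bin_id=b, bin_clust_id=c) for b, c in bin_clusters]

print(expand_template_from_bin_clusters(
    "refine_bins/alignments/{bin_id}/{bin_clust_id}.fasta", [(1, 2), (1, 3)]))
# ['refine_bins/alignments/1/2.fasta', 'refine_bins/alignments/1/3.fasta']
```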
I've updated the code and tested it both using the all rule and using the steps. Let us know if you have more issues.
It may be overly optimistic, but I'm hoping 2 days of silence means it worked for you. I'll go ahead and close this issue, but let us know if you have any more problems.
Hi jmeppley, thanks for following up on this issue; I really appreciate it! Actually I am still not familiar with Linux and Python, so it took me quite a while. With your upgraded files and test data, I re-installed the pipeline and ran your test data. Everything was fine this time and no bug was reported. However, once I tried my own data, it reported that the 'align_cluster_reads' file was missing (the attached picture shows all the files under "/home/chengzw/CZW_disk/groundwater_nanopore/dtr_output/market/1D/v2/kmer_binning/refine_bins/") in my market.log file. I also tried the step-by-step approach and it reported the same missing file in market_step5.log. Did my data fail to form align_cluster_reads (rather than an error in your pipeline), so that the pipeline stopped? Could you give me some advice on this? Another thing I am curious about: we have some assembled NGS data for a virome but no summary.txt file. If we want to run those data through this pipeline, which step should we start from, or should we run it with a hand-made summary.txt file via the all rule command? Thanks again for sharing and contributing this pipeline! market.log market_step5.log
My first guess is that the reads simply didn't cluster.
Looking at the rule message:
rule combine_bin_cluster_read_info:
input: /home/chengzw/CZW_disk/groundwater_nanopore/dtr_output/market/1D/v2/kmer_binning/refine_bins/alignments/-1/-1.clust.info.csv, /home/chengzw/CZW_disk/groundwater_nanopore/dtr_output/market/1D/v2/kmer_binning/refine_bins/alignments/0/0.clust.info.csv, /home/chengzw/CZW_disk/groundwater_nanopore/dtr_output/market/1D/v2/kmer_binning/refine_bins/alignments/1/1.clust.info.csv
output: /home/chengzw/CZW_disk/groundwater_nanopore/dtr_output/market/1D/v2/kmer_binning/refine_bins/alignments/all_bins.clust.info.csv
There are only 3 input files: two from bins (0 and 1) and one from the "leftovers" bin (-1). A successful run will often generate hundreds of bins.
The thing to look at is the per-bin cluster info file. A header-only file means no reads clustered in that bin:
$ cat kmer_binning/refine_bins/alignments/1/1.clust.info.csv
bin_id,cluster,read_id,read_len,clust_read_score,frac_max_score
A successful bin looks like this:
$ cat kmer_binning/refine_bins/alignments/1/1.clust.info.csv
bin_id,cluster,read_id,read_len,clust_read_score,frac_max_score
1,2,0a27a363-ad7a-4d55-bcf7-7f5c61a6e329,38858,14.005,0.928
1,2,1b8deba0-9c0c-4912-a7f8-1a6ecc6165ed,39652,14.115,0.928
1,2,1cf11a42-8190-49ca-8ea6-ce42b9cc8792,39077,13.928,0.928
1,2,1e2409f0-5b14-46d7-972e-8b42b4996ec6,38663,13.899,0.928
1,2,200fe7c0-c02d-4a11-ac27-8e7853debdb6,38964,14.264,0.928
1,2,340fe1c5-63c6-4a26-824c-e74af25430ad,39029,14.167,0.928
1,2,4c21734b-756d-4f1e-9af9-ab8aec1665a0,38924,14.214,0.928
1,2,54083768-24ba-4c28-a5c9-835da9f17cdb,38793,14.108,0.928
1,2,5c9b5e83-6ae0-482c-84f7-f0756ce01952,39080,14.066,0.928
1,2,5d908650-b715-4954-a717-b1f8b852a714,38986,14.343,0.928
1,2,656191a3-1be8-49b0-b18b-39eb9d644a54,38944,14.243,0.928
1,2,68453e62-a6cb-4350-a187-78f7290786e4,38893,14.038,0.928
1,2,709b4909-0b4d-4a49-9a30-2e7b563923e3,39000,14.234,0.928
1,2,7221dc31-e4b3-4d1e-8808-b9d7a9b2e9e4,38942,14.28,0.928
1,2,73e1fa53-ded1-4aff-99d2-b92ac66e849b,39227,14.375,0.928
1,2,7b46b6f0-bca8-4e25-8f8d-d7c9cd5f9a25,38884,14.217,0.928
1,2,821f855a-0eb7-458d-99ff-a51d01d4c0f9,39296,13.885,0.928
1,2,838b45f9-148e-4054-a06a-bb23e678f153,39072,14.351,0.928
1,2,8b4c32a9-335a-4f90-ab0d-8e1b9796f000,39039,14.218,0.928
1,2,92900c07-c3b5-4457-bc41-5616f50c746e,38758,14.104,0.928
1,2,9938a099-abcf-4baf-acc6-864e07f4f530,38730,13.848,0.928
1,2,a5b7f827-e35a-4775-b88d-d05bde8351e8,38957,14.177,0.928
1,2,abd813ce-e05d-45a8-b0b1-82259e119a3b,38799,14.202,0.928
1,2,afa52ff7-1426-492d-8713-04231bec8280,38348,13.766,0.928
1,2,b222d51c-56e4-4537-9649-1e7c2f909085,39593,14.044,0.928
1,2,b42bb34f-5aba-4808-b9b5-93921aced5e6,38786,13.991,0.928
1,2,c9433761-1d05-4fd1-bf49-71e8ceb613df,39325,12.401,0.928
1,2,d2c7c799-23f5-44eb-8a98-80bfefcdd8de,39182,14.114,0.928
1,2,d80f2081-cb89-4006-9f1a-69c57e0f803e,39057,14.266,0.928
1,2,e36e0c7c-d605-4108-8f56-bee7a264c61f,39121,14.371,0.928
1,2,ebf972ce-109a-474d-8153-f89a6733b58d,39117,14.35,0.928
1,2,ec9f3d06-bceb-4c0b-9d06-2e99be39ae2c,38986,14.348,0.928
1,2,eea2d630-cc8b-4988-b5c9-881c52098217,39015,14.274,0.928
1,2,f39c84cd-2dc3-489b-8985-f58f09c4b8a6,38804,13.928,0.928
1,2,f61c682a-1438-4598-b6c9-6ee9b454f4e3,39070,14.34,0.928
1,2,f917fc99-7460-4895-8ef9-929e0dfa7f42,38876,14.124,0.928
1,2,f9350cec-f9fc-48eb-8ff7-996c24d32d81,38979,14.229,0.928
1,2,f9ffa342-d94f-481d-9b90-57bbc082830a,38846,14.05,0.928
1,2,fb4fe465-d7a8-471c-b5a6-ee2bb3c8bede,38976,14.065,0.928
1,2,fd57d0f2-c81f-45c9-bdd0-784bab0e8780,39007,14.219,0.928
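To check every bin at once, a short Python sketch (the path pattern is an assumption based on the paths above) counts the data rows in each cluster info file; all zeros would mean nothing clustered:

```python
import csv
import glob
import os

def cluster_counts(pattern="kmer_binning/refine_bins/alignments/*/*.clust.info.csv"):
    # Map each per-bin cluster info file to its number of data rows
    # (header excluded). Zero rows means no reads clustered in that bin.
    counts = {}
    for path in glob.glob(pattern):
        with open(path) as fh:
            rows = list(csv.reader(fh))
        counts[os.path.basename(path)] = max(len(rows) - 1, 0)
    return counts

if __name__ == "__main__":
    for name, n in sorted(cluster_counts().items()):
        print(f"{name}: {n} clustered reads")
```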
Hi jmeppley, I ran cat on the '1.clust.info.csv' file and it's empty. Is there an alternative solution for this?
I think that's consistent with the reads simply not clustering.
Hi, thanks for sharing this DTR pipeline! I was running this pipeline with my own nanopore data and encountered several errors that might need your help. First of all, I could execute the step 1 command, but it seemed to fail to create the UMAP kmer freq map in "5 of 11 steps", with the other 10 steps running successfully. Secondly, I couldn't execute step 3 as it reported 'NameError: name 'expand_teamplte_from_bins' is not defined'. I am not sure whether this has something to do with the error in step 1. Thirdly, in each reason line of the step 1 log file, there is always "reason: Missing output files:"; is this normal? I have uploaded my log files of steps 1, 2, and 3 and my config.yml; could you help me check them? step1.log step2.log step3.log config.yml.txt