nanoporetech / DTR-phage-pipeline

Mozilla Public License 2.0
Any possible way to get viral bins from an assembled fasta? #7

Closed zhanwen-cheng closed 3 years ago

zhanwen-cheng commented 3 years ago

Hi jmeppley. I have a long assembled FASTA file but no sequencing summary file, and I want to get k-mer bins like the ones your pipeline produces. What should I do? Would it work if I create a virtual sequencing summary file?

jmeppley commented 3 years ago

I think so. Use the summary file in test/ as a guide. You'll only need three columns, but make sure the column headers are correct:
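A minimal sketch of how such a virtual summary file could be written from a FASTA file. The required column headers are not listed in this thread, so the header names used below are placeholders borrowed from a typical nanopore sequencing summary, and the assumption that the three columns hold an ID, a length, and a quality score is also hypothetical. Check the summary file in test/ and substitute the real headers. Contigs have no read quality, so a fixed placeholder value is written.

```python
import csv
import io


def fasta_lengths(handle):
    """Yield (sequence_id, length) for each record in a FASTA file handle."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Keep only the ID: the header text before the first whitespace.
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def write_summary(fasta_handle, out_handle, headers, placeholder_quality=10.0):
    """Write a tab-separated table with one row per FASTA record.

    `headers` must be the three column names the workflow expects;
    they are NOT known from this thread and must be supplied by the caller.
    """
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(headers)
    for name, length in fasta_lengths(fasta_handle):
        writer.writerow([name, length, placeholder_quality])


# Placeholder header names (an assumption, not taken from the pipeline).
HEADERS = ("read_id", "sequence_length_template", "mean_qscore_template")

fasta = io.StringIO(">contig1 len=8\nACGTACGT\n>contig2\nACG\nTT\n")
out = io.StringIO()
write_summary(fasta, out, HEADERS)
print(out.getvalue())
```

For real data, replace the StringIO objects with open file handles for the contig FASTA and the output summary.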

zhanwen-cheng commented 3 years ago

Hi jmeppley, I made a virtual sequencing summary (attached below as sequencing.summary.txt) from my assembled scaffolds (yca23.contigs.fasta.txt) and set up my config file (config.yml.txt). I ran the command nohup snakemake --use-conda --conda-frontend mamba -p -r -j 40 > yca23_assem.log, and it reported an error when it reached "python /home/chengzw/software/DTR-phage-pipeline-mamba/scripts/kmer_freq.py -c -k 5 -t 40 /home/chengzw/CZW_disk/groundwater_nanopore/dtr_output/yca_23_assem/1D/v2/dtr_reads/output.dtr.fasta > /home/chengzw/CZW_disk/groundwater_nanopore/dtr_output/yca_23_assem/1D/v2/kmer_binning/kmer_comp.tmp". It seems that no DTR sequences were found in my case (shown in the picture below). Can you help me check that? Have I set something wrong? I have attached my config.yml, assembled FASTA file, sequencing summary file, and log file below. Thanks!

Attachments: err, sequencing.summary.txt, yca23.contigs.fasta.txt, config.yml.txt, yca23_assem.log

jmeppley commented 3 years ago

I missed the part where you are starting with assembled contigs. This workflow is expressly designed for unassembled long reads. I fully expect it to fail in at least two different ways if run on contigs.

First, DTRs (direct terminal repeats) are usually lost in assemblies because the assemblers can't distinguish between the two identical ends. I think this is what happened to you. The workflow simply found 0 DTRs in your contigs.
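To illustrate the point, here is a simplified sketch (not the workflow's actual DTR detection, which is not shown in this thread) that looks for an exact repeat shared by the start and end of a sequence. The function name and the min_len and max_len parameters are hypothetical. Real nanopore reads contain errors, so a practical check would need alignment rather than exact matching.

```python
def terminal_repeat_length(seq, min_len=10, max_len=None):
    """Return the length of the longest prefix of `seq` that is repeated
    exactly as its suffix (without overlapping), or 0 if no such repeat
    of at least `min_len` bases exists."""
    seq = seq.upper()
    limit = len(seq) // 2  # prefix and suffix must not overlap
    if max_len is not None:
        limit = min(limit, max_len)
    # Try the longest candidate first so the first hit is the longest repeat.
    for n in range(limit, min_len - 1, -1):
        if seq[:n] == seq[-n:]:
            return n
    return 0


repeat = "ACGTTGCAAGCT"            # toy 12-base terminal repeat
body = "GGGGGGGGGGTTTTTCCCCC"      # toy genome interior

full_length_read = repeat + body + repeat  # repeat present at both ends
collapsed_contig = body + repeat           # assembler kept only one copy

print(terminal_repeat_length(full_length_read))
print(terminal_repeat_length(collapsed_contig))
```

The first sequence carries the repeat at both ends and is reported as having a terminal repeat. The second, where the two copies were merged into one, is not, which is the situation expected for most assembled contigs.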

Second, and more importantly, the workflow needs to find at least 10 reads that span the full length of the same genome. Although you can lower that cutoff a little, it needs to be significantly more than 1. Assembled data, by definition, is expected to contain no duplicates of any one genome.

However, if all you want is to bin sequences by k-mer counts, you can do that. The workflow will most likely fail to produce any polished clusters from the bins, but it will count k-mers and attempt to bin your sequences. Just skip the DTR detection by setting:

pre_filter: None

in config.yml.
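For a sense of what "counting k-mers" means here, the following is a minimal sketch of a normalized 5-mer frequency vector for one sequence (k = 5 matches the -k 5 option in the command above). It is an illustration only and is not claimed to reproduce what scripts/kmer_freq.py does; the function name is hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=5):
    """Return the relative frequency of every possible DNA k-mer in `seq`,
    in lexicographic order (AAAAA, AAAAC, ...). K-mers containing
    characters other than A, C, G, T are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    if total == 0:
        # Sequence shorter than k, or no valid k-mers at all.
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


vector = kmer_frequencies("ACGTACGTACGTTTGACCA")
print(len(vector))  # one entry per possible 5-mer
```

Sequences with similar frequency vectors tend to come from similar genomes, which is what makes this kind of vector usable as input to a binning step.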

zhanwen-cheng commented 3 years ago

Thanks! Actually, I want to get the DTR-containing bins. The assembled FASTA here comes from a nanopore-sequenced, concentrated environmental viral sample (lots of short nanopore reads with a short N50, so I used canu to assemble them), so I would prefer to keep pre_filter on. Do you have any suggestions if I still want to try your pipeline with my short nanopore reads (either the unassembled reads or the assembled scaffolds)? I remember that a previous version gave step-by-step commands (maybe not fully workable). Can I get those commands from you? Thanks~

jmeppley commented 3 years ago

Based on the run log and empty output file, the workflow found no DTRs in your contigs.

I think the only thing left to try with our workflow is to adjust the DTR detection parameters.

I don't think it will change your results, but you can run the workflow in stages by using the partial rules found at the end of the main Snakefile. These are still supported in the current Snakefile, but they are not documented. The instructions for running the pipeline in stages can be found in an old version of the README.

zhanwen-cheng commented 3 years ago

OK, thanks a lot~