ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0
512 stars 94 forks source link

Generating lay.gz for unitigs #189

Closed fredzhou91 closed 4 years ago

fredzhou91 commented 4 years ago

Dear Jue Ruan,

I've finished the wtdbg2 and the results turned out to be good (judged by total size also N50).

However, I noticed that there are unitigs and contigs, as in the log there are lines starting with TOT after the unitigs, and Estimated: TOT after the contigs

Also, I noticed the number of contigs are dropping, and seemed that the Estimated: TOT is not corresponding to the exact contigs that survived in the ctg.lay.gz.

[] building contigs
[] searched XXX contigs
[] Estimated: TOT .....
[] output YYY contigs

YYY is smaller than XXX.

So I'm wondering whether it is possible to generate the lay.gz by taking the frg.dot.gz or 0.X.dot.gz, as I noticed there is a script file termed wtdbg-dot2gfa.pl, which seems to convert the dot files to GFA format.

(My guessing would be, there would be a script that can generates ctg.lay.gz by taking the ctg.dot.gz and information from other files, by which I may rescue some data)

Thank you very much!

Best, Fred

ruanjue commented 4 years ago
--ctg-min-length <int>
   Min length of contigs to be output, 5000
 --ctg-min-nodes <int>
   Min num of nodes in a contig to be ouput, 3

YYY is post-filtered number. Too short contigs are mostly duplicates caused by high error rate.

fredzhou91 commented 4 years ago

Oh I see, there are series of --load parameters.

Thank you!!! I''ll give a try.

Best, Fred

fredzhou91 commented 4 years ago

Hi,

I'm trying to re-do the analsyis by using the --load function

However, after I tried both loading nodes+clips, or loading the alignments, all reporting the errors for cannot find reads and ended up with empty output

 -- Cannot find read "m54084_170906_112213/56164781/ccs" in LINE:2 in load_nodes_graph -- wtdbg.h:2169 --
 -- Cannot find read "m54084_170627_173617/7602327/ccs" in LINE:3 in load_nodes_graph -- wtdbg.h:2169 --
 -- Cannot find read "m54084_170904_090439/33423798/ccs" in LINE:4 in load_nodes_graph -- wtdbg.h:2169 --
 -- Cannot find read "m54084_170728_181802/74908118/ccs" in LINE:5 in load_nodes_graph -- wtdbg.h:2169 --
 -- Cannot find read "m54084_170727_075635/41091360/ccs" in LINE:6 in load_nodes_graph -- wtdbg.h:2169 --
 -- Cannot find read "m54084_170630_060345/55509454/ccs" in LINE:7 in load_nodes_graph -- wtdbg.h:2169 --
 -- Cannot find read "m54084_170905_152245/25625500/ccs" in LINE:8 in load_nodes_graph -- wtdbg.h:2169 --
 -- Cannot find read "m54084_170519_032703/42730466/ccs" in LINE:9 in load_nodes_graph -- wtdbg.h:2169 --
 -- Cannot find read "m54084_170904_090439/38404181/ccs" in LINE:10 in load_nodes_graph -- wtdbg.h:2169 -- 

Please suggest whether there's any issue in the read names. Thank you very much!

Best, Fred

ruanjue commented 4 years ago

To load alignments, you MUST make sure the input reads are identidical to previous run. That is keeping the same -i --limit-input -L --tidy-name --rdname-filter --rdname-includeonly -X -g --rdcov-filter.

fredzhou91 commented 4 years ago

To load alignments, you MUST make sure the input reads are identidical to previous run. That is keeping the same -i --limit-input -L --tidy-name --rdname-filter --rdname-includeonly -X -g --rdcov-filter.

Dear Jue,

I followed your suggestion and still get empty results and the warnings:

 -- Cannot find read "m54084_170519_032703/31130428/ccs" in LINE:1 in load_alignments_core -- wtdbg.h:1250 --
 -- Cannot find read "m54084_170519_032703/31130428/ccs" in LINE:2 in load_alignments_core -- wtdbg.h:1250 --
 -- Cannot find read "m54084_170519_032703/31130428/ccs" in LINE:3 in load_alignments_core -- wtdbg.h:1250 --
 -- Cannot find read "m54084_170519_032703/31130428/ccs" in LINE:4 in load_alignments_core -- wtdbg.h:1250 --

Please find the code I used for previous run:

wtdbg2 \
-x sq \
-g 3g \
-i MY_ZIPPED_FASTA.gz \
-t 79 \
-p 17 \
-S 2 \
--edge-min 2 \
--rescue-low-cov-edges \
-R \
-o THE_OUTPUT \
> THE_OUTPUT.log 2>&1

I backup all output into another folder, and execute the following code to the same location where all the previous output files are still available.

1. --load-alignemnts
2. --ctg-min-length
3. --ctg-min-nodes

And here comes my code:

wtdbg2 \
-x sq \
-g 3g \
-i MY_ZIPPED_FASTA.gz \
-t 79 \
-p 17 \
-S 2 \
--edge-min 2 \
--rescue-low-cov-edges \
--load-alignments THE_.alignments.gz_produced_in_previous_run \
-R \
-o THE_SAME_OUTPUT \
--ctg-min-length 500 \
--ctg-min-nodes 1 \
> THE_NEW_OUTPUT.log 2>&1

And I compared the key parameters in the two runs, they are exactly the same:

Previous run: -k 0 -p 17 -K 1000.049988 -A -S 2.000000 -s 0.050000 -g 3000000000 -X 50.000000 -e 2 -L 5000

Re run: -k 0 -p 17 -K 1000.049988 -A -S 2.000000 -s 0.050000 -g 3000000000 -X 50.000000 -e 2 -L 5000

My only guess is, whether the "/" in the reads will affect the searching?

Thank you very much!

Best, Fred

ruanjue commented 4 years ago

wtdbg2 parses read name with regex '^(.+?)/[0-9]+_[0-9]+$' , if matched, it select the longest subread and discard the subreads's coordiates. So, please show a few reads' name from input sequences file and name in alignments.

fredzhou91 commented 4 years ago

Dear Jue,

Sorry!!!! I re-examine the files while found that the original read file has been modified.

I fixed the files and seems that everything is ok now!

It is really a fascinating tool! I will continue my work! Thank you very much!!!

Best, Fred