ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

new parameters for wtdbg2 #45

Closed: bitcometz closed this issue 6 years ago

bitcometz commented 6 years ago

Hello, I noticed there are some new parameters in the latest version of wtdbg2:

(A) nanopore/ont: -p 19 -AS 2 -s 0.05 -L 10000
    sequel/sq: -p 0 -k 15 -AS 2 -s 0.05 -L 10000

The parameter "-A" is set in both the sequel and ONT presets. As mentioned before, alignments of contained reads have little effect on the assembly results, so why do we have to keep all of these alignments?

(B) -X Choose the best depth for layout (effective with -g) [50]

Does this parameter (-X 50) mean that we choose the longest reads up to 50x depth, align only those, and then assemble? Or do we align all the reads and then choose the best 50x of alignments for assembly? How is "best" defined?

Thanks!

ruanjue commented 6 years ago

A) Heng and I are tuning the parameters so that the default presets work well across sequencing types and genome sizes. If you are comfortable with those parameters, please set them individually. The -x presets aim to provide a quick start for new users.

B) It is the first. I will change the description to ' -X Choose the best depth from input reads (effective with -g) [50]'

Jue
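
To make the first interpretation concrete, here is a minimal Python sketch of a longest-first depth filter. It is not taken from the wtdbg2 source: the function name `select_longest_to_depth` and the toy read lengths are hypothetical, and the only thing assumed from the thread is that reads are kept, longest first, until their total length reaches the -X depth times the -g genome size.

```python
def select_longest_to_depth(read_lengths, genome_size, depth=50.0):
    """Return indices of the longest reads whose total length reaches depth * genome_size.

    Hypothetical illustration of a longest-first coverage filter;
    this is an assumption about the behaviour, not wtdbg2's implementation.
    """
    target = depth * genome_size
    # Visit read indices from the longest read to the shortest.
    order = sorted(range(len(read_lengths)),
                   key=lambda i: read_lengths[i], reverse=True)
    kept, total = [], 0
    for i in order:
        if total >= target:
            break  # enough bases collected; shorter reads are dropped
        kept.append(i)
        total += read_lengths[i]
    return kept


# Toy data: a 100 bp "genome" at 3x depth needs at least 300 bp of reads.
lengths = [50, 200, 120, 80, 30]
kept = select_longest_to_depth(lengths, genome_size=100, depth=3)
print(kept, sum(lengths[i] for i in kept))
```

With a target depth higher than the data can supply, every read is kept, which matches the idea that -X only matters when the input exceeds the requested depth.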

bitcometz commented 6 years ago

I get it now, thanks!

bitcometz commented 6 years ago

I also have two thoughts, and I am not sure whether they are right:

(1) For PacBio reads, the longest is not always the best, according to PacBio's own experience:

Filtering options for your input data for pre-assembly can also be set with the pa_fasta_filter_option flag. The default is streamed-internal-median which uses the median-length subread for each ZMW (sequencing reaction well). Choosing the longest subread can lead to an enrichment in chimeric molecules. Users will rarely need to change this option from the default.

streamed-internal-median: Applies the median-length ZMW filter only on internal subreads (ZMWs with >= 3 subreads) by running a single pass over the data. The input subreads should be grouped by ZMW. For ZMWs with < 3 subreads, the maximum-length one is selected.

https://github.com/PacificBiosciences/pb-assembly
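
The quoted filter can be sketched in a few lines of Python. This is only an illustration of the description above, not pb-assembly code. Two details are assumptions: "internal" is read as "every subread except the first and last of a ZMW", and for an even number of internal subreads the lower median is taken. The function name `pick_subread_per_zmw` and the sample data are hypothetical.

```python
from itertools import groupby


def pick_subread_per_zmw(subreads):
    """Yield one (zmw_id, length) per ZMW in a single pass.

    `subreads` is an iterable of (zmw_id, length) tuples that is already
    grouped by ZMW, as the quoted documentation requires.
    """
    for zmw, group in groupby(subreads, key=lambda rec: rec[0]):
        lengths = [length for _, length in group]
        if len(lengths) >= 3:
            # Assumption: internal subreads exclude the first and last one.
            internal = sorted(lengths[1:-1])
            # Assumption: take the lower median when the count is even.
            yield zmw, internal[(len(internal) - 1) // 2]
        else:
            # Fewer than 3 subreads: fall back to the longest one.
            yield zmw, max(lengths)


# Hypothetical input: z1 has 5 subreads, z2 has only 2.
subreads = [
    ("z1", 900), ("z1", 5000), ("z1", 5200), ("z1", 9800), ("z1", 700),
    ("z2", 3000), ("z2", 4000),
]
picked = list(pick_subread_per_zmw(subreads))
print(picked)
```

In the toy data, the 9800 bp subread of z1 is the longest but is not selected, which is the behaviour the quoted text describes as a guard against chimeric molecules.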

(2) Suppose some genome regions tend to fragment easily during DNA extraction. With sequencing prices falling rapidly, it is now routine to sequence a genome to more than 100x. In that case, could choosing only the longest 50x of reads reduce assembly coverage of those regions?

ruanjue commented 6 years ago

1) Thanks so much for the suggestion. I will try it.

2) -X can be specified on the command line; -X 100 will cope with it.

ruanjue commented 6 years ago

https://github.com/ruanjue/wtdbg2/commit/c807560311b459f81920fb4a77a5b322403f01e3

I have added an option, --rdcov-filter <0|1>, to choose between the longest and the median-length reads. Please give it a try!

bitcometz commented 6 years ago

Thanks! I will try it and evaluate the results.

bitcometz commented 6 years ago

Hello, here is a simple test on the Ath 7 Gbp data:

  1. --rdcov-filter 0 Estimated: TOT 127424000, CNT 516, AVG 246946, MAX 11241984, N50 6586368, L50 8, N90 196864, L90 51, Min 512
  2. --rdcov-filter 1 Estimated: TOT 127467008, CNT 524, AVG 243258, MAX 11241728, N50 6586368, L50 8, N90 212736, L90 51, Min 512

There is not much difference between the two test results.
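
To quantify "not much difference", the two summary lines can be parsed and compared field by field. The Python sketch below does this with the numbers reported above; the helper name `parse_stats` is hypothetical, and the script only restates the posted figures without rerunning any assembly.

```python
import re


def parse_stats(line):
    """Parse 'KEY value' pairs such as 'TOT 127424000, CNT 516' into a dict of ints."""
    return {key: int(value)
            for key, value in re.findall(r"([A-Za-z]\w*) (\d+)", line)}


# Summary lines copied from the two runs reported in the thread.
run0 = parse_stats("TOT 127424000, CNT 516, AVG 246946, MAX 11241984, "
                   "N50 6586368, L50 8, N90 196864, L90 51, Min 512")
run1 = parse_stats("TOT 127467008, CNT 524, AVG 243258, MAX 11241728, "
                   "N50 6586368, L50 8, N90 212736, L90 51, Min 512")

# Print the absolute and relative change for each metric (filter 1 vs filter 0).
for key in run0:
    diff = run1[key] - run0[key]
    pct = 100.0 * diff / run0[key]
    print(f"{key:>4}: {run0[key]:>10} -> {run1[key]:>10}  ({diff:+d}, {pct:+.2f}%)")
```

On these figures N50, L50, L90 and Min are identical between the two runs; only TOT, CNT, AVG, MAX and N90 change.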

Best,